Module 1 - 1
1. Business Understanding / Defining Research Goals:
proof of concept, deliverables, and a measure of success
2. Data Acquisition (also called Data Collection or Data Retrieval)
3. Data Preparation: (data cleansing, data
integration, and data transformation)
4. Exploratory Data Analysis (EDA) or Data
exploration
5. Data Modelling or Model building:
6. Presentation and automation:
7. Deployment & Maintenance:
Dr. Abhishek Bhatt, Faculty, Centre for AI, MITS-DU, Gwal
Facets of data:
• The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured data:
• Structured data is data that depends on a data model and
resides in a fixed field within a record.
• Structured data is easy to store in tables within databases or
Excel files and is typically queried using Structured Query Language (SQL).
Unstructured data:
• Unstructured data is data that isn’t easy to fit into a data model
because the content is context-specific or varying.
Ex: E-mail
• Email contains structured elements such as the sender, subject, and
body text. But it’s a challenge to find the number of people who
have written an email complaint about a specific employee,
because so many ways exist to refer to a person.
• There are also thousands of different languages and dialects to contend with.
Natural language:
• A human-written email is also a perfect example of natural
language data.
• Natural language is a special type of unstructured data;
• It’s challenging to process because it requires knowledge of
specific data science techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition,
summarization, text completion, and sentiment analysis.
• Human language is ambiguous in nature.
Machine-generated data:
• Machine-generated data is
information that’s automatically
created by a computer, process,
application, or other machines
without human intervention.
• Machine-generated data is
becoming a major data resource.
• Examples of machine data are web
server logs, call detail records,
network event logs, and telemetry.
Graph-based or network data:
• “Graph” in this case points to mathematical graph theory. In graph
theory, a graph is a mathematical structure to model pair-wise
relationships between objects.
• Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent
and store graphical data.
• Graph-based data is a natural way to represent social networks,
and its structure allows you to calculate the shortest path between
two people.
• Graph-based data can be found on many social media websites.
EX: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are
queried with specialized query languages such as SPARQL.
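The shortest-path idea mentioned above can be sketched with a breadth-first search over a small, hypothetical social network (all names and connections below are illustrative, not from the source):

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency-list graph.
    Returns the shortest list of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# A tiny hypothetical social network: node -> list of connections
network = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Alice", "Dave"],
    "Carol": ["Alice", "Dave"],
    "Dave": ["Bob", "Carol", "Eve"],
    "Eve": ["Dave"],
}

print(shortest_path(network, "Alice", "Eve"))  # ['Alice', 'Bob', 'Dave', 'Eve']
```

Real graph databases run such traversals at scale with dedicated query languages (e.g., SPARQL, as noted above), but the underlying idea is this adjacency walk.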
Audio, image, and video:
• Audio, image, and video are data types that pose specific challenges to a
data scientist.
• Recognizing objects in pictures turns out to be challenging for computers.
• Major League Baseball Advanced Media captures video at approximately 7
TB per game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums capture ball and athlete movements so that
statistics can be calculated in real time.
• DeepMind succeeded at creating an algorithm that’s capable of learning how
to play video games.
• This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
• Google – Artificial Intelligence Development plans
Streaming data:
• The data flows into the system when an event happens instead of
being loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or
music events, and the stock market.
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Summary
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
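As an illustration of the normalization task listed above, here is a minimal min-max rescaling sketch (the target range [0, 1] and the sample values are assumptions for illustration, and the input is assumed not to be constant):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max].
    Assumes values are not all identical (otherwise the span is zero)."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

marks = [20, 50, 80, 100]           # hypothetical attribute values
print(min_max_normalize(marks))     # [0.0, 0.375, 0.75, 1.0]
```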
Data Cleaning
• Data in the real world is raw and dirty: lots of potentially incorrect data, e.g., faulty
instruments, human or computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
• Remove duplicate or irrelevant observations (e.g., NewZealand vs.
New Zealand)
• Structural errors arise when you measure or transfer data and notice
strange naming conventions, typos, or inconsistent capitalization
(for example, you may find both "N/A" and "Not Applicable").
• Filter unwanted outliers
• Handle missing data
• If values were not recorded, it is often possible to fill them in by
inferring from the other values in the same column and row. This is
called imputation (e.g., when a value is missing for gender). Pandas'
fillna() function fills in missing values in a DataFrame.
• A value may be missing because it doesn't exist, e.g., the height of the
oldest child of someone who doesn't have any children; such values
should not be imputed.
• If it is not possible to figure out the reason a value is missing, that
particular record can be dropped (Pandas' dropna() function).
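The imputation and drop strategies above can be sketched in plain Python (pandas' fillna()/dropna() do the equivalent on DataFrames; the age column below is hypothetical):

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def drop_missing(column):
    """Drop missing entries entirely (analogous to dropna())."""
    return [v for v in column if v is not None]

ages = [25, None, 31, 28, None]     # hypothetical column with missing values
print(impute_mean(ages))            # [25, 28.0, 31, 28, 28.0]
print(drop_missing(ages))           # [25, 31, 28]
```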
Noisy Data:
• Random variance in the data is called noise. The following
methods, called smoothing techniques, are used to handle noisy
data:
• Binning Method:
• This method, also known as discretization, is used to smooth sorted
data values by consulting the values around them, i.e., the neighbouring
values. It is a local smoothing method, since it refers to the
neighbouring values.
• In this method, the entire data set is divided into equal-frequency segments called bins
or buckets.
• Smoothing by binning is done by one of the following methods:
• Smoothing by bin means
• Smoothing by bin medians
• Smoothing by bin boundaries
Example: Given data:18,22,6,6,9,14,20,21,12,18,18,16. Illustrate binning by mean,
median and boundary replacement. Given bin depth=3
Binning using Mean
• Step 1: Sort the data: 6,6,9,12,14,16,18,18,18,20,21,22
• Step 2: Partition the data into equal-frequency bins of size equal to the bin depth (n/d), where n = no. of
elements and d = bin depth;
in this case n/d = 12/3 = 4
Bin 1: 6, 6 , 9; Bin 2: 12,14, 16; Bin 3: 18,18,18; Bin 4: 20, 21, 22
• Step 3: Calculate Arithmetic mean : Bin 1= (6+6+9)/3=21/3=7; Bin 2= (12+14+16)/3= 42/3=14;
Bin 3=(18+18+18)/3= 54/3=18 Bin 4= (20+21+22)/3= 63/3=21
• Step 4: Replace each data element in each bin by the calculated mean
Bin 1: 7, 7, 7; Bin 2: 14,14,14; Bin 3: 18,18,18; Bin 4: 21, 21, 21
Binning using Median: In this method, Steps 1 and 2 are the same.
• Step 3: Calculate the median (50th percentile); with a bin depth of 3, the median is the 2nd observation in each bin
• Step 4: Replace each data element in each bin by the calculated median
Bin 1: 6, 6, 6; Bin 2: 14,14,14; Bin 3: 18,18,18; Bin 4: 21, 21, 21
Binning using Boundary Values: Each value is replaced by the closest bin boundary; the bin's minimum and maximum values serve as the boundaries.
Bin 1: 6, 6, 9; Bin 2: 12,12,16; Bin 3: 18,18,18; Bin 4: 20, 20, 22
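The worked example above can be reproduced with a short sketch (equal-frequency bins of the given depth; ties in the boundary method are broken toward the lower boundary, matching the figures above):

```python
def bin_smooth(data, depth, method="mean"):
    """Equal-frequency binning followed by smoothing (mean/median/boundary)."""
    data = sorted(data)
    bins = [data[i:i + depth] for i in range(0, len(data), depth)]
    out = []
    for b in bins:
        if method == "mean":
            out.append([sum(b) / len(b)] * len(b))
        elif method == "median":
            out.append([b[len(b) // 2]] * len(b))   # middle value of the sorted bin
        else:  # boundary: snap each value to the nearest bin edge (ties go low)
            lo, hi = b[0], b[-1]
            out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

data = [18, 22, 6, 6, 9, 14, 20, 21, 12, 18, 18, 16]
print(bin_smooth(data, 3, "mean"))      # [[7.0, 7.0, 7.0], [14.0, ...], [18.0, ...], [21.0, ...]]
print(bin_smooth(data, 3, "boundary"))  # [[6, 6, 9], [12, 12, 16], [18, 18, 18], [20, 20, 22]]
```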
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
The most common approaches to integrate data are:
• Data Consolidation
• Data Propagation
• Data Virtualization
• Data Warehousing
Data consolidation
• Means to consolidate data from several separate sources into one
data store.
• The most common data consolidation techniques are ETL
(Extract, Transform, Load), Data virtualization and Data
warehousing.
• ETL: The data is first extracted from multiple sources, then
transformed into a consistent format using operations such as sorting,
aggregating, and cleaning, and finally loaded into a centralized
store such as another database or a data warehouse.
• ETL comes in two flavours: real-time ETL, used in real-time systems,
and batch-processing ETL, used for high-volume databases.
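A minimal batch-ETL sketch of the extract → transform → load sequence described above, using assumed in-memory "sources" and "warehouse" (all record contents are hypothetical):

```python
# Two hypothetical source systems with messy, inconsistent records
source_a = [{"name": "alice ", "sales": "120"}, {"name": "Bob", "sales": "90"}]
source_b = [{"name": "carol", "sales": "150"}]

def extract():
    """Pull raw records from every source."""
    return source_a + source_b

def transform(records):
    """Clean (trim/capitalize names, cast sales to int), then sort."""
    cleaned = [{"name": r["name"].strip().title(), "sales": int(r["sales"])}
               for r in records]
    return sorted(cleaned, key=lambda r: r["sales"], reverse=True)

warehouse = []  # stand-in for the centralized store

def load(records):
    warehouse.extend(records)

load(transform(extract()))
print(warehouse[0])  # {'name': 'Carol', 'sales': 150}
```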
ETL Process
Data Virtualization
• In the modern era, enterprise data comes in many
forms and is stored in many locations. There is
both structured and unstructured data, including
rows and columns of data in a traditional
database, and data in formats like logs, email, and
social media content. Big Data in its many forms is
stored in databases, log files, CRM, SaaS, and
other apps.
• Data virtualization is a logical layer that amalgamates data from
various sources without performing an actual ETL process. It is an
abstraction such that only the required data is visible to the users,
without requiring technical details about the location or structure
of the data source. It provides enhanced data security.
But data virtualization isn’t an answer to every data analytics
requirement. Depending on the use case, a consolidated data
warehouse with ETL is sometimes a better solution, or even a
hybrid of both.
Data warehousing
Correlation Analysis (Χ² Test)
• Χ² (chi-square) test:

  Χ² = Σ (Observed − Expected)² / Expected
• The larger the Χ2 value, the more likely the variables are
related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
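The Χ² statistic above can be computed directly by summing (O − E)²/E over all contingency-table cells; the observed and expected counts below are purely illustrative:

```python
def chi_square(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical 2x2 contingency-table counts, flattened into lists
observed = [250, 200, 50, 1000]
expected = [90, 360, 210, 840]
print(round(chi_square(observed, expected), 2))  # 507.94
```

A large value like this (relative to the critical value for the table's degrees of freedom) indicates the two variables are likely related.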
Chi-Square Calculation: An Example