Module 1


Introduction to Data Science
UNIT-1
Dr. Abhishek Bhatt, Faculty, Centre for AI, MITS-DU, Gwal
Data Science……..
Data Science Vs Analysis Vs Software Delivery
Contrast: Scientific Computing
Contrast: Machine Learning
Contrast: Data Engineering
Data Science: Case Study
Cancer Research
• Cancer is an incredibly complex disease; a single tumor can have more than
100 billion cells, and each cell can acquire mutations individually. The
disease is always changing, evolving, and adapting.
• Employ the power of big data analytics and high-performance computing.
• Leverage sophisticated pattern-recognition and machine learning algorithms to identify
patterns that are potentially linked to cancer.
• This requires a huge amount of data processing and recognition.
Data Science: Case Study
Health Care
Data Science: Case Study
Internet of Things (IoT)
Data Science: Case Study
Customer Analytics
Topics
• Benefits
• Uses and facets of data
• Data Science Process:
  • overview
  • retrieving data
  • data preparation
  • exploratory data analysis
• Basic statistical descriptions of data
From Data to Data Science
• Enormous growth of data has been seen since 2010 — smartphones, wearables,
the Internet of Things, etc.
• Social media sites like Facebook, YouTube, Twitter, etc.
• The concept of Big Data: a term used for collections of large and
complex data sets.
• Until 2010, the main focus was on building frameworks and solutions to
store data, a problem successfully solved by Hadoop and other
frameworks.
• At present, it has become difficult to process this large and
complex data; hence Data Science involves using methods to analyze
massive amounts of data and extract knowledge from the raw data.
DATA GROWTH ISSUES AND CHALLENGES
• Big Data is characterized by six V's, namely:
• Volume: enormous size of data
• Variety: nature of data (structured, semi-structured, and unstructured)
• Velocity: high speed of accumulation of data
• Veracity: quality of data (inconsistencies and uncertainty)
• Variability: inconsistency in data that originates from different sources
• Value: data has no value unless you turn it into something useful
• Consequently, the challenges these characteristics bring are
being seen in data capture, curation, storage, search, sharing,
transfer, and visualization.
TOOLS FOR DATA SCIENCE

• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel, RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML Studio.
Data Science Process
Data Science Process
1. Business understanding / defining research goals: proof of concept, deliverables, and a measure of success.
2. Data acquisition (data collection or data retrieval).
3. Data preparation (data cleansing, data integration, data transformation).
4. Exploratory Data Analysis (EDA) or data exploration.
5. Data modelling or model building.
6. Presentation and automation.
7. Deployment and maintenance.
Facets of data:
• The main categories of data are these:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming
Structured data:
• Structured data is data that conforms to a data model and resides in fixed fields within a record.
• Structured data is easy to store in tables within databases or Excel files, and it can be queried with Structured Query Language (SQL).
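As a minimal sketch of the point above, the following stores structured records in a SQL table and queries them, using Python's built-in sqlite3 module; the employees table and its rows are invented for illustration.

```python
import sqlite3

# Structured data fits a fixed schema: every record has the same fields.
con = sqlite3.connect(":memory:")  # in-memory SQLite database
con.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [("Alice", "AI", 50000), ("Bob", "Data", 62000)])

# Because the fields are fixed, SQL can query them directly.
rows = con.execute("SELECT name FROM employees WHERE salary > 55000").fetchall()
# rows == [("Bob",)]
```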
Unstructured data:
• Unstructured data is data that isn't easy to fit into a data model;
the content is context-specific or varying.
Ex: e-mail
• An email contains structured elements such as the sender, title, and
body text. But it's a challenge to find the number of people who
have written an email complaint about a specific employee,
because so many ways exist to refer to a person.
• There are also thousands of different languages and dialects.
Natural language:
• A human-written email is also a perfect example of natural
language data.
• Natural language is a special type of unstructured data;
• It’s challenging to process because it requires knowledge of
specific data science techniques and linguistics.
• Topics in NLP: entity recognition, topic recognition,
summarization, text completion, and sentiment analysis.
• Human language is ambiguous in nature.
Machine-generated data:

• Machine-generated data is
information that’s automatically
created by a computer, process,
application, or other machines
without human intervention.
• Machine-generated data is
becoming a major data resource.
• Examples of machine data are web
server logs, call detail records,
network event logs, and telemetry.
Graph-based or
network data:
• “Graph” in this case points to mathematical graph theory. In graph
theory, a graph is a mathematical structure to model pair-wise
relationships between objects.
• Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects.
• The graph structures use nodes, edges, and properties to represent
and store graphical data.
• Graph-based data is a natural way to represent social networks,
and its structure allows you to calculate the shortest path between
two people.
• Graph-based data can be found on many social media websites.
EX: LinkedIn, Twitter, movie interests on Netflix
• Graph databases are used to store graph-based data and are
queried with specialized query languages such as SPARQL.
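The shortest-path point above can be illustrated with a small sketch; the social network below is invented for the example, and plain breadth-first search stands in for what a graph database would do internally.

```python
from collections import deque

# A toy social network as an adjacency list: node -> list of neighbours.
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dave"],
    "Cara": ["Ann", "Dave"],
    "Dave": ["Bob", "Cara", "Eve"],
    "Eve": ["Dave"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the shortest chain of people linking
    start to goal, or None if they are not connected."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for friend in graph[path[-1]]:
            if friend not in seen:
                seen.add(friend)
                queue.append(path + [friend])
    return None

print(shortest_path(friends, "Ann", "Eve"))  # ['Ann', 'Bob', 'Dave', 'Eve']
```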
Audio, image, and video:
• Audio, image, and video are data types that pose specific challenges to a
data scientist.
• Recognizing objects in pictures turns out to be challenging for computers.
• Major League Baseball Advanced Media uses video capture of approximately 7
TB per game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums capture ball and athlete movements for
real-time calculations.
• DeepMind succeeded at creating an algorithm that’s capable of learning how
to play video games.
• This algorithm takes the video screen as input and learns to interpret
everything via a complex process of deep learning.
• Google – Artificial Intelligence Development plans
Streaming data:
• The data flows into the system when an event happens instead of
being loaded into a data store in a batch.
• Examples are the “What’s trending” on Twitter, live sporting or
music events, and the stock market.
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing

• Data Cleaning

• Data Integration

• Data Reduction

• Data Transformation and Data Discretization

• Summary
Data Quality: Why Preprocess the Data?

• Measures for data quality: A multidimensional view


• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: updated in a timely manner?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Cleaning
• Data in the real world is raw: lots of potentially incorrect data, e.g., from faulty instruments,
human or computer error, or transmission errors
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?

• Remove duplicate or irrelevant observations (like "NewZealand" and
"New Zealand").
• Fix structural errors: strange naming conventions, typos, or incorrect
capitalization introduced when you measure or transfer data
(for example, you may find both "N/A" and "Not Applicable").
• Filter unwanted outliers.
• Handle missing data:
  • If the values were not recorded, it becomes essential to fill them in by
guessing based on the other values in that column and row. This is
called imputation (e.g., if a value is missing for gender). Pandas' fillna()
function fills in missing values in a data frame.
  • A value may also be missing because it doesn't exist, e.g., the height of the
oldest child of someone who doesn't have any children.
  • If it is not possible to figure out the reason for the missing value, that
particular record can be dropped (Pandas' dropna() function).
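A minimal Pandas sketch of the fillna()/dropna() options mentioned above; the DataFrame and the choice of mode/mean imputation are illustrative assumptions.

```python
import pandas as pd

# Toy data frame with missing values (hypothetical example).
df = pd.DataFrame({
    "gender": ["M", None, "F", "M"],
    "salary": [50000, 62000, None, 58000],
})

# Imputation: fill a missing categorical value with the column's mode,
# and a missing numeric value with the column's mean.
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Alternatively, drop any row that still contains a missing value.
clean = df.dropna()
```

After imputation no missing values remain, so dropna() here keeps all four rows; on the raw data it would have dropped two.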
Noisy Data:
• Random variance in the data is called noise. The following
methods, called smoothing techniques, are used to handle noisy data:
• Binning Method:
  • This method, also known as discretization, smooths sorted data values
by consulting the values around them, i.e., the neighbouring values.
It is a local smoothing method, since it refers to the neighbouring values.
  • In this method, the entire data is divided into equal segments called bins
or buckets.
  • Smoothing by binning is done by one of the following methods:
    • Smoothing by bin means
    • Smoothing by bin medians
    • Smoothing by bin boundaries
Example: Given data: 18, 22, 6, 6, 9, 14, 20, 21, 12, 18, 18, 16. Illustrate binning by mean,
median, and boundary replacement. Given bin depth = 3.
Binning using Mean
• Step 1: Sort the data: 6, 6, 9, 12, 14, 16, 18, 18, 18, 20, 21, 22
• Step 2: Partition the data into equal-frequency bins; the number of bins is n/d, where n = number of
elements and d = bin depth. In this case n/d = 12/3 = 4.
Bin 1: 6, 6, 9; Bin 2: 12, 14, 16; Bin 3: 18, 18, 18; Bin 4: 20, 21, 22
• Step 3: Calculate the arithmetic mean: Bin 1 = (6+6+9)/3 = 21/3 = 7; Bin 2 = (12+14+16)/3 = 42/3 = 14;
Bin 3 = (18+18+18)/3 = 54/3 = 18; Bin 4 = (20+21+22)/3 = 63/3 = 21
• Step 4: Replace each data element in each bin by the calculated mean
Bin 1: 7, 7, 7; Bin 2: 14, 14, 14; Bin 3: 18, 18, 18; Bin 4: 21, 21, 21
Binning using Median: Steps 1 and 2 are the same.
• Step 3: Calculate the median (50th percentile): the median is the middle (2nd) observation in each bin
• Step 4: Replace each data element in each bin by the calculated median
Bin 1: 6, 6, 6; Bin 2: 14, 14, 14; Bin 3: 18, 18, 18; Bin 4: 21, 21, 21
Binning using Boundary Values: Here we keep the minimum and maximum values of each bin and replace
every other value by the closer boundary.
Bin 1: 6, 6, 9; Bin 2: 12, 12, 16; Bin 3: 18, 18, 18; Bin 4: 20, 20, 22
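The worked example above can be reproduced in a few lines of Python; this is a sketch of the three smoothing rules, not a library routine.

```python
# Binning-based smoothing (mean, median, boundary) on the slide's data.
data = [18, 22, 6, 6, 9, 14, 20, 21, 12, 18, 18, 16]
depth = 3

# Steps 1-2: sort, then split into equal-frequency bins of size `depth`.
values = sorted(data)
bins = [values[i:i + depth] for i in range(0, len(values), depth)]

# Smoothing by bin means: replace every value with its bin's mean.
by_mean = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin medians: replace every value with the middle observation.
by_median = [[sorted(b)[len(b) // 2] for _ in b] for b in bins]

# Smoothing by bin boundaries: replace each value with the closer of
# the bin's minimum and maximum.
by_boundary = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
               for b in bins]

print(by_mean)      # [[7, 7, 7], [14, 14, 14], [18, 18, 18], [21, 21, 21]]
print(by_median)    # [[6, 6, 6], [14, 14, 14], [18, 18, 18], [21, 21, 21]]
print(by_boundary)  # [[6, 6, 9], [12, 12, 16], [18, 18, 18], [20, 20, 22]]
```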
Data Integration
Data Integration
• Data integration: combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate metadata from different sources
• Entity identification problem:
  • Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different
  • Possible reasons: different representations, different scales, e.g., metric vs. British units
The most common approaches to integrate
data are:
• Data Consolidation
• Data Propagation
• Data Virtualization
• Data Warehousing
Data consolidation
• Means to consolidate data from several separate sources into one
data store.
• The most common data consolidation techniques are ETL
(Extract, Transform, Load), data virtualization, and data warehousing.
  • ETL: The data is first extracted from multiple sources, then transformed
into an understandable format using functions like sorting, aggregating,
and cleaning, and then transferred to a centralized store such as another
database or a data warehouse.
  • ETL comes in two types: real-time ETL, used in real-time systems, and
batch-processing ETL, used for high-volume databases.
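A toy sketch of the ETL flow just described; the two sources, the cleaning steps, and the in-memory "warehouse" are all invented for illustration.

```python
# Two hypothetical sources with inconsistent formats.
source_a = [{"name": " Alice ", "salary": "50000"}]
source_b = [{"NAME": "BOB", "SALARY": 62000}]

def extract():
    # Extract: pull raw records from each source into a common shape.
    for rec in source_a:
        yield {"name": rec["name"], "salary": rec["salary"]}
    for rec in source_b:
        yield {"name": rec["NAME"], "salary": rec["SALARY"]}

def transform(records):
    # Transform: clean and standardize (trim whitespace, fix case, cast types).
    for rec in records:
        yield {"name": rec["name"].strip().title(),
               "salary": int(rec["salary"])}

# Load: write the cleaned records to a centralized store
# (a list here, standing in for a database or data warehouse).
warehouse = []
warehouse.extend(transform(extract()))
```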
ETL Process
Data Virtualization
• In the modern era, enterprise data comes in many forms and is stored in
many locations. There is both structured and unstructured data, including
rows and columns of data in a traditional database, and data in formats
like logs, email, and social media content. Big Data in its many forms is
stored in databases, log files, CRM, SaaS, and other apps.
• Data virtualization is a logical layer that amalgamates data from various
sources without performing an actual ETL process. It is an abstraction
such that only the required data is visible to the users, without requiring
technical details about the location or structure of the data source.
It provides enhanced data security.
But data virtualization isn't an answer to every data analytics
requirement. Depending on the use case, sometimes a
consolidated data warehouse with ETL is a better solution, or
even a hybrid of both.
Data Warehousing
Data Warehousing
• Data Warehousing is the integration of data from multiple sources into a
centralized source to facilitate decision making, reporting, and query handling.
• A centralized source of data enables better decision making.
Data Propagation (Data Replication)
• It involves copying data from one location i.e., source to another location i.e.,
target location. It is event driven. These applications usually operate online and
push data to the target location.
• They are mainly useful for real-time data movement such as workload
balancing, backup, and recovery. Data propagation can be done asynchronously
or synchronously.
• The two methods for data propagation are:
  • Enterprise Application Integration (EAI) and
  • Enterprise Data Replication (EDR).
• The key advantage of data propagation is that it can be used for real-time /
near-real-time data movement, as well as for workload balancing, backup,
and recovery.
  • EAI is used mainly for the exchange of messages and transactions in real-time
business transaction processing, whereas Enterprise Data Replication is used to
transfer voluminous data between databases.
• For example, you may have a cloud-based application that needs to
access customer data stored in an on-premise database. Data
propagation allows you to replicate the data from the on-premise
system to the cloud, allowing your application to access it without
having to query the original source directly.

In the example here, you can see how Cleo's EAI system connects
external-facing applications like Amazon Vendor Portal, Shopify,
and Magento, along with EDI trading partners like Walmart and
Target, to the back-end ERP system, which is Acumatica.
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
  • Object identification: the same attribute or object may
have different names in different databases
  • Derivable data: one attribute may be a "derived"
attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
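For numeric attributes, correlation analysis can flag likely-redundant pairs; below is a minimal Pearson-correlation sketch on two made-up columns (nominal attributes would use the chi-square test of the next slide instead).

```python
# Two toy attribute columns; y is roughly 2*x, so it is likely redundant.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.0, 8.2, 9.8]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson correlation coefficient r = cov(x, y) / (std(x) * std(y)).
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
r = cov / (sx * sy)

# |r| close to 1 marks the pair as a candidate redundancy to review.
```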
Correlation Analysis (Nominal Data)

• Χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected

• The larger the Χ2 value, the more likely the variables are
related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population

Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

• Χ² (chi-square) calculation (numbers in parentheses are expected counts,
calculated based on the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• It shows that like_science_fiction and play_chess are
correlated in the group
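The same statistic can be recomputed in plain Python, deriving each expected count from the row and column totals exactly as in the table above.

```python
# Observed counts from the contingency table on this slide.
observed = [[250, 200],   # like science fiction
            [50, 1000]]   # not like science fiction

row_sums = [sum(row) for row in observed]        # [450, 1050]
col_sums = [sum(col) for col in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                            # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence: row_total * col_total / grand_total
        expected = row_sums[i] * col_sums[j] / total  # e.g., 450*300/1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))  # 507.94 (the slide rounds this to 507.93)
```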
Data Transformation
• After the data has been acquired, it is cleaned as discussed above. The clean
data may not be in a standard format. The process of changing the structure
and format of data to make it more usable is called data transformation.
• Data transformation may be constructive, destructive, aesthetic or structural.
• Constructive Data Transformation involves adding, copying, and replicating data.
• Destructive Data Transformation involves deleting fields and records.
• Aesthetic Data Transformation involves standardizing salutations or street names.
• Structural Data Transformation involves renaming, moving, and combining columns in
a database.
• Data transformation may require smoothing, aggregation, discretization,
attribute construction, generalization, and normalization to make data
manageable.
• Scripting languages like Python or domain-specific languages like SQL are
usually used for data transformation.
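Normalization, mentioned above, can be sketched in a few lines; the snippet below applies min-max scaling and z-score standardization to a made-up salary column.

```python
# A toy attribute column to normalize.
salaries = [30000, 45000, 60000, 90000]

# Min-max scaling: map the values linearly onto [0, 1].
lo, hi = min(salaries), max(salaries)
minmax = [(v - lo) / (hi - lo) for v in salaries]

# Z-score standardization: subtract the mean, divide by the std deviation,
# giving values centered at 0 with unit spread.
mean = sum(salaries) / len(salaries)
std = (sum((v - mean) ** 2 for v in salaries) / len(salaries)) ** 0.5
zscore = [(v - mean) / std for v in salaries]

print(minmax)  # [0.0, 0.25, 0.5, 1.0]
```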
Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the same)
analytical results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression
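As a sketch of one dimensionality-reduction strategy listed above, the following projects toy 2-D data onto its strongest principal component (PCA) using NumPy; the data set is synthetic.

```python
import numpy as np

# Synthetic 2-D data where the second column is almost 2x the first,
# so nearly all the variance lies along a single direction.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

# PCA: center the data, then take eigenvectors of the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep only the strongest component: 2-D points reduced to 1-D scores.
top = eigvecs[:, np.argmax(eigvals)]
reduced = centered @ top
```

Here the top component captures almost all of the variance, so the 1-D representation loses very little analytical information, which is exactly the goal of data reduction.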
