Intro To Data Science
8/27/2022 Dr S SHANTHI,ASP,CSE,KEC
What do you infer from this?
Unit 1
• Introduction
• Data Science
• How Data Science Relates to Other Fields
• The Relationship between Data Science and Information Science
• Computational Thinking
• Issues of Ethics, Bias, and Privacy in Data Science
• Data Types
• Data Collections
• Data Pre-processing Techniques
• Data Analysis & Analytics
• Descriptive Analysis
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
• Exploratory Analysis
• Mechanistic Analysis
Data Science
• Frank Lo, the Director of Data Science at
Wayfair: “Data science is a multidisciplinary
blend of data inference, algorithm
development, and technology in order to
solve analytically complex problems.”
• Data science can also be defined as a field of study and practice
that involves the collection, storage, and
processing of data in order to derive
important insights into a problem or a
phenomenon.
Data Science Process
DS Architecture
DS Life Cycle
Data Science
• data may be generated by humans (surveys,
logs, etc.) or machines (weather data, road
vision, etc.), and
• could be in different formats (text, audio,
video, augmented or virtual reality, etc.).
• Why is data science so important now?
“3V model”
3V model / Big Data (Volume, Velocity, Variety)
• Velocity: the speed at which data is accumulated.
Data Science
Sources for exponential growth of data
1. social media activity,
2. mobile interactions,
3. server logs,
4. real-time market feeds,
5. customer service records,
6. transaction details, and
7. information from existing databases
These combine to create a rich and complex conglomeration of
information.
Data Science
Where Do We See Data Science?
• Finance
• Public Policy: gain insights into citizen behaviours that
affect the quality of public life, including traffic, public
transportation, social welfare, community wellbeing, etc.
• Politics
• Healthcare
• Urban Planning
• Education
• Libraries - Online Public Access Catalogues (OPACs)
Data Science
What do financial data scientists do?
• Through capturing and analyzing new sources of
data, building predictive models and running real-
time simulations of market events, they help the
finance industry obtain the information necessary to
make accurate predictions
• Banks and other loan-sanctioning institutions can
minimize the chance of loan defaults by using
information such as customer profiles, past
expenditures, and other essential variables to
analyze the probabilities of risk and default
How Does Data Science Relate to Other Fields?
The Relationship between Data Science & Information Science
Computational Thinking
Computational thinking is using abstraction and
decomposition when attacking a large complex task
or designing a large complex system
Computational Thinking - Example
• https://www.youtube.com/watch?v=qbnTZCj0ugI
Computational Thinking - Example
Find the sum of all numbers between 1 and 100
• Decompose
• Identify the patterns or trends within a
problem
• Identify specific similarities and differences
among similar problems to work towards the
solution
• Algorithm
Computational Thinking - Example
• Find the sum of all numbers between 1 and 200
Decompose
200+1 = 201
199+2 = 201
198+3 = 201 …
101+100 = 201
Identify the Pattern
Similarity: each pair sums to last no. + 1 = 200+1 = 201 (sum of the pair)
Difference => last no. – first no.
No. of pairs = 200/2 = 100
Sum of all numbers = sum of the pair * no. of pairs
= (200+1)*(200/2) = 20100
https://www.youtube.com/watch?v=qbnTZCj0ugI
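The pairing argument above can be written as a short algorithm. This is a minimal sketch, assuming an even upper limit n so that the numbers split cleanly into pairs; the function name pairwise_sum is illustrative and not from the slides.

```python
def pairwise_sum(n: int) -> int:
    """Sum 1..n by pairing the first number with the last, the second with the
    second-last, and so on (sketch: assumes n is a positive even number)."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("this sketch assumes a positive even n")
    pair_sum = n + 1    # pattern: every pair (k, n + 1 - k) adds up to n + 1
    num_pairs = n // 2  # decomposition: n numbers form n/2 pairs
    return pair_sum * num_pairs


total_200 = pairwise_sum(200)
total_100 = pairwise_sum(100)
print(total_200)  # 20100
print(total_100)  # 5050
```

The same function answers the 1 to 100 version of the problem, which is the point of identifying the pattern before writing the algorithm.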
Data Science
Skills for Data Science
1. willingness to experiment,
2. proficiency in mathematical reasoning,
and
3. data literacy
Tools for Data Science
• Python, R, and SQL
• C, Java, PHP
• MATLAB....
Data Science
Issues of Ethics, Bias, and Privacy in Data Science
• How, where, and why was the data collected?
Who collected it?
• What did they intend to use it for?
• If the data was collected from people, did
these people know?
• E.g., Facebook and Google have collected enormous
amounts of data about and from their users in order
not only to improve and market their products, but
also to share and/or sell it to other entities for profit
Data Collections
• Data Types
– structured data
– unstructured data
• Challenges with Unstructured Data
• Data Collections
1. Open Data
• freely available in the public domain
• without restrictions from copyright or patents
• e.g., UCI Machine Learning Repository
Data Collections
Principles associated with open data
Public
Accessible
Described
Reusable
Complete
Timely
Managed Post-Release
Data Collections
2. Social Media Data
• An Application Programming Interface (API) is used
to collect data from social media
• Social media data is analyzed for research or
marketing purposes
3. Multimodal Data
– IoT
• Healthcare Applications
• Agriculture
• Industry
Data Collections
Data Storage and Presentation
• comma-separated values (CSV)
• tab-separated values (TSV)
• XML (eXtensible Markup Language)
• RSS (Really Simple Syndication)
– Information provided by a website in an XML file in such a
way is called an RSS feed.
– Since RSS data is small and fast loading, it can easily be
used with services such as mobile phones, personal digital
assistants (PDAs), and smart watches.
– RSS is useful for websites that are updated frequently
Data Collections
Data Storage and Presentation
• JSON (JavaScript Object Notation)
– Built on key-value pairs; in various languages, this is
realized as an object, record, structure, dictionary,
hash table, keyed list, or associative array.
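To make the storage formats concrete, the sketch below reads the same two records from CSV, TSV, and JSON text using only the Python standard library. The records, field names, and the helper read_delimited are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in three formats.
csv_text = "name,age\nAsha,21\nRavi,23\n"
tsv_text = "name\tage\nAsha\t21\nRavi\t23\n"
json_text = '[{"name": "Asha", "age": "21"}, {"name": "Ravi", "age": "23"}]'


def read_delimited(text, delimiter):
    """Parse delimiter-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")    # comma-separated values
tsv_rows = read_delimited(tsv_text, "\t")   # tab-separated values
json_rows = json.loads(json_text)           # JSON array of key-value objects

print(csv_rows[0]["name"], tsv_rows[1]["age"], json_rows[0]["age"])
```

Note that CSV and TSV carry every value as text, so the ages are strings here; JSON can also store numbers directly, but strings are used in this sample so the three results compare equal.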
Data Pre-processing
What makes data “dirty”?
• Incomplete
– lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• Noisy
– e.g., Salary=“−10” (an error)
• Inconsistent
– containing discrepancies in codes or names, e.g.,
– Age=“42”, Birthday=“03/07/2018”
– Grade was “S,A,B,C,D,E,RA”, now rating is “O,A+,A,B+,B,RA”
– discrepancy between duplicate records
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Pre-processing
Data Cleaning
Data Integration
Data Transformation
Data Pre-processing
Reduction in number of Columns (Attributes) and
No. of rows (instances)
Data Reduction
Data Pre-processing
Data Munging
“Add two diced tomatoes, three cloves of garlic,
and a pinch of salt in the mix.”
• Munging is done either manually, automatically,
or, in many cases, semi-automatically
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as Bayesian
formula or decision tree
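The automatic fill-in strategies above can be sketched as follows. The records, the income attribute, and the function names are hypothetical; the inference-based option (Bayesian formula or decision tree) is not shown.

```python
from statistics import mean

# Hypothetical records: (class label, income); None marks a missing value.
records = [
    ("yes", 50_000), ("yes", None), ("yes", 70_000),
    ("no", 20_000), ("no", 30_000), ("no", None),
]


def fill_with_constant(rows, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [(label, constant if v is None else v) for label, v in rows]


def fill_with_mean(rows):
    """Replace missing values with the mean of all known values of the attribute."""
    overall = mean(v for _, v in rows if v is not None)
    return [(label, overall if v is None else v) for label, v in rows]


def fill_with_class_mean(rows):
    """Replace missing values with the mean of known values in the same class.
    Sketch only: assumes every class has at least one known value."""
    by_class = {}
    for label, v in rows:
        if v is not None:
            by_class.setdefault(label, []).append(v)
    class_means = {label: mean(vals) for label, vals in by_class.items()}
    return [(label, class_means[label] if v is None else v) for label, v in rows]


print(fill_with_constant(records)[1])    # ('yes', 'unknown')
print(fill_with_mean(records)[1])        # ('yes', 42500)
print(fill_with_class_mean(records)[1])  # ('yes', 60000)
```

The class-based mean is the "smarter" choice here because the two classes have very different typical incomes, so the overall mean fits neither class well.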
How to Handle Noisy Data?
• Binning
– first sort data and partition into
(equal-frequency) bins
– then one can smooth by bin
means, smooth by bin median,
smooth by bin boundaries, etc.
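A minimal sketch of the binning steps above: sort, partition into equal-frequency bins, then smooth. The price values and function names are illustrative, and the tie rule in smooth_by_boundaries (ties go to the lower boundary) is an assumption.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the data and cut it into consecutive bins of bin_size values each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # illustrative data
bins = equal_frequency_bins(prices, 4)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```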
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales,
e.g., metric vs. British units
• Address redundant data in data integration
Data Transformation
• A function that maps the entire set of values of a given attribute to a new
set of replacement values s.t. each old value can be identified with one of
the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Generalization : Concept hierarchy climbing
Normalization
Types
• Min-max normalization
• Z-score normalization
• Normalization by decimal scaling
Min-max normalization: to [new_minA, new_maxA]
\[ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Normalization
• Z-score normalization (μ: mean, σ: standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
– Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
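The two worked examples can be checked with a few lines of code. This is a sketch: the function names are illustrative, and the decimal-scaling rule (divide by the smallest power of 10 that brings every absolute value below 1) is the usual textbook definition, assumed here because the slides list the method without a formula.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer giving max |v'| < 1
    (assumed definition, see the note above)."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


income_min_max = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)
scaled = decimal_scaling([-986, 917])  # illustrative values
print(round(income_min_max, 3))  # 0.716
print(round(income_z, 3))        # 1.225
print(scaled)                    # [-0.986, 0.917]
```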
Data Reduction
• Data Cube Aggregation
– Data cubes: They are multidimensional sets of data that can
be stored in a spreadsheet. A data cube could be in two,
three, or higher dimensions. Each dimension typically
represents an attribute of interest.
• Dimensionality Reduction
Data Discretization
Divide the range of a continuous attribute into intervals
1. Marks converted to Grade
2. Age mapped to => Young, Adult
3. Range of temperature values => cold, moderate, and hot
Types of attributes
• Categorical variables
– Nominal: Values from an unordered set (Colors, Blood Groups,
Gender)
– Ordinal: Values from an ordered set (academic rank, Customer
satisfaction [Excellent, good...] )
• Continuous: Real numbers
• Ratio-scaled: e.g., the ratio of males to females in a class, 3:4
Data Pre-processing
Data Cleaning
1. Smooth Noisy Data
2. Handling Missing
Data
Data Pre-processing
Data Pre-processing - Data Discretization
Discretize the wine consumption per capita into four categories
1. less than or equal to 1.00 per capita (represented by 0),
2. more than 1.00 but less than or equal to 2.00 per capita (1),
3. more than 2.00 but less than or equal to 5.00 per capita (2), and
4. more than 5.00 per capita (3).
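The four wine-consumption categories translate directly into a small function. A minimal sketch; the function name and the sample values are illustrative.

```python
def discretize_wine(consumption: float) -> int:
    """Map wine consumption per capita to one of the four category codes 0..3."""
    if consumption <= 1.00:
        return 0  # less than or equal to 1.00
    if consumption <= 2.00:
        return 1  # more than 1.00, up to 2.00
    if consumption <= 5.00:
        return 2  # more than 2.00, up to 5.00
    return 3      # more than 5.00


sample = [0.5, 1.0, 1.5, 2.0, 4.2, 5.0, 9.1]  # hypothetical values
codes = [discretize_wine(x) for x in sample]
print(codes)  # [0, 0, 1, 1, 2, 2, 3]
```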
Data Analysis and Data Analytics
Analysis is the detailed examination of the elements or
structure of something.
“Analytics” is the systematic computational analysis of
data or statistics.
Data Analysis helps in understanding the data and
provides required insights from the past to
understand what happened so far
Data Analytics is the process of exploring the data from
the past to make appropriate decisions in the future
by using valuable insights
Data Analysis & Data Analytics
Descriptive Analysis => reveal what happened in
the past
– Typically, it is the first kind of data analysis
performed on a dataset.
– Usually it is applied to large volumes of data, such
as census data.
– Description and interpretation processes are
different steps.
• E.g., categorizing customers by their likely product preferences
and purchasing patterns
• In a social media marketing campaign, descriptive analytics is used to
assess the number of posts, mentions, followers, fans, page
views, reviews, or pins
Data Analysis & Data Analytics
Descriptive Analysis
Type of Variable => categorical variable, Ordinal,
continuous variable, ratio ....
• Independent variable/ Predictor variable,
• Dependent variable / Outcome var/ Decision
var/ class label/ Target class
Dataset
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Data Analysis & Data Analytics
Frequency Distribution
Histogram
Pie Chart
Distribution of Data
Normal & Skewed Distribution
Data Analysis & Data Analytics
Skewed Distribution
Cricket Score
Exam Results – online vs offline; Lab v Theory
Average Income distribution
Human Life cycle
Taxation Regimes
Record of Long Jumps at a Competition
Data Analysis & Data Analytics
Measures of Centrality
• Median, mean and mode of
symmetric, positively and negatively
skewed data
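How the three measures order themselves under skew can be seen on small samples. The three data sets below are invented for illustration; only the standard library statistics module is used.

```python
from statistics import mean, median, mode

# Hypothetical samples chosen to show the three shapes.
symmetric = [1, 2, 3, 3, 3, 4, 5]
positively_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]         # long tail to the right
negatively_skewed = [1, 6, 10, 11, 12, 12, 13, 13, 13, 14]  # long tail to the left


def centrality(values):
    """Return (mean, median, mode) of a sample."""
    return mean(values), median(values), mode(values)


sym = centrality(symmetric)
pos = centrality(positively_skewed)
neg = centrality(negatively_skewed)
print(sym)  # mean, median and mode coincide
print(pos)  # mean > median > mode
print(neg)  # mean < median < mode
```

The mean is pulled toward the long tail, which is why the median is often preferred for skewed data such as income.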
Data Analysis & Data Analytics
Dispersion of a Distribution
• Range => largest score - smallest score.
Disadvantage: because it uses only the highest and lowest
values, extreme scores or outliers tend to result in an
inaccurate picture of the more likely range.
• Interquartile range is defined as the difference
between the 75th and 25th percentiles
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Boxplot Analysis
• Five-number summary of a distribution
– Minimum, Q1, Median, Q3, Maximum
3,40,41,45,40,60,61,62,63,65,70,99
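The five-number summary of the values listed above can be computed as sketched below. Quartile conventions differ between textbooks and tools; this sketch assumes the "median of each half" convention, and the 1.5 × IQR fence for flagging outliers is a common rule of thumb that the slides do not state.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum), taking Q1 and Q3 as the
    medians of the lower and upper halves (one of several conventions)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2:]   # values above the median position
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


data = [3, 40, 41, 45, 40, 60, 61, 62, 63, 65, 70, 99]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1

# Assumed rule of thumb: points beyond 1.5 * IQR from the quartiles are outliers.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low_fence or v > high_fence]

print(minimum, q1, med, q3, maximum)  # 3 40.5 60.5 64.0 99
print(iqr, outliers)                  # 23.5 [3]
```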
Data Analysis & Data Analytics
• Variance and standard deviation (sample: s, population: σ)
– Variance: (algebraic, scalable computation)
– variance of a population (σ²)
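The formulas on this slide did not survive extraction, so the sketch below uses the standard definitions: the population variance σ² averages the squared deviations from the mean over N, and the sample variance s² divides by n - 1 instead. The scores are invented for illustration.

```python
from math import sqrt


def population_variance(values):
    """sigma^2: mean squared deviation from the population mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
pop_var = population_variance(scores)
samp_var = sample_variance(scores)
print(pop_var, sqrt(pop_var))  # 4.0 2.0
print(round(samp_var, 3))      # 4.571
```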
Data Analysis & Data Analytics
Diagnostic Analytics
• used for discovery, or to determine why something
happened, e.g., “rain” vs “umbrella”
• Correlations - statistical analysis that is used to
measure and describe the strength and direction of
the relationship between two variables.
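Strength and direction can be measured with the Pearson correlation coefficient r, which ranges from -1 to +1. The sketch below assumes Pearson's r is the intended measure (the slides do not name one), and the rainfall and umbrella figures are invented to echo the example above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical daily observations.
rainfall_mm = [0, 2, 5, 10, 20]
umbrellas_sold = [1, 3, 6, 12, 22]

r_rain = pearson_r(rainfall_mm, umbrellas_sold)
r_negative = pearson_r([1, 2, 3], [6, 4, 2])
print(round(r_rain, 3))      # close to +1: strong positive relationship
print(round(r_negative, 3))  # -1.0: perfect negative relationship
```

A value near zero would indicate uncorrelated data; a high r still says nothing about which variable causes the other.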
Positively and Negatively Correlated Data
Uncorrelated Data
Data Analysis & Data Analytics
Diagnostic Analytics
Data Analysis & Data Analytics
Find Correlation between the attributes
Data Analysis & Data Analytics
Predictive Analytics
• understanding the future using the data and the
trends we have seen in the past
• no statistical algorithm can “predict” the future with
100% certainty because the foundation of predictive
analytics is based on probability
• predictive analytics software : SAS, IBM predictive
analytics, RapidMiner ....
[Figure: classification algorithms learn a classifier from training data; the classifier is checked on testing data and then applied to unseen data, e.g., (Jeff, Professor, 4) => Tenured?]
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
age   income  student  credit_rating  buys_computer
<=30  high    no       fair           no
Data Analysis & Data Analytics
Prescriptive Analytics
• analyzes potential decisions, the interactions between
decisions, the influences that bear upon these decisions, and
the bearing all of this has on an outcome to ultimately
prescribe an optimal course of action
• the process of using current and historical data to identify
trends and relationships.
• Techniques include optimization, simulation, game
theory, and decision-analysis methods
• [Gartner] 13% of organizations are using predictive
analytics, but only 3% are using prescriptive
analytics.
Data Analysis & Data Analytics
• Exploratory analysis is an approach to
analyzing datasets to find previously unknown
relationships.
• involves using various data visualization
approaches.
• exploratory analysis is about the methodology
or philosophy of doing the analysis, rather
than a specific technique
Data Analysis & Data Analytics
Mechanistic Analysis
• understanding the exact changes in variables that
lead to changes in other variables for individual
objects (studying a relationship between two
variables)
• Regression => process for estimating the
relationships among variables
• Correlation vs Regression
• Correlation by itself does not provide any indication
of how one variable can be predicted from another;
regression does.
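To illustrate how regression lets one variable be predicted from another, here is a minimal sketch of simple linear regression fitted by ordinary least squares. Least squares is an assumed choice (the slides do not name a fitting method), and the data points are invented.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x from the fitted line."""
    return slope * x + intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)                # 2.0 1.0
print(predict(6, slope, intercept))    # 13.0
```

The correlation between xs and ys would only report that the relationship is strong and positive; the fitted slope and intercept are what make a prediction for x = 6 possible.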
Source: https://www.techtarget.com/searchcio/definition/Prescriptive-analytics
Summary - Analytics
• Descriptive Analytics tells you what happened
in the past.
• Diagnostic Analytics helps you understand
why something happened in the past.
• Predictive Analytics predicts what is most
likely to happen in the future.
• Prescriptive Analytics recommends actions
you can take to affect those outcomes.
AI vs ML vs DL vs DS
AI, ML, DL and Data Science with Data Analysis, Data Analytics and Data
Mining - all based on the foundation of #BigData
Data Science - Scientific methods, algorithms and systems to extract
knowledge or insights from big data
• Also known as Predictive or Advanced Analytics
• Algorithmic and computational techniques and tools for handling large
data sets
• Increasingly focused on preparing and modeling data for ML & DL tasks
• Encompasses statistical methods, data manipulation and streaming
technologies (e.g. Spark, Hadoop)
• Key skill and tools behind building modern AI technologies
Data Analysis - Process of inspecting, cleansing, transforming and
modeling data
Data Analytics - Discovery, interpretation, and communication of
meaningful patterns in data
Data Mining - Process of discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and
database systems
Types of Data We Have
• Relational Data (Tables/Transaction/Legacy
Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can afford to scan the data once
What To Do With These Data?
• Aggregation and Statistics
– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
Concentration in Data Science
• Mathematics and Applied Mathematics
• Applied Statistics/Data Analysis
• Solid Programming Skills (R, Python, Julia, SQL)
• Data Mining
• Database Storage and Management
• Machine Learning and discovery
• https://colab.research.google.com/drive/1kucNxA3sD3A_qyZp9OwRi_V8HVkGsiOl#scrollTo=80zUqqGRuivN