EE2211 Introduction To Machine Learning: Semester 1 2021/2022


EE2211 Introduction to Machine Learning
Lecture 2
Semester 1, 2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

EE2211 Introduction to Machine Learning 1


© Copyright EE, NUS. All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Kar-Ann / Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Data Engineering

Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
– Inductive versus Deductive Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters

Ask the right questions
if you are to find the right answers

Is artificial intelligence racist?
Racial and gender bias in AI

Joy Buolamwini, MIT Media Lab Press Kit

Two ways to connect a problem with data:

• Top-down: Problem or Questions (what problem can we solve
with the data?) → Domain Knowledge → Information (examples,
test cases) → Data (what data should be used to solve the
problem?)

• Bottom-up: Data → Information (examples, test cases) →
Domain Knowledge → Business Problem
• The data may not contain the answer.
• "The combination of some data and an aching desire for an
answer does not ensure that a reasonable answer can be
extracted from a given body of data." (John Tukey)

Types of Data
• Continuous
• Ordinal
• Categorical
• Missing
• Censored

What is data? Numbers, statistics, text, figures, records, facts.
Continuous, discrete variables
• Continuous variables are anything measured on a
quantitative scale that could be any fractional number.
– An example would be something like weight measured in kg.
• Discrete variables are numeric variables that have a
countable number of values between any two values.
(Figure: a continuous variable, temperature, plotted over time.)
Ordinal data
• Ordinal data are data that have a fixed, small number
(< 100) of possible values, called levels, that are
ordered (or ranked).

– Example: survey responses

Excellent Good Average Fair Poor

Ref: Book3, chapter 4.1.

Categorical data
• Categorical data are data where there are multiple
categories, but they are not ordered.
– Examples: gender, blood type, name of fruits and their production
regions etc.

Missing data
• Missing data are data that are absent, where you do not
know the missing mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.

NUS student        Age  Gender
Olivia Tan         20   F
Hendra Setiawan    19   M
Ah Beng            19   NA
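This coding convention can be sketched with pandas (an assumption of this illustration; the lecture does not prescribe a library). The records are the hypothetical ones from the table above, and `na_values` tells the reader which single code marks a missing entry:

```python
from io import StringIO

import pandas as pd

# Hypothetical records; every missing value uses the single code "NA"
# instead of a blank cell.
csv_text = """NUS student,Age,Gender
Olivia Tan,20,F
Hendra Setiawan,19,M
Ah Beng,19,NA
"""

# na_values turns the "NA" code into a proper missing value
df = pd.read_csv(StringIO(csv_text), na_values=["NA"])
print(df["Gender"].isna().sum())  # one missing Gender entry
```

Downstream code can then detect and handle the missing entries uniformly instead of guessing which blanks are intentional.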

Censored data
• Censored data are data where you know the missing
mechanism on some level.
– Common examples are a measurement being below a detection
limit or a patient being lost to follow-up.
– They should also be coded as NA when you don’t have the data.

NUS student        Age  Gender
Olivia Tan         20   F
Hendra Setiawan    19   M
Ah Beng            NA   M

Data
• Categorical / Qualitative
  – Nominal (no order; e.g., gender, religion)
  – Ordinal (can be ordered; e.g., small/medium/large)
• Numerical / Quantitative
  – Discrete
  – Continuous
    • Interval (no natural zero; e.g., temperature in Celsius)
    • Ratio (includes a natural zero; e.g., temperature in Kelvin)

https://i.stack.imgur.com/J8Ged.jpg

From computational viewpoint

                          Nominal  Ordinal  Interval  Ratio
Frequency distribution    Yes      Yes      Yes       Yes
Median and percentiles    No       Yes      Yes       Yes
Add or subtract           No       No       Yes       Yes
Mean, standard deviation  No       No       Yes       Yes
Ratio                     No       No       No        Yes
Data Wrangling
• Data wrangling (cleaning + transform) is the process of
transforming and mapping data from one "raw" data
form into another format to make it more appropriate for
downstream analytics. (https://en.wikipedia.org/wiki/Data_wrangling)
• e.g. scaling, clipping, z-score (see next page)
• Data wrangling should not be performed blindly. We
should know the reason for wrangling and why each step
is needed.

Import → Clean → Transform → (Model / Visualize) → Output

Example

• Scaling to a range
– When the bounds or range of each independent dimension of
data is known, a common normalization technique is min-max.
• Feature clipping

https://developers.google.com/machine-learning/data-prep/transform/normalization
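A minimal sketch of both transforms in plain Python (the function names are mine, not from the Google guide):

```python
def min_max_scale(values, lo, hi):
    """Scale values from the known range [lo, hi] into [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, lo, hi):
    """Cap extreme values at the chosen bounds [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

print(min_max_scale([10, 15, 20], 10, 20))  # [0.0, 0.5, 1.0]
print(clip([-5, 3, 99], 0, 10))             # [0, 3, 10]
```

Min-max scaling assumes the bounds are genuinely known: a single value outside [lo, hi] would land outside [0, 1], which is exactly the situation feature clipping handles.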

Example
• Z-score standardization
– When the population of measurements of each independent
dimension of data is normally distributed where the parameters
are known, the standard score or z-score is a popular choice.

https://developers.google.com/machine-learning/data-prep/transform/normalization
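A corresponding sketch in plain Python, using the population standard deviation from the standard library:

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

z = z_score([2.0, 4.0, 6.0])
print(z)  # symmetric around 0: roughly [-1.22, 0.0, 1.22]
```

In practice the mean and standard deviation are estimated on the training data and then re-used to transform any later data, so that all splits share the same scale.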

Example (continuous vs discrete?)

Iris data set


• Measurement features (sepal and petal lengths and widths)
can be packed as a matrix, one row per flower.

• Labels (the species) can be written as a vector.
https://archive.ics.uci.edu/ml/datasets/iris
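For instance, a few Iris-style rows can be packed into a feature matrix X and a label vector y. This assumes NumPy, and the measurement values below are illustrative rather than quoted from the dataset:

```python
import numpy as np

# One row per flower: sepal length, sepal width, petal length, petal width
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])

# Class labels encoded as integers: 0 = setosa, 1 = versicolor, 2 = virginica
y = np.array([0, 1, 2])

print(X.shape, y.shape)  # (3, 4) (3,)
```

The measurements are continuous, while the label vector is discrete: a categorical species name mapped onto integer codes.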

Example (ordinal)
• For data of ordinal scale, the exact numerical value
has no significance over the ranking order that it
carries.
• Suppose the ranks are given by r = 1, ...,R. Then, the
ranks can be normalized into standardized distance
values (d) which fall within [0, 1] using

Excellent Good Average Fair Poor

r = 5 4 3 2 1
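The slide's normalization formula was given as a figure. A common choice consistent with the description, mapping ranks 1..R onto [0, 1], is d = (r - 1)/(R - 1); the sketch below uses that formula as an assumption, not as the slide's exact expression:

```python
def rank_to_distance(r, R):
    """Map a rank r in 1..R to a standardized distance value in [0, 1].

    d = (r - 1) / (R - 1) is one common normalization for ordinal data;
    only the relative order of the resulting values carries meaning.
    """
    return (r - 1) / (R - 1)

# Five survey levels: Poor = 1 ... Excellent = 5
print([rank_to_distance(r, 5) for r in range(1, 6)])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```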

Example (categorical)
• Assign arbitrary numbers to represent the attributes.

– For example, one can assign a ‘1’ value for male and a ‘2’ value
for female for the gender attribute, or a ‘1’ value for spam and a
‘0’ value for non-spam as in Lecture 1.

– However, the label assigned the higher value may have a greater
influence than the one with the lower value when extremely large
and extremely small values arise along the computational process.

Example (categorical)
• Binary coding
– Common binary coding schemes include binary-coded decimal,
one-hot encoding, and n-ary Gray codes.
– Sophisticated coding schemes take into account the probability
distribution of each attribute during conversion.
– One-hot encoding:
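A minimal one-hot encoder in plain Python (my own sketch, not the lecture's code): each category becomes a vector with a single 1, so no category is numerically "larger" than another.

```python
def one_hot(labels):
    """Encode categorical labels as one-hot vectors."""
    categories = sorted(set(labels))          # fix a stable category order
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[c] == i else 0 for i in range(len(categories))]
            for c in labels]

print(one_hot(["apple", "banana", "apple"]))
# [[1, 0], [0, 1], [1, 0]]
```

This avoids the ordering artifact of the arbitrary-number scheme on the previous slide, at the cost of one extra dimension per category.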

Data Cleaning
Data cleansing or data cleaning is the process of detecting and
correcting (or removing) corrupt or inaccurate records from a
record set, table, or database.

Example (missing features)


• Dealing with missing features
– Removing the examples with missing features from the dataset
(that can be done if your dataset is big enough so you can sacrifice
some training examples);
– Using a learning algorithm that can deal with missing feature
values (depends on the library and a specific implementation of the
algorithm);
– Using a data imputation technique.

Example (missing features)

Students      Year of Birth  Gender  Height  GPA
Tan Ah Kow    1995           M       1.72    4.2
Ahmad Abdul   X              M       1.65    4.1
John Smith    1995           M       1.75    X
Chen Lulu     1995           F       X       4.0
Raj Kumar     1995           M       1.73    4.5
Li Xiuxiu     1994           F       1.70    3.8

Example (missing features)
• Imputation
– Replace the missing value of a feature by the average value
of this feature in the dataset.

– Replace the missing value with a value outside the normal range of
values.
• For example, if the normal range is [0, 1], then you can set the missing
value to 2 or −1. The idea is that the learning algorithm will learn what is
best to do when the feature has a value significantly different from
regular values.
• Alternatively, you can replace the missing value by a value in the middle
of the range. For example, if the range for a feature is [−1, 1], you can
set the missing value to be equal to 0. Here, the idea is that the value in
the middle of the range will not significantly affect the prediction.
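The imputation choices above can be sketched as follows (plain Python; mean imputation in full, the fixed-value variants as one-liners, and the example values are hypothetical):

```python
from statistics import mean

def impute_mean(values, missing=None):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not missing]
    fill = mean(observed)
    return [fill if v is missing else v for v in values]

heights = [1.72, 1.65, 1.75, None, 1.73, 1.70]
print(impute_mean(heights))  # the gap becomes the mean, about 1.71

# Fixed-value variants: fill with a value outside the normal range
# (e.g., -1.0) so the model can learn to treat it specially, or with
# the middle of the range (e.g., 0.0 for a [-1, 1] feature) so the
# fill barely affects the prediction.
fill_outside = [v if v is not None else -1.0 for v in heights]
```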

Data integrity

• Data integrity is the maintenance of, and the assurance of
the accuracy and consistency of, data over its entire
life-cycle.
(Ref: https://en.wikipedia.org/wiki/Data_integrity)

• It is a critical aspect of the design, implementation and
usage of any system which stores, processes, or retrieves
data.
– Physical integrity (error-correction codes, check-sums, redundancy)
– Logical integrity (e.g., a product price must be positive; use a
drop-down list to constrain input)

Data Integrity

http://www.financetwitter.com/
Data Visualization

• https://towardsdatascience.com/data-visualization-with-mathplotlib-using-python-a7bfb4628ee3

Example: Showing Distribution

(a) a probability mass function, (b) a probability density function

Caution: Same Statistics But Different Distribution
Anscombe's quartet (wiki)
You can explore data by calculating summary statistics,
for example the correlation between variables.

However, all four of these data sets have exactly the
same correlation and the same regression line.

By Anscombe.svg: Schutz. Derivative works of this file (label using subscripts): Avenue - Anscombe.svg, CC BY-SA
3.0, https://commons.wikimedia.org/w/index.php?curid=9838454

• Data sets with identical means, variances and regression lines
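The first two sets of the quartet make the point numerically. The values below are copied from Anscombe's published data, and the correlation function is a plain-Python sketch of the Pearson coefficient:

```python
from statistics import mean

# Sets I and II of Anscombe's quartet (shared x values)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def corr(a, b):
    """Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

print(round(mean(y1), 2), round(mean(y2), 2))        # both about 7.5
print(round(corr(x, y1), 3), round(corr(x, y2), 3))  # both about 0.816
```

Identical summary numbers, yet set I is roughly linear with noise while set II follows a smooth curve, a difference only a plot reveals.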

Example: Boxplot

Michael Galarnyk, Understanding boxplot, 2018

Caution: Same Boxplots But Different Distribution

• These boxplots look very similar, but if you overlay the actual data points
you can see that they have very different distributions.

Example: Showing Composition/ Comparison

• If we redraw this pie chart as a bar chart, it is much easier to see that A is bigger than D.

Example: Log Scale

• Without a logarithmic scale, 90% of the data fall in the lower left-hand corner of this figure.

The End

