EE2211 Introduction To Machine Learning: Semester 1 2021/2022


EE2211 Introduction to Machine Learning
Lecture 2
Semester 1, 2021/2022

Li Haizhou ([email protected])

Electrical and Computer Engineering Department


National University of Singapore

Acknowledgement:
EE2211 development team
(Thomas, Kar-Ann, Chen Khong, Helen, Robby and Haizhou)

EE2211 Introduction to Machine Learning 1


© Copyright EE, NUS. All Rights Reserved.
Course Contents
• Introduction and Preliminaries (Haizhou)
– Introduction
– Data Engineering
– Introduction to Probability and Statistics
• Fundamental Machine Learning Algorithms I (Kar-Ann / Helen)
– Systems of linear equations
– Least squares, Linear regression
– Ridge regression, Polynomial regression
• Fundamental Machine Learning Algorithms II (Thomas)
– Over-fitting, bias/variance trade-off
– Optimization, Gradient descent
– Decision Trees, Random Forest
• Performance and More Algorithms (Haizhou)
– Performance Issues
– K-means Clustering
– Neural Networks

Data Engineering

Module I Contents
• What is Machine Learning and Types of Learning
• How Supervised Learning works
• Regression and Classification Tasks
– Inductive versus Deductive Reasoning
• Types of data
• Data wrangling and cleaning
• Data integrity and visualization
• Causality and Simpson’s paradox
• Random variable, Bayes’ rule
• Parameter estimation
• Parameters vs. Hyperparameters

Ask the right questions
if you are to find the right answers

Is artificial intelligence racist?
Racial and gender bias in AI

Joy Buolamwini, MIT Media Lab Press Kit

Two ways to connect a problem with data:

• Top-down: Problem or Questions (what problem can we solve
with the data?) → Domain Knowledge → Information (examples,
test cases) → Data (what data should be used to solve the
problem?)

• Bottom-up: Data → Information (examples, test cases) →
Domain Knowledge → Business Problem
• The data may not contain the answer.
• "The combination of some data and an aching desire for an
answer does not ensure that a reasonable answer can be
extracted from a given body of data." (John Tukey)

Types of Data
• Continuous
• Ordinal
• Categorical
• Missing
• Censored

What is data? Numbers, statistics, text, figures, records, facts.
Continuous, discrete variables
• Continuous variables are anything measured on a
quantitative scale that could be any fractional number.
– An example would be something like weight measured in kg.
• Discrete variables are numeric variables that have a
countable number of values between any two values.
(Figure: a continuous variable, temperature, plotted over time.)
Ordinal data
• Ordinal data are data that have a fixed, small number
(< 100) of possible values, called levels, that are
ordered (or ranked).

– Example: survey responses

Excellent Good Average Fair Poor

Ref: Book3, chapter 4.1.

Categorical data
• Categorical data are data where there are multiple
categories, but they are not ordered.
– Examples: gender, blood type, name of fruits and their production
regions etc.

Missing data
• Missing data are data that are absent, where you do not
know the missing mechanism.
– You should use a single common code for all missing values (for
example, “NA”), rather than leaving any entries blank.

NUS student        Age  Gender
Olivia Tan         20   F
Hendra Setiawan    19   M
Ah Beng            19   NA
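This coding convention can be sketched with pandas (an assumption of this illustration; the lecture does not prescribe a library). The records are the hypothetical ones from the table above, and `na_values` tells the reader which single code marks a missing entry:

```python
from io import StringIO

import pandas as pd

# Hypothetical records; every missing value uses the single code "NA"
# instead of a blank cell.
csv_text = """NUS student,Age,Gender
Olivia Tan,20,F
Hendra Setiawan,19,M
Ah Beng,19,NA
"""

# na_values turns the "NA" code into a proper missing value
df = pd.read_csv(StringIO(csv_text), na_values=["NA"])
print(df["Gender"].isna().sum())  # one missing Gender entry
```

Downstream code can then detect and handle the missing entries uniformly instead of guessing which blanks are intentional.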

Censored data
• Censored data are data where you know the missing
mechanism on some level.
– Common examples are a measurement being below a detection
limit or a patient being lost to follow-up.
– They should also be coded as NA when you don’t have the data.

NUS student        Age  Gender
Olivia Tan         20   F
Hendra Setiawan    19   M
Ah Beng            NA   M

Data
• Categorical / Qualitative
  – Nominal (no order; e.g., gender, religion)
  – Ordinal (can be ordered; e.g., small/medium/large)
• Numerical / Quantitative
  – Discrete
  – Continuous
    • Interval (no natural zero; e.g., temperature in Celsius)
    • Ratio (includes a natural zero; e.g., temperature in Kelvin)

https://i.stack.imgur.com/J8Ged.jpg

From computational viewpoint

                          Nominal  Ordinal  Interval  Ratio
Frequency distribution    Yes      Yes      Yes       Yes
Median and percentiles    No       Yes      Yes       Yes
Add or subtract           No       No       Yes       Yes
Mean, standard deviation  No       No       Yes       Yes
Ratio                     No       No       No        Yes
Data Wrangling
• Data wrangling (cleaning + transform) is the process of
transforming and mapping data from one "raw" data
form into another format to make it more appropriate for
downstream analytics. (https://en.wikipedia.org/wiki/Data_wrangling)
• e.g. scaling, clipping, z-score (see next page)
• Data wrangling should not be performed blindly. We
should know the reason for wrangling and why each step
is needed.

Import → Clean → Transform → (Model / Visualize) → Output

Example

• Scaling to a range
– When the bounds or range of each independent dimension of
data is known, a common normalization technique is min-max.
• Feature clipping

https://developers.google.com/machine-learning/data-prep/transform/normalization
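A minimal sketch of both transforms in plain Python (the function names are mine, not from the Google guide):

```python
def min_max_scale(values, lo, hi):
    """Scale values from the known range [lo, hi] into [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, lo, hi):
    """Cap extreme values at the chosen bounds [lo, hi]."""
    return [min(max(v, lo), hi) for v in values]

print(min_max_scale([10, 15, 20], 10, 20))  # [0.0, 0.5, 1.0]
print(clip([-5, 3, 99], 0, 10))             # [0, 3, 10]
```

Min-max scaling assumes the bounds are genuinely known: a single value outside [lo, hi] would land outside [0, 1], which is exactly the situation feature clipping handles.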

Example
• Z-score standardization
– When the population of measurements of each independent
dimension of data is normally distributed where the parameters
are known, the standard score or z-score is a popular choice.

https://developers.google.com/machine-learning/data-prep/transform/normalization
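A corresponding sketch in plain Python, using the population standard deviation from the standard library:

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

z = z_score([2.0, 4.0, 6.0])
print(z)  # symmetric around 0: roughly [-1.22, 0.0, 1.22]
```

In practice the mean and standard deviation are estimated on the training data and then re-used to transform any later data, so that all splits share the same scale.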

Example (continuous vs discrete?)

Iris data set


• Measurement features (sepal and petal lengths and widths)
can be packed as a matrix, one row per flower.

• Labels (the species) can be written as a vector.
https://archive.ics.uci.edu/ml/datasets/iris
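For instance, a few Iris-style rows can be packed into a feature matrix X and a label vector y. This assumes NumPy, and the measurement values below are illustrative rather than quoted from the dataset:

```python
import numpy as np

# One row per flower: sepal length, sepal width, petal length, petal width
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [7.0, 3.2, 4.7, 1.4],
              [6.3, 3.3, 6.0, 2.5]])

# Class labels encoded as integers: 0 = setosa, 1 = versicolor, 2 = virginica
y = np.array([0, 1, 2])

print(X.shape, y.shape)  # (3, 4) (3,)
```

The measurements are continuous, while the label vector is discrete: a categorical species name mapped onto integer codes.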

Example (ordinal)
• For data of ordinal scale, the exact numerical value
has no significance over the ranking order that it
carries.
• Suppose the ranks are given by r = 1, ...,R. Then, the
ranks can be normalized into standardized distance
values (d) which fall within [0, 1] using

Excellent Good Average Fair Poor

r = 5 4 3 2 1
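The slide's normalization formula was given as a figure. A common choice consistent with the description, mapping ranks 1..R onto [0, 1], is d = (r - 1)/(R - 1); the sketch below uses that formula as an assumption, not as the slide's exact expression:

```python
def rank_to_distance(r, R):
    """Map a rank r in 1..R to a standardized distance value in [0, 1].

    d = (r - 1) / (R - 1) is one common normalization for ordinal data;
    only the relative order of the resulting values carries meaning.
    """
    return (r - 1) / (R - 1)

# Five survey levels: Poor = 1 ... Excellent = 5
print([rank_to_distance(r, 5) for r in range(1, 6)])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```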

Example (categorical)
• Assign arbitrary numbers to represent the attributes.

– For example, one can assign a ‘1’ value for male and a ‘2’ value
for female for the gender attribute, or a ‘1’ value for spam and a
‘0’ value for non-spam as in Lecture 1.

– However, the label assigned the higher value may have a greater
influence than the one with the lower value when extremely large
and extremely small values arise along the computational process.

Example (categorical)
• Binary coding
– Common binary coding schemes include binary-coded decimal,
one-hot encoding, and n-ary Gray codes.
– Sophisticated coding schemes take into account the probability
distribution of each attribute during conversion.
– One-hot encoding:
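A minimal one-hot encoder in plain Python (my own sketch, not the lecture's code): each category becomes a vector with a single 1, so no category is numerically "larger" than another.

```python
def one_hot(labels):
    """Encode categorical labels as one-hot vectors."""
    categories = sorted(set(labels))          # fix a stable category order
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[c] == i else 0 for i in range(len(categories))]
            for c in labels]

print(one_hot(["apple", "banana", "apple"]))
# [[1, 0], [0, 1], [1, 0]]
```

This avoids the ordering artifact of the arbitrary-number scheme on the previous slide, at the cost of one extra dimension per category.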

Data Cleaning
Data cleansing or data cleaning is the process of detecting and
correcting (or removing) corrupt or inaccurate records from a
record set, table, or database.

Example (missing features)


• Dealing with missing features
– Removing the examples with missing features from the dataset
(that can be done if your dataset is big enough so you can sacrifice
some training examples);
– Using a learning algorithm that can deal with missing feature
values (depends on the library and a specific implementation of the
algorithm);
– Using a data imputation technique.

Example (missing features)

Students      Year of Birth  Gender  Height  GPA
Tan Ah Kow    1995           M       1.72    4.2
Ahmad Abdul   X              M       1.65    4.1
John Smith    1995           M       1.75    X
Chen Lulu     1995           F       X       4.0
Raj Kumar     1995           M       1.73    4.5
Li Xiuxiu     1994           F       1.70    3.8

Example (missing features)
• Imputation
– Replace the missing value of a feature by the average value
of this feature in the dataset.

– Replace the missing value with a value outside the normal range of
values.
• For example, if the normal range is [0, 1], then you can set the missing
value to 2 or −1. The idea is that the learning algorithm will learn what is
best to do when the feature has a value significantly different from
regular values.
• Alternatively, you can replace the missing value by a value in the middle
of the range. For example, if the range for a feature is [−1, 1], you can
set the missing value to be equal to 0. Here, the idea is that the value in
the middle of the range will not significantly affect the prediction.
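The imputation choices above can be sketched as follows (plain Python; mean imputation in full, the fixed-value variants as one-liners, and the example values are hypothetical):

```python
from statistics import mean

def impute_mean(values, missing=None):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if v is not missing]
    fill = mean(observed)
    return [fill if v is missing else v for v in values]

heights = [1.72, 1.65, 1.75, None, 1.73, 1.70]
print(impute_mean(heights))  # the gap becomes the mean, about 1.71

# Fixed-value variants: fill with a value outside the normal range
# (e.g., -1.0) so the model can learn to treat it specially, or with
# the middle of the range (e.g., 0.0 for a [-1, 1] feature) so the
# fill barely affects the prediction.
fill_outside = [v if v is not None else -1.0 for v in heights]
```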

Data integrity

• Data integrity is the maintenance of, and the assurance of
the accuracy and consistency of, data over its entire
life-cycle.
(Ref: https://en.wikipedia.org/wiki/Data_integrity)

• It is a critical aspect of the design, implementation and
usage of any system which stores, processes, or retrieves
data.
– Physical integrity (error-correction codes, check-sums, redundancy)
– Logical integrity (e.g., a product price must be positive; use a
drop-down list to constrain input)

Data Integrity

http://www.financetwitter.com/
Data Visualization

• https://towardsdatascience.com/data-visualization-with-mathplotlib-using-python-a7bfb4628ee3

Example: Showing Distribution

(a) a probability mass function, (b) a probability density function

Caution: Same Statistics But Different Distribution
Anscombe's quartet (wiki)
You can explore data by calculating summary statistics,
for example the correlation between variables.

However, all four of these data sets have exactly the
same correlation and the same regression line.

By Anscombe.svg: Schutz. Derivative works of this file (label using subscripts): Avenue - Anscombe.svg, CC BY-SA
3.0, https://commons.wikimedia.org/w/index.php?curid=9838454

• Data sets with identical means, variances and regression lines
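The first two sets of the quartet make the point numerically. The values below are copied from Anscombe's published data, and the correlation function is a plain-Python sketch of the Pearson coefficient:

```python
from statistics import mean

# Sets I and II of Anscombe's quartet (shared x values)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def corr(a, b):
    """Pearson correlation coefficient."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

print(round(mean(y1), 2), round(mean(y2), 2))        # both about 7.5
print(round(corr(x, y1), 3), round(corr(x, y2), 3))  # both about 0.816
```

Identical summary numbers, yet set I is roughly linear with noise while set II follows a smooth curve, a difference only a plot reveals.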

Example: Boxplot

Michael Galarnyk, Understanding boxplot, 2018

Caution: Same Boxplots But Different Distribution

• These boxplots look very similar, but if you overlay the actual data points
you can see that they have very different distributions.

Example: Showing Composition/ Comparison

• If we redraw this pie chart as a bar chart, it is much easier to see that A is bigger than D.

Example: Log Scale

• Without a logarithmic scale, 90% of the data fall in the lower left-hand corner of this figure.

The End

