
Principles of Machine Learning

DSA 5105 • Lecture 1

Soufiane Hayou
Department of Mathematics
Logistics
• Instructor
Soufiane Hayou (Department of Mathematics)
Office: S17-05-08
Email: [email protected]
• Notes, slides and code are/will be available on Canvas
• Assessment
• Homework (3 x 5%) + Project (15%) (Individual) [30%]
• Mid-term Test [20%]
• Final Exam [50%]
Logistics
Lectures
• F2F (LT28) + Live (Zoom); recordings will be uploaded
• You are encouraged to turn on your camera (if you are following on Zoom)
• Active participation is encouraged
• Consultation arrangements will be announced later
Introduction
Can machines think?
I propose to consider the question, “Can
machines think?”. This should begin with
definitions of the meaning of the terms
"machine" and "think." The definitions might
be framed so as to reflect so far as possible
the normal use of the words, but this attitude
is dangerous…

Alan Turing, 1950

Turing, A. M. (1950). "Computing machinery and intelligence". Mind 59 (236): 433–460.


The Imitation Game

I believe that in about fifty years' time it will be possible to programme computers … to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.
Alan M Turing, 1950
https://en.wikipedia.org/wiki/Turing_test#/media/File:Turing_test_diagram.png
From artificial intelligence to machine learning

Can machines think?

Can machines do what thinking beings do?

How can machines learn to do some things that thinking beings do?

In this class, we are interested in the study of algorithmic approaches to learning.
A concrete definition of learning
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with
experience E.
Tom Mitchell

[Illustration: computation vs. learning, using the expression 8 ÷ 2(2+2)]
• Computation: given the program calc(*args) and the input 8 ÷ 2(2+2), produce the output 16
• Learning: given the input 8 ÷ 2(2+2) and the output 16, infer the program calc(*args)
Topics
• Supervised learning
• Linear and nonlinear models
• Basic learning and approximation theory
• Learning/optimization algorithms
• Unsupervised learning
• Dimensionality reduction, clustering and generative models
• Reinforcement learning
• Markov decision processes, reinforcement learning
algorithms
What this class is
• A (hopefully) gentle introduction to the exciting world of
machine learning
• A holistic view of the modern interplay of machine
learning with mathematics, statistics, computer science,
physical sciences and engineering

What this class isn’t


• A comprehensive survey of state-of-the-art machine
learning models and methods
• A “math class”
Preliminaries
[Survey results: types of data used, math/Python background, prior ML experience, whether this is an introductory class, expectations]
Representing data in computers
Many kinds of data are numerical in nature

Other examples
• Video captures
• Financial time series
• Numerical measurements from experiments
What about general discrete data?
We make an important distinction
• Ordinal data
Data that has a natural notion of order, e.g.
• Star ratings of a product
• Level of language proficiency
• Letter grades of a class
• Nominal data
Data that has no order, e.g.
• Categories of image classification
• Answers to True/False questions
We need to embed these discrete data into something we can
represent on a computer, e.g. real/floating point numbers
The type of embedding depends on the nature of the data!
• Ordinal data
We want the embedding to preserve this ordering, so we typically use real numbers

• Nominal data
This is somewhat the opposite: we want the embedding to not introduce spurious ordering, e.g. one-hot embedding
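A minimal sketch of both embeddings in Python (the category lists below are hypothetical examples, not from the slides):

```python
import numpy as np

# Ordinal data: map ordered categories to real numbers that preserve the order
grades = ["C", "B", "A"]                    # ordered categories
grade_to_value = {g: i for i, g in enumerate(grades)}
print(grade_to_value["B"])                  # 1

# Nominal data: one-hot embedding introduces no spurious ordering
categories = ["cat", "dog", "bird"]         # unordered categories

def one_hot(label, categories):
    """Return a vector with a single 1 at the label's position."""
    vec = np.zeros(len(categories))
    vec[categories.index(label)] = 1.0
    return vec

print(one_hot("dog", categories))           # [0. 1. 0.]
```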
Classes of machine learning problems
• Supervised Learning: Regression, Classification, Function approximation, Inverse problems/design, …
• Unsupervised Learning: Clustering, Dimensionality reduction, Generative models, Anomaly detection, …
• Reinforcement Learning: Value iteration, Policy gradient, Actor-critic, Exploration, …

There are many intersections between them!


Evaluation and Selection using Data
In more quantitative terms, given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, we split it into $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}$
• $\mathcal{D}_{\text{train}}$ is called the training set, and it is used to train our machine learning model
• $\mathcal{D}_{\text{test}}$ is called the testing set, and it is used to evaluate the performance of our model. We should not peek at this set while training!
• An additional splitting into a validation set is sometimes used to perform hyper-parameter tuning and model selection
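In practice this split is often done with scikit-learn (a minimal sketch; the 80/10/10 proportions and the random seed are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5)   # toy inputs
y = np.random.randn(1000)      # toy labels

# Hold out 20% of the data, then split the held-out part half-and-half
# into validation (for model selection) and test (for final evaluation).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```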
Supervised Learning
What is supervised learning?
Supervised learning is the simplest and most prevalent type of machine learning problem

It is about learning to make predictions

Examples
• Image recognition
• Weather prediction
• Stock price prediction
• …
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
Inputs: $x_i$ • Outputs/labels: $y_i$ • Data size: $N$
Goal: learn the relationship $x \mapsto y$ from $\mathcal{D}$

[Illustration: an oracle maps inputs to labels, e.g. $x_1 \mapsto$ "Cat", $x_2 \mapsto$ "Dog"]

The oracle can be
• Deterministic: $y_i = f(x_i)$
• Random: $y_i \sim p(y \mid x_i)$, e.g. $y_i = f(x_i) + \text{noise}$
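A minimal sketch of sampling data from each kind of oracle (the sine oracle and the Gaussian noise are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)             # an illustrative oracle

x = rng.uniform(0, 1, size=100)
y_deterministic = f(x)                          # y_i = f(x_i)
y_random = f(x) + 0.1 * rng.normal(size=100)    # y_i = f(x_i) + noise
```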
Hypothesis space
The oracle $f$ is unknown to us, except through the dataset $\mathcal{D}$

The supervised learning approach:
1. Define a hypothesis space $\mathcal{H}$ consisting of a set of candidate functions, e.g. $\mathcal{H} = \{\hat{f}(x) = wx + b : w, b \in \mathbb{R}\}$
2. Find the "best" function $\hat{f}$ in $\mathcal{H}$ that approximates $f$

What you get depends on $\mathcal{H}$!

Curve fitting methods and the message they send. https://xkcd.com/2048/


What does best approximation mean?
It is useful to define a loss function $L(\hat{f}(x), y)$ which is small if $\hat{f}(x) \approx y$ and large otherwise. Then, we can find the best approximation by solving an optimization problem
$$\hat{f} = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$
This is called empirical risk minimization (ERM)
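The ERM objective translates directly into code (a sketch; the square loss shown is just one possible choice of $L$):

```python
import numpy as np

def empirical_risk(f, xs, ys, loss=lambda yhat, y: (yhat - y) ** 2):
    """Average loss of model f over the dataset {(x_i, y_i)}."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# ERM then means: search the hypothesis space for the f minimizing this value.
```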
So, is learning just optimization?
We want to do well on unseen data! In other words, our model
must generalize.
What we can solve: empirical risk minimization
$$\hat{f} = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$

What we really want to solve: population risk minimization
$$\tilde{f} = \operatorname{argmin}_{f \in \mathcal{H}} \mathbb{E}_{(x, y)}[L(f(x), y)]$$

The difference in performance between $\hat{f}$ and $\tilde{f}$ is the generalization gap.
Three paradigms of supervised learning

[Diagram: the oracle $f$, its best approximation $\tilde{f}$ in the hypothesis space $\mathcal{H}$, and the trained model $\hat{f}$]
• Approximation: how well can functions in $\mathcal{H}$ approximate the oracle $f$?
• Generalization: how close is $\hat{f}$ to $\tilde{f}$ in performance on unseen data?
• Optimization: how do we actually find $\hat{f}$ (using $\mathcal{D}$)?
Linear Models
Simple linear regression

This is the simplest case, where the $x_i$'s and $y_i$'s are all scalars

Step 1: Define hypothesis space
$$\mathcal{H} = \{f(x) = wx + b : w, b \in \mathbb{R}\}$$

Step 2: Find the best approximation

We need to define a loss function, e.g. the square loss $L(\hat{y}, y) = (\hat{y} - y)^2$

Then, the empirical risk minimization problem is
$$\min_{w, b} \frac{1}{N} \sum_{i=1}^{N} (w x_i + b - y_i)^2$$

Solution (the Ordinary Least Squares Formula, 1D):
$$\hat{w} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{w}\bar{x}$$
where $\bar{x}$ and $\bar{y}$ denote the sample means.
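A sketch of the 1D OLS formula in numpy, checked on synthetic data (the true slope and intercept are illustrative):

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form simple linear regression: returns (w_hat, b_hat)."""
    x_bar, y_bar = x.mean(), y.mean()
    w_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b_hat = y_bar - w_hat * x_bar
    return w_hat, b_hat

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=50)   # noisy line, true w=2, b=0.5
print(ols_1d(x, y))                             # approximately (2.0, 0.5)
```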
Approximation
Is the linear hypothesis space large enough?

[Figures: a linear fit that captures the trend of the data vs. a linear fit to clearly nonlinear data]
The figure on the right is an instance of under-fitting


Overfitting and generalization
Polynomial hypothesis space: $\mathcal{H}_M = \{f(x) = \sum_{m=0}^{M} w_m x^m\}$

If the hypothesis space is too big, over-fitting can happen, with or without noise!
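One way to see this empirically (a sketch using numpy's polynomial fitting; the dataset and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                     # ERM over H_M, square loss
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # Training error shrinks as M grows, but degree 9 interpolates the
    # noise: low training error, poor generalization.
    print(degree, train_err)
```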
The role of loss functions
So far, we only considered the mean-square loss $L(\hat{y}, y) = (\hat{y} - y)^2$

There are many other choices, e.g. the Huber loss
$$L_\delta(\hat{y}, y) = \begin{cases} \frac{1}{2}(\hat{y} - y)^2 & \text{if } |\hat{y} - y| \le \delta \\ \delta\left(|\hat{y} - y| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
Mean square vs Huber loss in regression
We perform a linear regression on a noisy dataset with outliers.
What do you observe?
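A sketch of such a comparison with scikit-learn's LinearRegression (mean-square loss) and HuberRegressor (Huber loss); the outlier construction is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1))
y = 2.0 * X.ravel() + 0.1 * rng.normal(size=100)
y[:5] += 10.0                                    # inject a few large outliers

ols = LinearRegression().fit(X, y)               # mean-square loss
huber = HuberRegressor().fit(X, y)               # Huber loss
print(ols.coef_, huber.coef_)                    # the Huber fit stays closer to slope 2
```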
General linear basis models
The simple linear regression we have seen is quite limited
• Only for 1D inputs
• Can only fit linear relationships

It turns out that we can easily generalize the previous approach by considering linear basis models
General linear basis models
Consider basis functions $\phi_0, \ldots, \phi_{M-1}$ and the new hypothesis space
$$\mathcal{H} = \left\{ f(x) = \sum_{m=0}^{M-1} w_m \phi_m(x) \right\}$$

Each $\phi_m$ is called a basis function or feature map

Why is this a generalization?
• Take $M = 2$ with $\phi_0(x) = 1$ and $\phi_1(x) = x$ to recover simple linear regression
• In general, $M$ can be large and the $\phi_m$'s can be highly nonlinear, but $f$ is linear in $w$
Examples of basis functions
Some choices of basis functions in 1D
• Polynomial basis: $\phi_m(x) = x^m$
• Gaussian basis: $\phi_m(x) = \exp\left(-\frac{(x - \mu_m)^2}{2s^2}\right)$
• Sigmoid basis: $\phi_m(x) = \sigma\left(\frac{x - \mu_m}{s}\right)$ with $\sigma(a) = \frac{1}{1 + e^{-a}}$
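These basis functions are easy to write down in code (a sketch; the centres $\mu_m$ and scale $s$ are illustrative choices):

```python
import numpy as np

mus = np.linspace(0, 1, 9)          # basis-function centres (assumed placement)
s = 0.1                             # common scale

polynomial = lambda x, m: x ** m
gaussian   = lambda x, m: np.exp(-(x - mus[m]) ** 2 / (2 * s ** 2))
sigmoid    = lambda x, m: 1.0 / (1.0 + np.exp(-(x - mus[m]) / s))

def design_matrix(x, basis, M):
    """N x M matrix Phi with Phi[i, m] = phi_m(x_i)."""
    return np.column_stack([basis(x, m) for m in range(M)])
```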
Ordinary least squares for linear basis models

The empirical risk minimization problem is now
$$\min_{w} \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{m=0}^{M-1} w_m \phi_m(x_i) - y_i \right)^2$$

We can rewrite it in compact form
$$\min_{w} \frac{1}{N} \|\Phi w - y\|^2$$
where
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \ddots & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \quad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

We want to solve
$$\min_{w} \|\Phi w - y\|^2$$

We can do this by setting the gradient to zero: $\nabla_w \|\Phi w - y\|^2 = 2\Phi^\top(\Phi w - y) = 0$

Suppose $\Phi^\top \Phi$ is invertible; then we have $\Phi^\top \Phi w = \Phi^\top y$

Rearranging, we have the General Ordinary Least Squares Formula
$$\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
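A sketch of the general OLS formula in numpy: solving the normal equations directly, then checking against np.linalg.lstsq (which is preferred numerically):

```python
import numpy as np

def ols(Phi, y):
    """Solve the normal equations Phi^T Phi w = Phi^T y for the OLS weights."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Example: degree-3 polynomial features on toy data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)
Phi = np.column_stack([x ** m for m in range(4)])

w_hat = ols(Phi, y)
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))   # True
```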

What happens if $\Phi^\top \Phi$ is not invertible, i.e. it is singular?

In the singular case, we have an infinite number of solutions, all of which attain the minimum of $\|\Phi w - y\|^2$. They are given by
$$\hat{w} = \Phi^{+} y + (I - \Phi^{+}\Phi)\, z, \quad z \text{ arbitrary}$$
Here, $\Phi^{+}$ denotes the Moore-Penrose pseudoinverse of $\Phi$.

How do we pick a solution?
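numpy's pinv computes the Moore-Penrose pseudoinverse, and $\hat{w} = \Phi^{+} y$ (the $z = 0$ choice) is the minimum-norm solution. A sketch on a deliberately singular design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
# Duplicated feature columns make Phi^T Phi singular
Phi = np.column_stack([np.ones(30), x, x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)

w_min_norm = np.linalg.pinv(Phi) @ y   # minimum-norm least-squares solution
print(w_min_norm)                      # the weight for x is shared across the two copies
```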


Regularization
Often, it is advantageous to consider the regularized least squares problem
$$\min_{w} \frac{1}{N}\|\Phi w - y\|^2 + \lambda R(w)$$
where $R(w)$ is the regularizer and $\lambda > 0$ controls its strength

Types of regularization
• $\ell_2$ regularization: $R(w) = \|w\|_2^2$ (ridge regression)
• $\ell_1$ regularization: $R(w) = \|w\|_1$ (least absolute shrinkage and selection operator, or lasso)
• …
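For ridge regression, the same gradient calculation gives the closed form $\hat{w}_\lambda = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$ (absorbing the $1/N$ factor into $\lambda$). A sketch:

```python
import numpy as np

def ridge(Phi, y, lam):
    """Closed-form ridge regression: (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
Phi = np.column_stack([x ** m for m in range(10)])   # rich basis, prone to over-fit
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

print(ridge(Phi, y, lam=1e-3))   # the weights shrink as lam grows
```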
Regularization and generalization
We apply $\ell_2$ regularization on the over-fitting examples

Recall: for $\lambda > 0$ the ridge solution $\hat{w}_\lambda = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$ always exists and is unique, but larger $\lambda$ shrinks the weights toward zero

[Figures: polynomial fits without regularization vs. with regularization]


Classification using linear basis models

In $K$-class classification problems, each $y_i$ takes on the class label of one of $K$ classes.

We will use the one-hot encoding introduced earlier to represent each $y_i$ that belongs to class $k$ as
$$y_i = (0, \ldots, 0, 1, 0, \ldots, 0) \quad \text{(1 in the } k\text{th position)}$$
We require a slight change of hypothesis space
$$\mathcal{H} = \{ f(x) = \sigma(W \phi(x)) \}$$

The function $\sigma$ is called an activation function, and the most commonly used one is the soft-max function
$$\sigma(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

Notice that $\sigma$ always outputs a vector which can be interpreted as probabilities over the $K$ classes

Everything else remains the same, and we can define the empirical risk minimization problem for classification as
$$\min_{W} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$

What loss function should we use? We can always use the mean-square loss, but there is a better choice: the cross-entropy loss
$$L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
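A minimal numpy sketch of the soft-max and the cross-entropy loss (the scores and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    """Map scores z (length K) to probabilities summing to 1."""
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y_onehot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -np.sum(y_onehot * np.log(y_hat))

z = np.array([2.0, 0.5, -1.0])       # scores W @ phi(x) for K = 3 classes
y = np.array([1.0, 0.0, 0.0])        # true class is the first one (one-hot)
print(softmax(z), cross_entropy(softmax(z), y))
```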
Demo:
Applications of Linear Models
A Cautionary Note on Correlation

More examples at https://www.tylervigen.com/spurious-correlations


Summary
1. Machine learning vs AI
2. Types of Learning Problems
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3. Linear models as a baseline for supervised learning
Useful Tools
Version control with Git
• https://www.freecodecamp.org/news/what-is-git-and-how-to-use-it-c341b049ae61/
Interactive python with Jupyter notebooks
• https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook
Data visualization using Seaborn and Pandas
• https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
Further Reading
Matrix Cookbook
• https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
More on linear models (Pattern Recognition and Machine Learning, Bishop)
• http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
