
Principles of Machine Learning

DSA 5105 • Lecture 1

Soufiane Hayou
Department of Mathematics
Logistics
• Instructor
Soufiane Hayou (Department of Mathematics)
Office: S17-05-08
Email: [email protected]
• Notes, slides and code are/will be available on Canvas
• Assessment
• Homework (3 x 5%) + Project (15%) (Individual) [30%]
• Mid-term Test [20%]
• Final Exam [50%]
Logistics
Lectures
• F2F (LT28) + Live (Zoom); recordings will be uploaded
• You are encouraged to turn on your camera (if you are following on Zoom)
• Active participation is encouraged
• Consultation arrangements will be announced later
Introduction
Can machines think?
I propose to consider the question, “Can
machines think?”. This should begin with
definitions of the meaning of the terms
"machine" and "think." The definitions might
be framed so as to reflect so far as possible
the normal use of the words, but this attitude
is dangerous…

Alan Turing, 1950

Turing, A. M. (1950). "Computing machinery and intelligence". Mind 59 (236): 433–460.


The Imitation Game

I believe that in about fifty years' time it will be possible to programme computers … to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning.
Alan M Turing, 1950
https://en.wikipedia.org/wiki/Turing_test#/media/File:Turing_test_diagram.png
From artificial intelligence to machine learning

Can machines think?

Can machines do what thinking beings do?

How can machines learn to do some things that thinking beings do?

In this class, we are interested in the study of algorithmic approaches to learning.
A concrete definition of learning
A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with
experience E.
Tom Mitchell

[Illustration: computation vs. learning, using the expression 8 ÷ 2(2+2)]
• Computation: given the program calc(*args) and the input 8 ÷ 2(2+2), produce the output 16
• Learning: given the input 8 ÷ 2(2+2) and the output 16, infer the program calc(*args)
Topics
• Supervised learning
• Linear and nonlinear models
• Basic learning and approximation theory
• Learning/optimization algorithms
• Unsupervised learning
• Dimensionality reduction, clustering and generative models
• Reinforcement learning
• Markov decision processes, reinforcement learning
algorithms
What this class is
• A (hopefully) gentle introduction to the exciting world of
machine learning
• A holistic view of the modern interplay of machine
learning with mathematics, statistics, computer science,
physical sciences and engineering

What this class isn’t


• A comprehensive survey of state-of-the-art machine
learning models and methods
• A “math class”
Preliminaries
[Survey results: types of data used, math/Python background, prior ML experience, whether this is an introductory class, expectations]
Representing data in computers
Many kinds of data are numerical in nature

Other examples
• Video captures
• Financial time series
• Numerical measurements from experiments
What about general discrete data?
We make an important distinction
• Ordinal data
Data that has a natural notion of order, e.g.
• Star ratings of a product
• Level of language proficiency
• Letter grades of a class
• Nominal data
Data that has no order, e.g.
• Categories of image classification
• Answers to True/False questions
We need to embed these discrete data into something we can
represent on a computer, e.g. real/floating point numbers
The type of embedding depends on the nature of the data!
• Ordinal data
We want the embedding to preserve this ordering, so we typically use real numbers

• Nominal data
This is somewhat the opposite: we want the embedding to not introduce spurious ordering, e.g. one-hot embedding
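A minimal sketch of both embeddings in Python (the category lists below are hypothetical examples, not from the slides):

```python
import numpy as np

# Ordinal data: map ordered categories to real numbers that preserve the order
grades = ["C", "B", "A"]                    # ordered categories
grade_to_value = {g: i for i, g in enumerate(grades)}
print(grade_to_value["B"])                  # 1

# Nominal data: one-hot embedding introduces no spurious ordering
categories = ["cat", "dog", "bird"]         # unordered categories

def one_hot(label, categories):
    """Return a vector with a single 1 at the label's position."""
    vec = np.zeros(len(categories))
    vec[categories.index(label)] = 1.0
    return vec

print(one_hot("dog", categories))           # [0. 1. 0.]
```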
Classes of machine learning problems
• Supervised Learning: Regression, Classification, Function approximation, Inverse problems/design, …
• Unsupervised Learning: Clustering, Dimensionality reduction, Generative models, Anomaly detection, …
• Reinforcement Learning: Value iteration, Policy gradient, Actor-critic, Exploration, …

There are many intersections between them!


Evaluation and Selection using Data
In more quantitative terms, given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, we split it into $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}$
• $\mathcal{D}_{\text{train}}$ is called the training set, and it is used to train our machine learning model
• $\mathcal{D}_{\text{test}}$ is called the testing set, and it is used to evaluate the performance of our model. We should not peek at this set while training!
• An additional splitting into a validation set is sometimes used to perform hyper-parameter tuning and model selection
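In practice this split is often done with scikit-learn (a minimal sketch; the 80/10/10 proportions and the random seed are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5)   # toy inputs
y = np.random.randn(1000)      # toy labels

# Hold out 20% of the data, then split the held-out part half-and-half
# into validation (for model selection) and test (for final evaluation).
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```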
Supervised Learning
What is supervised learning?
Supervised learning is the simplest and most prevalent type of machine learning problem

It is about learning to make predictions

Examples
• Image recognition
• Weather prediction
• Stock price prediction
• …
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$
Inputs: $x_i$ • Outputs/labels: $y_i$ • Data size: $N$
Goal: learn the relationship $x \mapsto y$ from $\mathcal{D}$

[Illustration: an oracle maps inputs to labels, e.g. $x_1 \mapsto$ "Cat", $x_2 \mapsto$ "Dog"]

The oracle can be
• Deterministic: $y_i = f(x_i)$
• Random: $y_i \sim p(y \mid x_i)$, e.g. $y_i = f(x_i) + \text{noise}$
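A minimal sketch of sampling data from each kind of oracle (the sine oracle and the Gaussian noise are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)             # an illustrative oracle

x = rng.uniform(0, 1, size=100)
y_deterministic = f(x)                          # y_i = f(x_i)
y_random = f(x) + 0.1 * rng.normal(size=100)    # y_i = f(x_i) + noise
```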
Hypothesis space
The oracle $f$ is unknown to us, except through the dataset $\mathcal{D}$

The supervised learning approach:
1. Define a hypothesis space $\mathcal{H}$ consisting of a set of candidate functions, e.g. $\mathcal{H} = \{\hat{f}(x) = wx + b : w, b \in \mathbb{R}\}$
2. Find the "best" function $\hat{f}$ in $\mathcal{H}$ that approximates $f$

What you get depends on $\mathcal{H}$!

Curve fitting methods and the message they send. https://xkcd.com/2048/


What does best approximation mean?
It is useful to define a loss function $L(\hat{f}(x), y)$ which is small if $\hat{f}(x) \approx y$ and large otherwise. Then, we can find the best approximation by solving an optimization problem
$$\hat{f} = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$
This is called empirical risk minimization (ERM)
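The ERM objective translates directly into code (a sketch; the square loss shown is just one possible choice of $L$):

```python
import numpy as np

def empirical_risk(f, xs, ys, loss=lambda yhat, y: (yhat - y) ** 2):
    """Average loss of model f over the dataset {(x_i, y_i)}."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# ERM then means: search the hypothesis space for the f minimizing this value.
```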
So, is learning just optimization?
We want to do well on unseen data! In other words, our model
must generalize.
What we can solve: empirical risk minimization
$$\hat{f} = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$

What we really want to solve: population risk minimization
$$\tilde{f} = \operatorname{argmin}_{f \in \mathcal{H}} \mathbb{E}_{(x, y)}[L(f(x), y)]$$

The difference in performance between $\hat{f}$ and $\tilde{f}$ is the generalization gap.
Three paradigms of supervised learning

[Diagram: the oracle $f$, its best approximation $\tilde{f}$ in the hypothesis space $\mathcal{H}$, and the trained model $\hat{f}$]
• Approximation: how well can functions in $\mathcal{H}$ approximate the oracle $f$?
• Generalization: how close is $\hat{f}$ to $\tilde{f}$ in performance on unseen data?
• Optimization: how do we actually find $\hat{f}$ (using $\mathcal{D}$)?
Linear Models
Simple linear regression

This is the simplest case, where the $x_i$'s and $y_i$'s are all scalars

Step 1: Define hypothesis space
$$\mathcal{H} = \{f(x) = wx + b : w, b \in \mathbb{R}\}$$

Step 2: Find the best approximation

We need to define a loss function, e.g. the square loss $L(\hat{y}, y) = (\hat{y} - y)^2$

Then, the empirical risk minimization problem is
$$\min_{w, b} \frac{1}{N} \sum_{i=1}^{N} (w x_i + b - y_i)^2$$

Solution (the Ordinary Least Squares Formula, 1D):
$$\hat{w} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{w}\bar{x}$$
where $\bar{x}$ and $\bar{y}$ denote the sample means.
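A sketch of the 1D OLS formula in numpy, checked on synthetic data (the true slope and intercept are illustrative):

```python
import numpy as np

def ols_1d(x, y):
    """Closed-form simple linear regression: returns (w_hat, b_hat)."""
    x_bar, y_bar = x.mean(), y.mean()
    w_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b_hat = y_bar - w_hat * x_bar
    return w_hat, b_hat

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 0.5 + 0.1 * rng.normal(size=50)   # noisy line, true w=2, b=0.5
print(ols_1d(x, y))                             # approximately (2.0, 0.5)
```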
Approximation
Is the linear hypothesis space large enough?

[Figures: a linear fit that captures the trend of the data vs. a linear fit to clearly nonlinear data]
The figure on the right is an instance of under-fitting


Overfitting and generalization
Polynomial hypothesis space: $\mathcal{H}_M = \{f(x) = \sum_{m=0}^{M} w_m x^m\}$

If the hypothesis space is too big, over-fitting can happen, with or without noise!
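One way to see this empirically (a sketch using numpy's polynomial fitting; the dataset and degrees are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                     # ERM over H_M, square loss
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    # Training error shrinks as M grows, but degree 9 interpolates the
    # noise: low training error, poor generalization.
    print(degree, train_err)
```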
The role of loss functions
So far, we only considered the mean-square loss $L(\hat{y}, y) = (\hat{y} - y)^2$

There are many other choices, e.g. the Huber loss
$$L_\delta(\hat{y}, y) = \begin{cases} \frac{1}{2}(\hat{y} - y)^2 & \text{if } |\hat{y} - y| \le \delta \\ \delta\left(|\hat{y} - y| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
Mean square vs Huber loss in regression
We perform a linear regression on a noisy dataset with outliers.
What do you observe?
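A sketch of such a comparison with scikit-learn's LinearRegression (mean-square loss) and HuberRegressor (Huber loss); the outlier construction is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (100, 1))
y = 2.0 * X.ravel() + 0.1 * rng.normal(size=100)
y[:5] += 10.0                                    # inject a few large outliers

ols = LinearRegression().fit(X, y)               # mean-square loss
huber = HuberRegressor().fit(X, y)               # Huber loss
print(ols.coef_, huber.coef_)                    # the Huber fit stays closer to slope 2
```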
General linear basis models
The simple linear regression we have seen is quite limited
• Only for 1D inputs
• Can only fit linear relationships

It turns out that we can easily generalize the previous approach by considering linear basis models
General linear basis models
Consider basis functions $\phi_0, \ldots, \phi_{M-1}$ and the new hypothesis space
$$\mathcal{H} = \left\{ f(x) = \sum_{m=0}^{M-1} w_m \phi_m(x) \right\}$$

Each $\phi_m$ is called a basis function or feature map

Why is this a generalization?
• Take $M = 2$ with $\phi_0(x) = 1$ and $\phi_1(x) = x$ to recover simple linear regression
• In general, $M$ can be large and the $\phi_m$'s can be highly nonlinear, but $f$ is linear in $w$
Examples of basis functions
Some choices of basis functions in 1D
• Polynomial basis: $\phi_m(x) = x^m$
• Gaussian basis: $\phi_m(x) = \exp\left(-\frac{(x - \mu_m)^2}{2s^2}\right)$
• Sigmoid basis: $\phi_m(x) = \sigma\left(\frac{x - \mu_m}{s}\right)$ with $\sigma(a) = \frac{1}{1 + e^{-a}}$
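These basis functions are easy to write down in code (a sketch; the centres $\mu_m$ and scale $s$ are illustrative choices):

```python
import numpy as np

mus = np.linspace(0, 1, 9)          # basis-function centres (assumed placement)
s = 0.1                             # common scale

polynomial = lambda x, m: x ** m
gaussian   = lambda x, m: np.exp(-(x - mus[m]) ** 2 / (2 * s ** 2))
sigmoid    = lambda x, m: 1.0 / (1.0 + np.exp(-(x - mus[m]) / s))

def design_matrix(x, basis, M):
    """N x M matrix Phi with Phi[i, m] = phi_m(x_i)."""
    return np.column_stack([basis(x, m) for m in range(M)])
```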
Ordinary least squares for linear basis models

The empirical risk minimization problem is now
$$\min_{w} \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{m=0}^{M-1} w_m \phi_m(x_i) - y_i \right)^2$$

We can rewrite it in compact form
$$\min_{w} \frac{1}{N} \|\Phi w - y\|^2$$
where
$$\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \ddots & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}, \quad w = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_{M-1} \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$$

We want to solve
$$\min_{w} \|\Phi w - y\|^2$$

We can do this by setting the gradient to zero: $\nabla_w \|\Phi w - y\|^2 = 2\Phi^\top(\Phi w - y) = 0$

Suppose $\Phi^\top \Phi$ is invertible; then we have $\Phi^\top \Phi w = \Phi^\top y$

Rearranging, we have the General Ordinary Least Squares Formula
$$\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
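A sketch of the general OLS formula in numpy: solving the normal equations directly, then checking against np.linalg.lstsq (which is preferred numerically):

```python
import numpy as np

def ols(Phi, y):
    """Solve the normal equations Phi^T Phi w = Phi^T y for the OLS weights."""
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Example: degree-3 polynomial features on toy data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)
Phi = np.column_stack([x ** m for m in range(4)])

w_hat = ols(Phi, y)
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(w_hat, w_lstsq))   # True
```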

What happens if $\Phi^\top \Phi$ is not invertible, i.e. it is singular?

In the singular case, we have an infinite number of solutions, all of which attain the minimum of $\|\Phi w - y\|^2$. They are given by
$$\hat{w} = \Phi^{+} y + (I - \Phi^{+}\Phi)\, z, \quad z \text{ arbitrary}$$
Here, $\Phi^{+}$ denotes the Moore-Penrose pseudoinverse of $\Phi$.

How do we pick a solution?
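numpy's pinv computes the Moore-Penrose pseudoinverse, and $\hat{w} = \Phi^{+} y$ (the $z = 0$ choice) is the minimum-norm solution. A sketch on a deliberately singular design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
# Duplicated feature columns make Phi^T Phi singular
Phi = np.column_stack([np.ones(30), x, x])
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=30)

w_min_norm = np.linalg.pinv(Phi) @ y   # minimum-norm least-squares solution
print(w_min_norm)                      # the weight for x is shared across the two copies
```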


Regularization
Often, it is advantageous to consider the regularized least squares problem
$$\min_{w} \frac{1}{N}\|\Phi w - y\|^2 + \lambda R(w)$$
where $R(w)$ is the regularizer and $\lambda > 0$ controls its strength

Types of regularization
• $\ell_2$ regularization: $R(w) = \|w\|_2^2$ (ridge regression)
• $\ell_1$ regularization: $R(w) = \|w\|_1$ (least absolute shrinkage and selection operator, or lasso)
• …
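For ridge regression, the same gradient calculation gives the closed form $\hat{w}_\lambda = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$ (absorbing the $1/N$ factor into $\lambda$). A sketch:

```python
import numpy as np

def ridge(Phi, y, lam):
    """Closed-form ridge regression: (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
Phi = np.column_stack([x ** m for m in range(10)])   # rich basis, prone to over-fit
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

print(ridge(Phi, y, lam=1e-3))   # the weights shrink as lam grows
```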
Regularization and generalization
We apply $\ell_2$ regularization on the over-fitting examples

Recall: for $\lambda > 0$ the ridge solution $\hat{w}_\lambda = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top y$ always exists and is unique, but larger $\lambda$ shrinks the weights toward zero

[Figures: polynomial fits without regularization vs. with regularization]


Classification using linear basis models

In $K$-class classification problems, each $y_i$ takes on the class label of one of $K$ classes.

We will use the one-hot encoding introduced earlier to represent each $y_i$ that belongs to class $k$ as
$$y_i = (0, \ldots, 0, 1, 0, \ldots, 0) \quad \text{(1 in the } k\text{th position)}$$
We require a slight change of hypothesis space
$$\mathcal{H} = \{ f(x) = \sigma(W \phi(x)) \}$$

The function $\sigma$ is called an activation function, and the most commonly used one is the soft-max function
$$\sigma(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

Notice that $\sigma$ always outputs a vector which can be interpreted as probabilities over the $K$ classes

Everything else remains the same, and we can define the empirical risk minimization problem for classification as
$$\min_{W} \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$$

What loss function should we use? We can always use the mean-square loss, but there is a better choice: the cross-entropy loss
$$L(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
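A minimal numpy sketch of the soft-max and the cross-entropy loss (the scores and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    """Map scores z (length K) to probabilities summing to 1."""
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y_onehot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -np.sum(y_onehot * np.log(y_hat))

z = np.array([2.0, 0.5, -1.0])       # scores W @ phi(x) for K = 3 classes
y = np.array([1.0, 0.0, 0.0])        # true class is the first one (one-hot)
print(softmax(z), cross_entropy(softmax(z), y))
```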
Demo:
Applications of Linear Models
A Cautionary Note on Correlation

More examples at https://www.tylervigen.com/spurious-correlations


Summary
1. Machine learning vs AI
2. Types of Learning Problems
• Supervised learning
• Unsupervised learning
• Reinforcement learning
3. Linear models as a baseline for supervised learning
Useful Tools
Version control with Git
• https://www.freecodecamp.org/news/what-is-git-and-how-to-use-it-c341b049ae61/
Interactive python with Jupyter notebooks
• https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook
Data visualization using Seaborn and Pandas
• https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
Further Reading
Matrix Cookbook
• https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
More on linear models (Pattern Recognition and Machine Learning, Bishop)
• http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf
