MachineLearning PDF
www.qunaieer.com
Lecture Format
• Explanation of the basic terms and concepts
• Terminology and most slides are in English; the explanation is in Arabic
• Detailed coverage of the fundamentals
• Quick pointers to well-known algorithms
• Simple practical examples
• A suggested learning plan
What is Machine Learning?
• A field that enables the computer to learn on its own instead of being programmed in detail
• Capturing the essence of data by building models, and making decisions and future predictions based on them
Regression
• Regression analysis is a statistical process for estimating the relationships among variables
• Used to predict continuous outcomes
Regression Examples
Linear Regression
[Figure: price (y) vs. square meters (x) with a fitted line]
Line equation: y = b + ax, where b is the intercept and a is the slope
Example: with w0 = 50, w1 = 1.8, and x = 500, the prediction is ŷ = 50 + 1.8 × 500 = 950
Linear Regression
How to quantify the error?
[Figure: fitted line over price vs. square-meter data]
Linear Regression
Residual Sum of Squares (RSS)
RSS(w0, w1) = Σ_{i=1}^{N} (ŷ_i − y_i)²
where ŷ_i = w0 + w1 x_i
Cost function
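The RSS cost above is easy to compute directly; a minimal NumPy sketch (the data points are made up so they lie exactly on the example line y = 50 + 1.8x):

```python
import numpy as np

# Toy data (illustrative): house sizes in square meters and prices
x = np.array([100.0, 200.0, 300.0])
y = np.array([230.0, 410.0, 590.0])

def rss(w0, w1, x, y):
    """Residual sum of squares for the line y_hat = w0 + w1 * x."""
    y_hat = w0 + w1 * x
    return np.sum((y_hat - y) ** 2)

print(rss(50.0, 1.8, x, y))  # 0.0: this line fits the toy data exactly
print(rss(0.0, 1.8, x, y))   # a worse line gives a larger cost
```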
Linear Regression
How to choose the best model? Choose the w0 and w1 that minimize RSS
[Figure: RSS surface as a function of w0 and w1]
Optimization
w_{t+1} = w_t − η ∇RSS(w_t)
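The update rule can be sketched for the two-parameter line. This is an illustration with an arbitrary learning rate and toy data, not production code:

```python
import numpy as np

# Toy data lying exactly on y = 50 + 1.8x (illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 50.0 + 1.8 * x

w0, w1 = 0.0, 0.0   # initial weights
eta = 0.01          # learning rate (must be chosen by hand)

for _ in range(20000):
    y_hat = w0 + w1 * x
    # Gradients of RSS with respect to w0 and w1
    grad_w0 = 2.0 * np.sum(y_hat - y)
    grad_w1 = 2.0 * np.sum((y_hat - y) * x)
    # The update rule: w <- w - eta * gradient
    w0 -= eta * grad_w0
    w1 -= eta * grad_w1

print(round(w0, 2), round(w1, 2))  # approaches 50, 1.8
```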
Linear Regression: Multiple features
• Example: for house pricing, in addition to size in square meters, we
can use city, location, number of rooms, number of bathrooms, etc
• The model/hypothesis becomes a weighted sum over all features, ŷ = wᵀx
• Analytical (normal-equation) solution: w = (XᵀX)⁻¹ Xᵀ y
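The normal equation can be evaluated with NumPy; a sketch, assuming a design matrix X whose first column is ones for the intercept:

```python
import numpy as np

# Toy data on y = 50 + 1.8x (illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 50.0 + 1.8 * x

# Design matrix with a bias column of ones
X = np.column_stack([np.ones_like(x), x])

# Normal equation: w = (X^T X)^{-1} X^T y
# (solve() is preferred over explicitly inverting X^T X)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # ≈ [50.0, 1.8]
```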
Analytical vs. Gradient Descent
• Gradient descent: must select parameter 𝜂𝜂
• Analytical solution: no parameter selection
Classification Examples
Logistic Regression
• How to turn a regression problem into a classification one?
• y = 0 or 1
• Map values to the [0, 1] range using
g(x) = 1 / (1 + e^(−x))
Sigmoid/Logistic Function
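A one-line implementation of the sigmoid:

```python
import math

def sigmoid(x):
    """Logistic function g(x) = 1 / (1 + e^(-x)); squashes any real x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```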
Logistic Regression
• Model (sigmoid/logistic function)
h_w(x) = g(wᵀx) = 1 / (1 + e^(−wᵀx))
• Interpretation (probability)
h_w(x) = p(y = 1 | x; w)
if h_w(x) ≥ 0.5 ⇒ y = 1
if h_w(x) < 0.5 ⇒ y = 0
Logistic Regression
[Figure: decision boundary separating two classes in the (x1, x2) plane]
Logistic Regression
• Cost function
Line - Plane - Hyperplane
Demo
• Scikit-learn library’s logistic regression
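The demo code itself is not in the slides; a minimal version along these lines (synthetic, well-separated data, default settings) might look like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated synthetic clusters (illustrative data)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [0, 0], rng.randn(50, 2) + [4, 4]])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0, 0], [4, 4]]))  # one point from each cluster center
print(clf.predict_proba([[2, 2]]))    # class probabilities near the boundary
```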
Other Classification Algorithms
Neural Networks
Decision Trees
[Figure: a decision tree splitting on Credit? (excellent / fair / poor) and loan term (3 year / 5 year), with Risky / Safe leaf nodes]
K Nearest Neighbors (KNN)
[Figure: a query point and its 5 nearest neighbors in the (x1, x2) plane]
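A quick scikit-learn illustration of KNN (synthetic data; k = 5 as in the figure):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two small synthetic groups (illustrative)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + [3, 3]])
y = np.array([0] * 20 + [1] * 20)

# Each prediction is a majority vote among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[0, 0], [3, 3]]))
```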
Clustering
Clustering
• Unsupervised learning
• Group similar items into clusters
• K-Means algorithm
Clustering Examples
K-Means
[Figure sequence: K-Means iterations in the (x1, x2) plane; centroids are placed, points are assigned to the nearest centroid, centroids are recomputed, and the process repeats until assignments stabilize]
K-Means Algorithm
• Select the number of clusters K (number of centroids μ1, …, μK)
• Given a dataset of size N
• for each iteration t:
  • for i = 1 to N:
    • c_i := the cluster whose centroid has the smallest Euclidean distance to sample x_i
  • for k = 1 to K:
    • μ_k := mean of the points assigned to cluster k
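The loop above can be sketched directly in NumPy. This is a bare-bones illustration: it initializes centroids naively from the first K points, while a real implementation would use random or k-means++ initialization, convergence checks, and empty-cluster handling:

```python
import numpy as np

def kmeans(X, K, iters=10):
    """Plain K-Means: assign each point to its nearest centroid, then recompute means."""
    centroids = X[:K].copy()  # naive init: the first K points
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each sample
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

# Two obvious clusters (illustrative)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(X, K=2)
print(labels)  # [0 0 0 1 1 1]
```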
Demo
• Scikit-learn library’s k-means
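As with the earlier demo, the actual code is not in the slides; a minimal scikit-learn version might be:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious clusters (illustrative data)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])

# n_init restarts the algorithm from several initializations and keeps the best
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # same label within each cluster
print(km.cluster_centers_)
```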
Machine Learning
[Diagram: Machine Learning branches into Supervised and Unsupervised learning]
• Probabilistic models
• Ensemble methods
• Reinforcement Learning
• Recommendation algorithms (e.g., Matrix Factorization)
• Deep Learning
Linear vs Non-linear
[Figures: a non-linear decision boundary in the (x1, x2) plane, and a non-linear curve of price vs. square meters]
Multi-layer Neural Networks
Support Vector Machines (kernel trick)
Kernel Trick: K(x1, x2) = [x1, x2, x1² + x2²]
• k-fold cross-validation
• If dataset is very small
• Leave-one-out
• Fine-tuning hyper-parameters
• Automated hyper-parameter selection
• Using validation set
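In scikit-learn, k-fold cross-validation is a single call; a small sketch on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, validate on the held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)        # one accuracy score per fold
print(scores.mean())
```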
Performance Measures
• Depending on the problem
• Some of the well-known measures are:
• Classification measures
• Accuracy
• Confusion matrix and related measures
• Regression
• Mean Squared Error
• R2 metric
• Measuring clustering performance is not straightforward, and will not be discussed here
Performance Measures: Accuracy
Accuracy = correct predictions / all predictions
• If we have 100 people and one of them has cancer, what is the accuracy if we classify all of them as not having cancer? (99/100 = 99%)
• Accuracy is not a good measure for a heavily imbalanced class distribution
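The cancer example can be checked in a few lines:

```python
# The 100-person example: 99 healthy (0), 1 with cancer (1)
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # classify everyone as healthy

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.99: looks excellent, yet the one cancer case is missed
```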
Performance Measures: Confusion matrix
[Table: confusion matrix of predicted class vs. actual class (Positive / Negative)]
Underfitting (High Bias)
[Figures: overly simple models, a straight-line fit on price (×1000) vs. square meters, and a linear boundary on (x1, x2) data]
Training vs. Testing Errors
• Accuracy on the training set is not representative of model performance
• We need to calculate the accuracy on the test set (new, unseen examples)
• The goal is a model that generalizes to unseen data
Bias and variance trade-off
[Figures: learning curves of training and validation error vs. number of training samples, for a high-bias model and a high-variance model; the optimum is low training and validation error]
Regularization
• To prevent overfitting
• Decrease the complexity of the model
• Example of regularized regression model (Ridge
Regression)
RSS(w0, w1) = Σ_{i=1}^{N} (ŷ_i − y_i)² + λ Σ_{j=1}^{k} w_j², k = number of weights
• λ is a very important hyper-parameter
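Scikit-learn's Ridge minimizes this penalized cost (its alpha parameter plays the role of λ); a brief sketch on noisy toy data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Noisy line y ≈ 50 + 1.8x (illustrative data)
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 30)
y = 50.0 + 1.8 * x + rng.randn(30)

X = x.reshape(-1, 1)
for alpha in [0.0, 1.0, 100.0]:  # alpha is λ; larger values shrink the weights
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_[0], model.intercept_)
```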
Debugging a Learning Algorithm
• From “Machine Learning” course on coursera.org, by Andrew Ng