BC2406 Week 2
Unit 2
Based on Chew C. H. (2019) textbook: Analytics, Data Science and AI. Vol 1., Chap 2.
Seminar Objectives
Based on Neumann (2019) textbook: Analytics, Data Science and AI. Vol 1.
Role of Visualization and Models
[Figure: Demand against Price, shown as a Visualization and as an Analytics Model]
Visualization
• Used to get easier understanding than analytics models
• Used to communicate/explain the results in an easier way
• Requires minimal training
Role of Visualization and Models
[Figure: diagram relating Var1, Var2 and Var3; label: Affect]
Role of Visualization and Models
[Figure: Data Exploration, Model Development, Results Communication]
Classification of Problems in Analytics
Actual Y    Predicted Y    Feedback
1           = 1            Right
0           ≠ 1            Wrong

[Figure: tree of Problems with node labels Prediction, Predicting Y, and Finding General Patterns]
• Predicting Y: Y exists.
▪ Classification: Y is categorical, e.g., Logistic Regression
▪ Estimation: Y is continuous, e.g., Linear Regression
• Y does not exist but can obtain feedback.
• Finding General Patterns: Y does not exist, and no feedback.
Classification of Problems in Analytics
Supervised Learning
[Figure: two sets of 20 images are used to train a model, which then predicts whether a new image is a dog or a cat]
Classification of Problems in Analytics
Unsupervised Learning
[Figure: customers grouped into 3 customer segments]
Classification of Problems in Analytics
Reinforcement Learning
• Set basic driving rules and let the model drive a car
• Give penalties (rewards) for bad (good) driving results
Models* on Explainability Scale
[Figure: scale running from Simple models (highest explainability power, White Box) to Complex models (lowest explainability power, Black Box)]
Is my model correct? – The Correct Model Fallacy
PRINCIPLE 1: MANY CORRECT MODELS
Example: Baby Data
Model Predictive Accuracy for Continuous Y
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$
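A minimal sketch of the RMSE calculation in R, assuming the standard root-mean-square-error definition. The function name `rmse` and the four data points are made up for illustration; they are not from the textbook.

```r
# RMSE: square the prediction errors, average them, take the square root.
# `rmse`, `actual` and `predicted` are hypothetical names for this sketch.
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((predicted - actual)^2))
}

actual    <- c(3, 5, 7, 9)    # made-up actual Y values
predicted <- c(2, 5, 8, 11)   # made-up model-predicted Y values

# Errors are -1, 0, 1, 2, so the mean squared error is 6 / 4 = 1.5
rmse(actual, predicted)       # sqrt(1.5), about 1.22
```

RMSE is in the same units as Y, so a value of about 1.22 here means predictions are off by roughly 1.22 units of Y on average.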
Model Predictive Accuracy for Categorical Y
Confusion Matrix
                               Actual
                               Not Fraud   Fraud
Model Prediction   Not Fraud   10          17
Model Prediction   Fraud       3           20
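The confusion matrix above can be entered in R and summarised as a rough sketch. Overall accuracy (correct predictions divided by all cases) is one common summary; the slide does not say which summary the course uses, so treat that choice as an assumption.

```r
# Rows = model prediction, columns = actual, using the counts on the slide.
# matrix() fills column by column, so the first column is "Actual Not Fraud".
cm <- matrix(c(10, 3, 17, 20), nrow = 2,
             dimnames = list(Prediction = c("Not Fraud", "Fraud"),
                             Actual     = c("Not Fraud", "Fraud")))

# Correct predictions sit on the diagonal: 10 + 20 = 30 out of 50 cases
accuracy <- sum(diag(cm)) / sum(cm)        # 0.6

# Actual fraud cases that the model predicted as Not Fraud
missed_fraud <- cm["Not Fraud", "Fraud"]   # 17

accuracy
missed_fraud
```

Accuracy alone can hide the 17 missed fraud cases, which is why the full matrix is worth inspecting.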
Is my model good enough?
• Based on predictive accuracy: is it better than the target?
• Setting a target for predictive accuracy:
▪ historical predictive performance in your company
▪ best benchmark achieved in recent research papers
▪ finding out the desired business impact, then reverse engineering the required target
▪ comparing against a standard model (Linear Regression or Logistic Regression)
▪ what the boss wants [may or may not be realistic]
• Based on use requirements.
▪ A&E doctors and nurses want a simple model that they can use without a statistics background.
Zero Prediction Error! The Perfect Model Fallacy.
Zero Prediction Error = Excellent?
[Figure: exam analogy. Previous Exam Questions (Data) are used to train Knowledge (Model), which is then tested.]
[Figure: the Train Set of Previous Exam Questions (Data) builds Knowledge (Model); the train and test sets are mutually exclusive.]
Zero Prediction Error?
• Too good to be true
• Typically obtained on the trainset
▪ An implication of the Polynomial Interpolation Theorem
▪ Neural networks are infamous for this
▪ Biased and over-optimistic
• Do not be misled. Is the goal to:
▪ Predict historical data, or
▪ Predict future data?
• Model Predictive Accuracy
▪ How to decide if a model is good enough to be used, if not based on historical data?
▪ What would be a fair, unbiased estimate of error?
Train – Test Split Procedure (Basic)
[Figure: complete dataset with columns X1, X2, …, Xm, Y, split into 70% Train and 30% Test]
(1) Split the dataset randomly into train/test set.
(2) Train model using train set.
(3) Test model by comparing model predicted Y vs actual Y in test set.
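The basic train-test split procedure can be sketched in base R as follows. The dataset, the column names `x1`, `x2`, `y`, the seed, and the choice of linear regression are all assumptions made for illustration; the slides do not prescribe specific functions.

```r
set.seed(2406)   # arbitrary seed so the random split is reproducible

# Hypothetical complete dataset with two X variables and a continuous Y
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 3 + 2 * dat$x1 - dat$x2 + rnorm(100)

# (1) Split the dataset randomly: 70% train, 30% test
n <- nrow(dat)
train_idx <- sample(n, size = round(0.7 * n))
trainset <- dat[train_idx, ]
testset  <- dat[-train_idx, ]

# (2) Train the model using the train set only
fit <- lm(y ~ x1 + x2, data = trainset)

# (3) Test the model: compare predicted Y vs actual Y in the test set
pred <- predict(fit, newdata = testset)
test_rmse <- sqrt(mean((pred - testset$y)^2))
test_rmse
```

Because the test rows were never seen during training, `test_rmse` gives a fairer estimate of error on new data than the error measured on the train set.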
[Figure: the same random split with other proportions: 90% Train / 10% Test, and 50% Train / 50% Test]
If Y is Categorical & Rare Event (e.g., Cancer)
[Figure: complete dataset with columns X1, X2, …, Xm, Y, in which most rows have Y = 0 and few have Y = 1]
(1) Split the dataset randomly into train/test set: 70% Train, 30% Test.
(1) Stratify on categorical Y: Y = 1 (e.g., Cancer) and Y = 0 (e.g., No Cancer).
(2) Split the dataset randomly, within each stratum, into train/test set: 70% Train, 30% Test.
[Figure: resulting Train Set and Test Set, each with columns X1, X2, …, Xm, Y and each containing both Y = 1 and Y = 0 rows]
(3) Train model using train set.
(4) Test model by comparing model predicted Y vs actual Y in test set.
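A stratified split can be sketched in base R by sampling 70% of the rows within each value of Y. The dataset (200 rows with 10% rare events), the column names, and the seed are hypothetical; dedicated packages offer helpers for this, but none is assumed here.

```r
set.seed(2406)   # arbitrary seed for reproducibility

# Hypothetical dataset: 20 rare-event rows (Y = 1) and 180 other rows (Y = 0)
dat <- data.frame(x1 = rnorm(200), y = c(rep(1, 20), rep(0, 180)))

# (1) Stratify on categorical Y: row numbers grouped by the value of Y
strata <- split(seq_len(nrow(dat)), dat$y)

# (2) Within each stratum, randomly pick 70% of the rows for the train set
train_idx <- unlist(lapply(strata, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))
trainset <- dat[train_idx, ]
testset  <- dat[-train_idx, ]

# The proportion of Y = 1 is preserved in both sets
mean(trainset$y)   # 14 / 140 = 0.1
mean(testset$y)    # 6 / 60  = 0.1
```

A purely random split could, by chance, leave very few Y = 1 rows in the test set; stratifying keeps the rare-event rate the same in both sets.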
Train Test Split
How to select best model within the model type?
PRINCIPLE 3: COMPLEXITY ADJUSTED MODEL PERFORMANCE
Implication of Model Complexity
• Idea from Polynomial Interpolation Theorem
• Hit zero error on trainset by using very complex model
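[Figure: Y against X with train set and test set points. The Complex Model passes through every train set point (train set error = 0) but shows a test set error; a Simple Model is shown for comparison.]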
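The zero-train-error idea can be reproduced with a small sketch: a degree 5 polynomial fitted to 6 train points passes through all of them. The data values below are made up for illustration and are not the points from the slide.

```r
# Hypothetical train set (6 points) and test set (4 points)
train <- data.frame(x = c(1, 2, 4, 6, 8, 10),
                    y = c(1.2, 2.9, 3.1, 6.4, 7.0, 9.8))
test  <- data.frame(x = c(3, 5, 7, 9),
                    y = c(3.0, 4.8, 7.1, 8.7))

rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

simple_fit  <- lm(y ~ x, data = train)            # straight line
complex_fit <- lm(y ~ poly(x, 5), data = train)   # degree 5 through 6 points

# Train set error: the complex model interpolates, so its error is ~0
train_rmse_simple  <- rmse(train$y, fitted(simple_fit))
train_rmse_complex <- rmse(train$y, fitted(complex_fit))

# Test set error: the fair comparison between the two models
test_rmse_simple  <- rmse(test$y, predict(simple_fit, newdata = test))
test_rmse_complex <- rmse(test$y, predict(complex_fit, newdata = test))

c(train_simple = train_rmse_simple, train_complex = train_rmse_complex,
  test_simple = test_rmse_simple, test_complex = test_rmse_complex)
```

The complex model always wins on the train set here, which says nothing about how it does on the test set; only the test set figures show whether the extra complexity helped.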
What is Model Complexity?
Overfitting Risk
[Figure: Model Prediction Error against Model Complexity, with Trainset and Testset curves. Too Simple: Underfit. Too Complex: Overfit.]
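One way to see these curves is to fit models of increasing complexity and record both errors. The sketch below uses polynomial degree as a stand-in for model complexity, which is an assumption; the simulated data, seed, and names are hypothetical.

```r
set.seed(2406)   # arbitrary seed for reproducibility

# Hypothetical data: a linear trend plus noise
n <- 60
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 2 + 0.5 * dat$x + rnorm(n, sd = 1)

train_idx <- sample(n, size = round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# Polynomial degree stands in for model complexity in this sketch
results <- data.frame(degree = 1:8, train_rmse = NA_real_, test_rmse = NA_real_)
for (d in results$degree) {
  fit <- lm(y ~ poly(x, d), data = train)
  results$train_rmse[d] <- rmse(train$y, fitted(fit))
  results$test_rmse[d]  <- rmse(test$y, predict(fit, newdata = test))
}

print(results)
```

Trainset error can only fall (or stay flat) as the degree rises, because each model contains the previous one. Testset error carries no such guarantee, which is why model selection should look at the testset column.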
Learning R
BASICS
Install R and RStudio
• Install R first
• Then install RStudio
• Before week 2 class.
• Bring your laptop to class
▪ To conserve battery, switch off wi-fi when there is no need for internet access.
The 4-panel work area in RStudio
Operators: R as a calculator
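A few arithmetic operators typed at the R console, as a short illustration. The numbers and variable names are arbitrary.

```r
total     <- 1 + 2      # addition: 3
ratio     <- 10 / 4     # division: 2.5
power     <- 2^3        # exponent: 8
remainder <- 10 %% 3    # modulo (remainder): 1
quotient  <- 10 %/% 3   # integer division: 3

# Brackets control the order of evaluation
grouped <- (1 + 2) * 3  # 9
```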
The assignment Operator: <-
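A short sketch of assignment with `<-`. The variable names `price`, `quantity` and `revenue` are made up for illustration.

```r
price    <- 12.5               # store a value under the name `price`
quantity <- 4
revenue  <- price * quantity   # use stored values in a calculation

revenue                        # typing a name prints its value: 50
```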
R is spelling and case sensitive
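A small illustration of case sensitivity, assuming a fresh R session in which no object named `Demand` exists. The variable name is hypothetical.

```r
demand <- 100

# `demand` and `Demand` are different names to R
lower_exists <- exists("demand")   # TRUE
upper_exists <- exists("Demand")   # FALSE in a fresh session

# Typing Demand at the console would therefore stop with an
# "object not found" error, even though demand is defined.
```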
R functions
Type se and hit TAB. A popup shows you possible completions. Specify
seq() by typing more (a “q”) to disambiguate, or by using ↑/↓ arrows to
select. Notice the floating tooltip that pops up, reminding you of the
function’s arguments and purpose. If you want more help, press F1 to get all
the details in the help tab in the lower right pane.
Press TAB once more when you’ve selected the function you want. RStudio
will add matching opening (() and closing ()) parentheses for you. Type the
arguments 1, 10 and hit return.
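The call described above, plus two variations using the optional `by` and `length.out` arguments of `seq()`. The variable names are arbitrary.

```r
one_to_ten <- seq(1, 10)                  # 1 2 3 4 5 6 7 8 9 10
odd_values <- seq(1, 10, by = 2)          # 1 3 5 7 9
five_steps <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

one_to_ten
```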
Text String
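A brief sketch of working with text strings. The values are made up for illustration; `paste()`, `nchar()` and `toupper()` are base R functions.

```r
course <- "BC2406"     # double quotes create a text string
topic  <- 'Week 2'     # single quotes work the same way

label    <- paste(course, topic)   # joins with a space: "BC2406 Week 2"
n_chars  <- nchar(course)          # number of characters: 6
shouting <- toupper("analytics")   # "ANALYTICS"

label
```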
Create Vectors with c function
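A short illustration of `c()`, which combines values into a vector. The names and numbers are hypothetical.

```r
prices   <- c(10, 12, 15)       # numeric vector
products <- c("A", "B", "C")    # character vector

doubled     <- prices * 2       # operations apply to every element: 20 24 30
second      <- prices[2]        # indexing starts at 1: 12
more_prices <- c(prices, 20)    # c() can also extend a vector

mean(prices)
```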
Resources for Learning R
Summary
• Visualization and Models have important roles.
• Fundamental Analytics concepts.
▪ More than one correct model.
▪ The importance of Trainset vs Testset.
▪ Beware of low/zero prediction error on trainset.
• R Basics
▪ Operators, Functions, Basic Plot.
▪ Import Dataset into R.
▪ Create new Data and Dataset within R.
▪ Export processed Dataset out of R.
▪ Saving and running Rscripts.
▪ Errors are natural in programming. Just resolve them.