BC2406 Week 2

BC2406 Analytics I
Visual and Predictive Techniques
Unit 2
Fundamental Concepts and Principles of Analytics
Based on Chew C. H. (2019) textbook: Analytics, Data Science and AI. Vol 1., Chap 2.
Seminar Objectives
• Learn how industry conducts Analytics projects.
• Learn fundamental, critical concepts that distinguish Analytics as a field of study (esp. vs Statistics).
• Learn some common misconceptions and avoid them in your analytics work.
• Learn basic use of R and R scripting.
Based on Neumann (2019) textbook: Analytics, Data Science and AI. Vol 1.
Role of Visualization and Models
[Figure: Visualization, a plot of Demand (y-axis) against Price (x-axis)]
Analytics Model: $\mathit{Demand} = \mathit{Price}^2 - 14 \times \mathit{Price} + 70$
Visualization
• Easier to understand than analytics models
• Communicates and explains results in an easier way
• Requires minimal training
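As a rough illustration of how the model and the visualization relate, the demand equation from this slide can be evaluated and plotted in R. The function name `demand` and the price range 0 to 14 are assumptions made for this sketch; they are not given on the slide.

```r
# Analytics model from the slide: Demand = Price^2 - 14 * Price + 70
demand <- function(price) price^2 - 14 * price + 70

# Assumed price range, for illustration only
prices <- 0:14
demand_values <- demand(prices)

# Visualization: the same relationship shown as a chart
plot(prices, demand_values, type = "b",
     xlab = "Price", ylab = "Demand",
     main = "Demand vs Price")

# The model gives an exact number for any price, e.g. Price = 7
demand(7)
```

The chart shows the overall shape at a glance, while the model returns a specific value for any chosen price.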
[Figure: diagram relating Var1, Var2 and Var3]
Price, Shipping Fee, Consumer Reviews, Discount Rate, Return Policy, Competitors, etc. affect Product Sales at Qoo10.
Data Exploration: Statistics, Graphs
Model Development: Model Structure, Model Evaluation
Results Communication: Charts, Dashboard
Classification of Problems in Analytics
• Supervised Learning: Predicting Y problems. Y exists.
▪ Classification: Y is categorical, e.g., Logistic Regression
▪ Estimation: Y is continuous, e.g., Linear Regression
• Unsupervised Learning: Finding general patterns. Y does not exist, and no feedback.
• Reinforcement Learning: Y does not exist but can obtain feedback.
Feedback example:
Actual Y | Predicted Y | Feedback
1 | 1 | Right (actual = predicted)
0 | 1 | Wrong (actual ≠ predicted)
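A minimal sketch of the two supervised learning cases in R. It uses `mtcars`, a dataset that ships with R, purely as a stand-in; the choice of variables (`mpg`, `am`, `wt`) is an assumption for illustration and is not from the slides.

```r
# mtcars ships with R; it is used here only as an illustration
data(mtcars)

# Estimation: Y (mpg) is continuous -> Linear Regression
m_linear <- lm(mpg ~ wt, data = mtcars)

# Classification: Y (am: 0 = automatic, 1 = manual) is categorical -> Logistic Regression
m_logistic <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted mpg for a car with wt = 3 (wt is in units of 1000 lbs)
predict(m_linear, newdata = data.frame(wt = 3))

# Predicted probability that the same car has a manual transmission
predict(m_logistic, newdata = data.frame(wt = 3), type = "response")
```

In both cases Y exists in the data, which is what makes the problem supervised.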
Supervised Learning
[Figure: two sets of 20 images are used to train a model; the model then predicts "Dog or Cat?" for a new image.]
Unsupervised Learning
[Figure: 3 customer segments]
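A hedged sketch of finding 3 customer segments without any Y variable. The slide does not name a technique; k-means (`kmeans()` in base R) is swapped in here as one common choice, and the customer data below is simulated, not real.

```r
set.seed(1)  # make the simulated example reproducible

# Hypothetical customer data: three groups with different spend and visit levels
customers <- data.frame(
  spend  = c(rnorm(20, mean = 20, sd = 3),
             rnorm(20, mean = 60, sd = 3),
             rnorm(20, mean = 100, sd = 3)),
  visits = c(rnorm(20, mean = 2, sd = 0.5),
             rnorm(20, mean = 8, sd = 0.5),
             rnorm(20, mean = 15, sd = 0.5))
)

# No Y is supplied; k-means only looks for 3 groups of similar customers
segments <- kmeans(customers, centers = 3, nstart = 10)

# Number of customers assigned to each segment
table(segments$cluster)
```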
Reinforcement Learning
Supervised Learning
• Too many situations to be modeled
• Too many datasets are required to train the model
Reinforcement Learning
• Set basic driving rules and let the model drive a car
• Give penalties (rewards) for bad (good) driving results
Models* on Explainability Scale
[Figure: scale running from Simple (highest explainability power, White Box) to Complex (lowest explainability power, Black Box), showing Linear Regression, Logistic Regression, Quantile Regression, Neural Network and Deep Learning, with CART and MARS also placed on the scale.]
*: Selected list of models (non-exhaustive) on the Explainability Scale.
Is my model correct? – The Correct Model Fallacy
PRINCIPLE 1: MANY CORRECT MODELS
Example: Baby Data
• Task: Build a Linear Regression Model to predict Weight (Y).

• You can use any input variables Xi
Disclaimer: Data is not real and meant for illustration only.
Example: Baby Data
• Any of the above models is fine.
• More than one correct model.
• Unlearn previous mathematical education.
• Is my model good enough? How to judge?
Model Predictive Accuracy for Continuous Y
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$
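A minimal sketch of the RMSE calculation in R, using the standard definition (root of the mean squared difference between predicted and actual Y). The values and the helper name `rmse` are hypothetical, chosen only to illustrate the arithmetic.

```r
# Hypothetical actual and predicted values of a continuous Y
actual    <- c(3.0, 3.5, 4.0, 4.5)
predicted <- c(3.2, 3.4, 4.3, 4.1)

# RMSE: square the errors, average them, then take the square root
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

rmse(predicted, actual)
```

A lower RMSE means the predictions are, on average, closer to the actual values.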
Model Predictive Accuracy for Categorical Y
Confusion Matrix
                               Actual: Not Fraud   Actual: Fraud
Model Prediction: Not Fraud           10                17
Model Prediction: Fraud                3                20
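One way to turn the confusion matrix above into a single number is overall accuracy (correct predictions divided by all cases). The slide does not say which measure is used, so accuracy is an assumption in this sketch.

```r
# Confusion matrix from the slide: rows = model prediction, columns = actual
cm <- matrix(c(10, 3, 17, 20), nrow = 2,
             dimnames = list(Prediction = c("Not Fraud", "Fraud"),
                             Actual     = c("Not Fraud", "Fraud")))
cm

# Correct predictions sit on the diagonal: 10 (Not Fraud) and 20 (Fraud)
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 30 correct out of 50 cases = 0.6
```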
Is my model good enough?
• Based on Predictive Accuracy. Better than target?
• Setting target for predictive accuracy:
▪ historical predictive performance in your company
▪ best benchmark achieved in recent research papers
▪ finding out the desired business impact, and reverse engineer
the required target
▪ comparing against a standard model (Linear Regression or
Logistic Regression)
▪ what the boss wants [may or may not be realistic]
• Based on use requirements.
▪ A&E doctors and nurses want a simple model that they can use without a statistics background.
Zero Prediction Error! The Perfect Model Fallacy.
PRINCIPLE 2: TRAIN TEST SPLIT
Zero Prediction Error = Excellent?
[Figure: Previous Exam Questions (Data) are used both to train and to test Knowledge (Model).]
[Figure: Previous Exam Questions (Data) are divided into a Train Set and a Test Set that are mutually exclusive; the Train Set builds Knowledge (Model) and the Test Set tests it.]
Zero Prediction Error?
• Too good to be true
• Typically obtained on the trainset
▪ Implication of the Polynomial Interpolation Theorem
▪ Neural Networks are infamous for this
▪ Biased and over-optimistic
• Do not be misled
▪ Predict Historical Data, or
▪ Predict Future Data
• Model Predictive Accuracy
▪ How to decide if model is good enough to be used, if not based
on historical data?
▪ What would be a fair, unbiased estimate of error?
Train – Test Split Procedure (Basic)
Complete Dataset: columns X1, X2, …, Xm, Y
(1) Split the dataset randomly into train/test set: 70% Train Set, 30% Test Set
(2) Train model using train set
(3) Test model by comparing model predicted Y vs actual Y in test set
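The three steps of the basic train-test split can be sketched in base R as follows. The dataset (`mtcars`, which ships with R), the model `mpg ~ wt` and the seed are assumptions for illustration only.

```r
set.seed(2406)  # assumed seed so the random split is reproducible
data(mtcars)

# (1) Split the dataset randomly into 70% train / 30% test
n <- nrow(mtcars)
train_rows <- sample(n, size = round(0.7 * n))
trainset <- mtcars[train_rows, ]
testset  <- mtcars[-train_rows, ]

# (2) Train model using the train set only
m <- lm(mpg ~ wt, data = trainset)

# (3) Test model: compare model predicted Y vs actual Y in the test set
predicted <- predict(m, newdata = testset)
test_rmse <- sqrt(mean((predicted - testset$mpg)^2))
test_rmse
```

Because the test set rows were never seen during training, `test_rmse` is a fairer estimate of prediction error than the error on the train set.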
Complete Dataset: columns X1, X2, …, Xm, Y
(1) Split the dataset randomly into train/test set: 90% Train Set, 10% Test Set
[Figure labels: Better (model from the larger train set), Less Reliable (result from the smaller test set)]
Complete Dataset: columns X1, X2, …, Xm, Y
(1) Split the dataset randomly into train/test set: 50% Train Set, 50% Test Set
[Figure labels: Worse (model from the smaller train set), More Reliable (result from the larger test set)]
Complete Dataset: columns X1, X2, …, Xm, Y
If Y is Categorical & Rare Event (e.g., Cancer)
(1) Split the dataset randomly into train/test set: 70% Train Set, 30% Test Set
[Figure: after a purely random split, the train set holds all Y=1 cases and the test set holds no Y=1 case.]
Train – Test Split Procedure (Enhanced)
Complete Dataset: columns X1, X2, …, Xm, Y
(1) Stratify on categorical Y: Y = 1 (e.g., Cancer) and Y = 0 (e.g., No Cancer)
(2) Split the dataset randomly, within each stratum, into train/test set: 70% Train Set, 30% Test Set
[Figure: both the train set and the test set contain Y = 1 and Y = 0 cases.]
Train Test Split
• Split your historical dataset into two pieces

▪ Stratify on y. [Why?]
▪ Typically, but not always, 70% Training Set
▪ Typically, but not always, 30% Test Set
• Develop your model on the Training Set
• Test your model on the Test Set
▪ Obtain unbiased estimate of model prediction error.
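A hedged base R sketch of a stratified 70/30 split. The dataset is simulated with a rare event (10 cases of Y = 1 out of 100), and all object names are hypothetical. Sampling 70% within each value of Y keeps the rare cases in both sets.

```r
set.seed(2406)  # assumed seed for reproducibility

# Hypothetical dataset with a rare event: 10 cases of Y = 1, 90 cases of Y = 0
df <- data.frame(x1 = rnorm(100), y = rep(c(1, 0), times = c(10, 90)))

# (1) Stratify on categorical Y: group the row numbers by the value of y
rows_by_y <- split(seq_len(nrow(df)), df$y)

# (2) Within each stratum, randomly pick 70% of the rows for the train set
train_rows <- unlist(lapply(rows_by_y, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}), use.names = FALSE)

trainset <- df[train_rows, ]
testset  <- df[-train_rows, ]

# Both sets keep the same share of Y = 1 cases
table(trainset$y)
table(testset$y)
```

With a purely random split, the 10 rare cases could by chance all land in one set; stratifying rules that out.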
How to select best model within the model type?
PRINCIPLE 3: COMPLEXITY ADJUSTED MODEL PERFORMANCE
Implication of Model Complexity
• Idea from Polynomial Interpolation Theorem
• Hit zero error on trainset by using very complex model
Two Data Points: First Degree Polynomial ($Y = ax + b$)
Three Data Points: Second Degree Polynomial ($Y = ax^2 + bx + c$)
Four Data Points: Third Degree Polynomial ($Y = ax^3 + bx^2 + cx + d$)
Five Data Points: Fourth Degree Polynomial ($Y = ax^4 + bx^3 + \cdots + e$)
N Data Points: (N-1)th Degree Polynomial ($Y = a_{n-1}x^{n-1} + \cdots + a_0$)
Implication of Model Complexity
• Idea from Polynomial Interpolation Theorem
• Hit zero error on trainset by using very complex model
[Figure: Y against X. A complex model passes through every train set point (train set error = 0) but misses the test set point (test set error); a simple model is shown for comparison.]
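The zero-error effect can be reproduced with a small sketch in R: five hypothetical data points fitted with a first degree and a fourth degree polynomial. The data values and the helper `train_rmse` are made up for illustration.

```r
# Hypothetical train set with five data points
x <- 1:5
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Simple model: first degree polynomial (straight line)
simple_model <- lm(y ~ x)

# Complex model: fourth degree polynomial, which can pass through all 5 points
complex_model <- lm(y ~ poly(x, 4))

# RMSE on the train set itself
train_rmse <- function(model) sqrt(mean(residuals(model)^2))

train_rmse(simple_model)   # small but not zero
train_rmse(complex_model)  # zero, up to rounding error
```

The complex model's train set error is essentially zero only because it has as many coefficients as there are data points; that says nothing about its error on a test set.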
What is Model Complexity?
Model Complexity = # Input Variables
[Figure: Overfitting Risk. Model prediction error (y-axis) against model complexity (x-axis, from Too Simple/Underfit to Too Complex/Overfit), with one curve for trainset error and one for testset error.]
Beyond a certain model complexity, testset error increases while trainset error continues decreasing.
Complexity-Adjusted Model Performance
• Find a way to balance both model prediction error and model complexity.
• Classification and Regression Tree (CART) does this automatically with an “alpha” parameter
▪ More details in Unit 8 Decision Trees.
Learning R
BASICS
Install R and RStudio
• Install R first
• Then install RStudio
• Before week 2 class.
• Bring your laptop to class
▪ To conserve battery, switch off wi-fi when
there is no need for internet access.
4 Panels work area in RStudio
Operators: R as a calculator
The assignment Operator: <-
RStudio’s keyboard shortcut: Alt and - (the minus sign)
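A short sketch of R used as a calculator and of the assignment operator. The object names `x` and `total` are arbitrary choices for this example.

```r
# R as a calculator
1 + 2    # addition
7 / 2    # division
2^3      # exponent

# Store a result in an object with the assignment operator <-
x <- 3 * 4

# Typing the object name prints its value
x

# Objects can be reused in later calculations
total <- x + 10
total
```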

Object Naming Convention in R
R is spelling and case sensitive
R functions
Type se and hit TAB. A popup shows you possible completions. Specify
seq() by typing more (a “q”) to disambiguate, or by using ↑/↓ arrows to
select. Notice the floating tooltip that pops up, reminding you of the
function’s arguments and purpose. If you want more help, press F1 to get all
the details in the help tab in the lower right pane.
Press TAB once more when you’ve selected the function you want. RStudio
will add matching opening (() and closing ()) parentheses for you. Type the
arguments 1, 10 and hit return.
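For reference, the call described above looks like this once typed out. The second call, with the `by` argument, is an extra illustration and is not part of the original walkthrough.

```r
# Sequence from 1 to 10
one_to_ten <- seq(1, 10)
one_to_ten

# Same function with a step size, using the by argument
odd_numbers <- seq(1, 10, by = 2)
odd_numbers
```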
Text String
Create Vectors with c function
A vector is a set of values that are all of the same type.
Hint: Use the class function to check on the type of object.
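A small sketch of creating vectors with `c()` and checking their type with `class()`. The object names are arbitrary.

```r
# A numeric vector and a character (text string) vector
numbers <- c(1, 2, 3)
words   <- c("apple", "banana")

class(numbers)  # "numeric"
class(words)    # "character"

# Mixing types: R converts every value to one common type
mixed <- c(1, "a")
class(mixed)    # "character"
```

The last example shows why a vector always holds one type: the number 1 is stored as the text "1".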
Resources for Learning R
• BC3407 R and Python for BA

• Textbooks
• Blogs
• Online Tutorials:
▪ Quick R: https://www.statmethods.net/
Summary
• Visualization and Models have important roles.
• Fundamental Analytics concepts.
▪ More than one correct model.
▪ The importance of Trainset vs Testset.
▪ Beware of low/zero prediction error on trainset.
• R Basics
▪ Operators, Functions, Basic Plot.
▪ Import Dataset into R.
▪ Create new Data and Dataset within R.
▪ Export processed Dataset out of R.
▪ Saving and running Rscripts.
▪ Errors are natural in programming. Just resolve them.
