BC2406 Week 2


BC2406 Analytics I

Visual and Predictive Techniques

Unit 2

Fundamental Concepts and Principles of Analytics

Based on Chew C. H. (2019) textbook: Analytics, Data Science and AI. Vol 1., Chap 2.
Seminar Objectives

• Learn how industry conducts Analytics projects.
• Learn the fundamental, critical concepts that distinguish Analytics as a field of study (especially versus Statistics).
• Learn some common misconceptions and avoid them in your analytics work.
• Learn basic use of R and R scripting.

Based on Neumann (2019) textbook: Analytics, Data Science and AI. Vol 1.
Role of Visualization and Models
Visualization: [Figure: Demand (vertical axis) plotted against Price (horizontal axis)]

Analytics Model: $\text{Demand} = \text{Price}^2 - 14 \times \text{Price} + 70$

Visualization
• Used to get an easier understanding than analytics models give
• Used to communicate/explain the results in an easier way
• Requires minimal training
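
The contrast between the analytics model and its visualization can be tried in a few lines of R. This is a minimal sketch: the formula is the one on the slide, while the price grid from 0 to 14 and the object names are assumptions chosen only for illustration.

```r
# Analytics model from the slide: Demand = Price^2 - 14 * Price + 70
demand_model <- function(price) price^2 - 14 * price + 70

# Hypothetical price grid, chosen only for illustration
price <- seq(0, 14, by = 1)
demand <- demand_model(price)

# The table form is precise but harder to read at a glance
demand_table <- data.frame(Price = price, Demand = demand)
print(demand_table)

# The chart makes the U shape and the low point easy to see
plot(price, demand, type = "b",
     xlab = "Price", ylab = "Demand",
     main = "Demand vs Price")

# Price at which the model gives the lowest demand on this grid
lowest_price <- price[which.min(demand)]
```

On this grid the lowest demand occurs at a price of 7, which the chart shows immediately and the formula does not.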
Role of Visualization and Models

[Figure: charts with axes Var1, Var2 and Var3]

Price, Shipping Fee, Consumer Reviews, Discount Rate, Return Policy, Competitors, etc. affect Product Sales at Qoo10.
Role of Visualization and Models
• Data Exploration: Statistics, Graphs
• Model Development: Model Structure, Model Evaluation
• Results Communication: Charts, Dashboard
Classification of Problems in Analytics
• Supervised Learning: predicting Y problems. Y exists.
▪ Classification: Y is categorical (e.g., Logistic Regression)
▪ Estimation: Y is continuous (e.g., Linear Regression)
• Unsupervised Learning: finding general patterns. Y does not exist, and no feedback.
• Reinforcement Learning: Y does not exist but can obtain feedback.

Feedback on a prediction:

Actual Y | Predicted Y | Feedback
1        | 1           | Right
0        | 1           | Wrong
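
The two supervised cases can be sketched in R. This is an illustration only: it uses R's built-in mtcars data as a stand-in (not a dataset from the course), and the 0.5 cutoff for turning a probability into a class is an assumption.

```r
# Built-in mtcars data, used only as a stand-in example
data(mtcars)

# Estimation: Y (mpg) is continuous, so fit a Linear Regression
est_model <- lm(mpg ~ wt, data = mtcars)

# Classification: Y (am, 0 = automatic, 1 = manual) is categorical,
# so fit a Logistic Regression
cls_model <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of am = 1, turned into a class with a 0.5 cutoff
prob <- predict(cls_model, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)

# Feedback: compare actual Y with predicted Y, case by case
feedback <- ifelse(pred_class == mtcars$am, "Right", "Wrong")
table(feedback)
```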
Classification of Problems in Analytics
Supervised Learning

[Figure: two sets of 20 images are used to train a model; given a new image, the model predicts "Dog or Cat?"]
Classification of Problems in Analytics

Unsupervised Learning

[Figure: customers grouped into 3 customer segments]
Classification of Problems in Analytics

Reinforcement Learning

Supervised Learning
• Too many situations to be modeled
• Too many datasets are required to train the model

Reinforcement Learning
• Set basic driving rules and let the model drive a car
• Give penalties for bad driving results and rewards for good ones
Models* on Explainability Scale

Simple → Complex:
Linear Regression, Logistic Regression, Quantile Regression, CART, MARS, Neural Network, Deep Learning

• Simple end: Highest Explainability Power (White Box)
• Complex end: Lowest Explainability Power (Black Box)

*: Selected list of models (non-exhaustive) on the Explainability Scale.
Is my model correct? – The Correct Model Fallacy

PRINCIPLE 1: MANY CORRECT MODELS
Example: Baby Data

• Task: Build a Linear Regression Model to predict Weight (Y).


• You can use any input variables Xi

Disclaimer: Data is not real and meant for illustration only.

Example: Baby Data

• Any of the above models is fine.
• There is more than one correct model.
• Unlearn the habit, from previous mathematical education, of expecting a single right answer.
• Is my model good enough? How to judge?
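
To see several acceptable models side by side, the sketch below fits three Linear Regression models for Weight. Everything about the data is an assumption: the slide's baby data is not reproduced here, so the columns Age and Height and their values are made up purely for illustration.

```r
set.seed(2406)  # arbitrary seed so the made-up data is reproducible

# Made-up baby data: Age (months), Height (cm), Weight (kg)
n <- 50
Age <- runif(n, 0, 24)
Height <- 50 + 1.2 * Age + rnorm(n, sd = 2)
Weight <- 3.5 + 0.35 * Age + rnorm(n, sd = 0.6)
baby <- data.frame(Age, Height, Weight)

# Three candidate Linear Regression models for Weight (Y)
m1 <- lm(Weight ~ Age, data = baby)
m2 <- lm(Weight ~ Height, data = baby)
m3 <- lm(Weight ~ Age + Height, data = baby)

# Each is a valid model; they differ in how closely they fit the data
rmse <- function(model) sqrt(mean(residuals(model)^2))
fit_summary <- c(m1 = rmse(m1), m2 = rmse(m2), m3 = rmse(m3))
print(fit_summary)
```

None of the three is "the" correct model; the question becomes which one is good enough for the purpose.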

Model Predictive Accuracy for Continuous Y

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$
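
As a worked example, RMSE (root mean squared error) can be computed directly in R. This is a small sketch with made-up numbers; the function name rmse and the values are assumptions for illustration.

```r
# RMSE = square root of the mean squared difference
# between predicted Y and actual Y
rmse <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  sqrt(mean((predicted - actual)^2))
}

# Made-up values for illustration
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

# Errors are -1, 0, 1, 2; squared errors sum to 6; mean is 1.5
rmse_value <- rmse(predicted, actual)
```

Here rmse_value is the square root of 1.5, roughly 1.22, in the same units as Y.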

Model Predictive Accuracy for Categorical Y

Confusion Matrix

                            | Actual: Not Fraud | Actual: Fraud
Model Prediction: Not Fraud | 10                | 17
Model Prediction: Fraud     | 3                 | 20
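
The counts in this confusion matrix can be turned into accuracy figures in R. The sketch below uses the four numbers from the slide; the three summary measures are standard ones that the slide itself does not name, and the object names are illustrative.

```r
# Confusion matrix from the slide: rows = model prediction, columns = actual
cm <- matrix(c(10, 17,
                3, 20),
             nrow = 2, byrow = TRUE,
             dimnames = list(Predicted = c("Not Fraud", "Fraud"),
                             Actual    = c("Not Fraud", "Fraud")))

# Overall accuracy: correct predictions (the diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Of the actual fraud cases, the share the model caught
fraud_caught <- cm["Fraud", "Fraud"] / sum(cm[, "Fraud"])

# Of the cases the model flagged as fraud, the share that really were fraud
fraud_flag_correct <- cm["Fraud", "Fraud"] / sum(cm["Fraud", ])
```

With these counts, overall accuracy is 30 out of 50 (60%), yet 17 of the 37 actual fraud cases are missed, which a single accuracy figure hides.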

Is my model good enough?
• Based on Predictive Accuracy. Is it better than the target?
• Setting a target for predictive accuracy:
▪ historical predictive performance in your company
▪ best benchmark achieved in recent research papers
▪ the desired business impact, reverse-engineered into the required target
▪ comparison against a standard model (Linear Regression or Logistic Regression)
▪ what the boss wants [may or may not be realistic]
• Based on use requirements.
▪ A&E doctors and nurses want a simple model that they can use without a statistics background.

Zero Prediction Error! The Perfect Model Fallacy.

PRINCIPLE 2: TRAIN TEST SPLIT
Zero Prediction Error = Excellent?
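[Figure: exam analogy. Previous Exam Questions (Data) are used to Train and to Test Knowledge (Model).]

[Figure: exam analogy with a split. The Train Set of Previous Exam Questions (Data) builds Knowledge (Model); a mutually exclusive Test Set of Previous Exam Questions (Data) is used to test it.]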
Zero Prediction Error?
• Too good to be true
• Typically obtained based on trainset
▪ An implication of the Polynomial Interpolation Theorem
▪ Neural Networks are infamous for this
▪ Biased and over-optimistic
• Do not be misled. Is the goal to:
▪ Predict Historical Data, or
▪ Predict Future Data?
• Model Predictive Accuracy
▪ How do we decide if a model is good enough to be used, if not based on historical data?
▪ What would be a fair, unbiased estimate of error?

Train – Test Split Procedure (Basic)

Complete Dataset: columns X1, X2, …, Xm, Y

(1) Split the dataset randomly into train/test set: 70% Train, 30% Test.
(2) Train model using train set, giving a model with predicted Y.
(3) Test model by comparing model predicted Y vs actual Y in test set.
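
The three steps of the basic procedure can be sketched in base R. This is an illustration under assumptions: the built-in mtcars data stands in for the complete dataset, the model formula is arbitrary, and the seed is fixed only so the random split can be repeated.

```r
set.seed(2406)  # arbitrary seed so the random split is reproducible

data(mtcars)    # stand-in for the complete dataset
n <- nrow(mtcars)

# (1) Randomly pick 70% of the rows for the train set; the rest is the test set
train_idx <- sample(n, size = round(0.7 * n))
trainset <- mtcars[train_idx, ]
testset  <- mtcars[-train_idx, ]

# (2) Train the model using the train set only
model <- lm(mpg ~ wt + hp, data = trainset)

# (3) Compare model predicted Y with actual Y in the test set
pred <- predict(model, newdata = testset)
test_rmse  <- sqrt(mean((pred - testset$mpg)^2))
train_rmse <- sqrt(mean(residuals(model)^2))
```

The test set error, not the train set error, is the figure to report as the estimate of how the model performs on unseen data.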
Train – Test Split Procedure (Basic)

The same three steps (split, train, test) with 90% Train, 10% Test:
• Train Set: better model (more data to train on)
• Test Set: less reliable test (fewer cases to test on)
Train – Test Split Procedure (Basic)

The same three steps (split, train, test) with 50% Train, 50% Test:
• Train Set: worse model (less data to train on)
• Test Set: more reliable test (more cases to test on)
Train – Test Split Procedure (Basic)

If Y is categorical and a rare event (e.g., Cancer), a purely random 70% Train / 30% Test split can go wrong:
• Train Set: all Y = 1 cases
• Test Set: no Y = 1 case

The model then cannot be tested on the event it is meant to predict.
Train – Test Split Procedure (Enhanced)

Complete Dataset: columns X1, X2, …, Xm, Y, with Y = 1 (e.g., Cancer) and Y = 0 (e.g., No Cancer)

(1) Stratify on categorical Y.
(2) Split the dataset randomly, within each stratum, into train/test set: 70% Train, 30% Test.
(3) Train model using train set, giving a model with predicted Y.
(4) Test model by comparing model predicted Y vs actual Y in test set.

Both the Train Set and the Test Set then contain Y = 1 and Y = 0 cases.
Train Test Split

• Split your historical dataset into two pieces


▪ Stratify on y. [Why?]
▪ Typically, but not always, 70% Training Set
▪ Typically, but not always, 30% Test Set
• Develop your model on the Training Set
• Test your model on the Test Set
▪ Obtain unbiased estimate of model prediction error.
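
Stratifying on Y can be done in base R by splitting within each value of Y. The sketch below is one possible way under stated assumptions: the dataset is made up (100 cases, 10 of them with Y = 1), and the names dat, trainset and testset are illustrative.

```r
set.seed(2406)  # arbitrary seed so the random split is reproducible

# Made-up dataset: 100 cases, only 10 with the rare event Y = 1
dat <- data.frame(X1 = rnorm(100), Y = c(rep(1, 10), rep(0, 90)))

# Within each stratum of Y, randomly pick 70% of the row numbers
train_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$Y), function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))

trainset <- dat[train_idx, ]
testset  <- dat[-train_idx, ]

# Both sets keep the same 10% share of Y = 1
table(trainset$Y)
table(testset$Y)
```

Because the 70/30 split is applied separately to the Y = 1 and Y = 0 cases, the rare event cannot end up entirely in one of the two sets.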

How to select the best model within the model type?
PRINCIPLE 3: COMPLEXITY ADJUSTED MODEL PERFORMANCE
Implication of Model Complexity
• Idea from Polynomial Interpolation Theorem
• Hit zero error on trainset by using a very complex model

Number of data points and the polynomial that passes through all of them:
• Two Data Points: First Degree Polynomial ($Y = ax + b$)
• Three Data Points: Second Degree Polynomial ($Y = ax^2 + bx + c$)
• Four Data Points: Third Degree Polynomial ($Y = ax^3 + bx^2 + cx + d$)
• Five Data Points: Fourth Degree Polynomial ($Y = ax^4 + bx^3 + cx^2 + dx + e$)
• N Data Points: (N−1)th Degree Polynomial ($Y = a_{N-1}x^{N-1} + \cdots + a_0$)
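
The five-point case can be checked numerically in R. This is a minimal sketch with made-up data: five random Y values at X = 1 to 5, fitted once with a straight line and once with a fourth degree polynomial. The object names are illustrative.

```r
set.seed(2406)  # arbitrary seed so the made-up data is reproducible

# Five made-up train set points
x <- 1:5
y <- rnorm(5, mean = 5, sd = 2)

# First degree polynomial: a straight line, which leaves some trainset error
simple_model <- lm(y ~ x)

# Fourth degree polynomial: 5 coefficients for 5 points, so it fits exactly
complex_model <- lm(y ~ poly(x, degree = 4))

train_rmse <- function(model) sqrt(mean(residuals(model)^2))
simple_error  <- train_rmse(simple_model)
complex_error <- train_rmse(complex_model)
```

The complex model's trainset error is zero up to rounding, even though the Y values are pure noise, which is why zero trainset error says little about predictive quality.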
Implication of Model Complexity
[Figure: Y against X, with train set points and test set points. The Complex Model passes through every train set point (train set error = 0) but shows a large test set error; a Simple Model is shown for comparison.]
What is Model Complexity?

Model Complexity = # Input Variables

[Diagram: how Train Set Error and Test Set Error change with Model Complexity]
Overfitting Risk
[Figure: Model Prediction Error (vertical axis) against Model Complexity (horizontal axis, from Too Simple to Too Complex), with one curve for Trainset and one for Testset. The Underfit region is on the left and the Overfit region on the right.]

Beyond a certain model complexity, testset error increases while trainset error continues decreasing.
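
The behaviour of the two error curves can be explored with a small experiment in R. This is a sketch under assumptions: the data is made up, and model complexity is represented by polynomial degree (more terms in the model) rather than by a count of separate input variables.

```r
set.seed(2406)  # arbitrary seed so the made-up data is reproducible

# Made-up data: a curved relationship plus noise
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x - 0.04 * x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(x, y)

# 70% train, 30% test
train_idx <- sample(n, size = round(0.7 * n))
trainset <- dat[train_idx, ]
testset  <- dat[-train_idx, ]

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Model complexity here = polynomial degree, from 1 to 10
degrees <- 1:10
errors <- t(sapply(degrees, function(d) {
  model <- lm(y ~ poly(x, degree = d), data = trainset)
  c(degree = d,
    train = rmse(predict(model, newdata = trainset), trainset$y),
    test  = rmse(predict(model, newdata = testset), testset$y))
}))
print(round(errors, 3))
```

The train column can only fall or stay level as the degree rises, because each model contains the previous one. The test column carries no such guarantee, and comparing the two columns shows where extra complexity stops helping.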
Complexity-Adjusted Model Performance
• Find a way to balance both model prediction error and
model complexity.
• Classification and Regression Tree (CART) does this
automatically with an “alpha” parameter
▪ More details in Unit 8 Decision Trees.
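
In CART's cost-complexity pruning, a tree is scored as its error plus alpha times its number of terminal nodes. The sketch below is only an analogy, not CART itself: it applies the same kind of penalised score to polynomial models on made-up data, and the value of alpha is a hypothetical choice.

```r
set.seed(2406)  # arbitrary seed so the made-up data is reproducible

# Made-up data: a curved relationship plus noise
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x - 0.04 * x^2 + rnorm(n, sd = 0.5)

alpha <- 0.05    # hypothetical penalty per unit of complexity
degrees <- 1:8   # complexity = polynomial degree

# Trainset error (mean squared error) at each level of complexity
mse <- sapply(degrees, function(d) {
  mean(residuals(lm(y ~ poly(x, degree = d)))^2)
})

# Complexity-adjusted score: error plus alpha times complexity
score <- mse + alpha * degrees

best_degree <- degrees[which.min(score)]
```

With alpha set to 0 the most complex model always wins on trainset error; a larger alpha pushes the choice towards simpler models.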

Learning R

BASICS

Install R and RStudio

• Install R first
• Then install RStudio
• Before week 2 class.
• Bring your laptop to class
▪ To conserve battery, switch off wi-fi when
there is no need for internet access.

4-panel work area in RStudio
Operators: R as a calculator
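
A few lines typed at the R console show the common arithmetic operators. The numbers are arbitrary examples.

```r
2 + 3        # addition: 5
7 - 4        # subtraction: 3
6 * 5        # multiplication: 30
9 / 2        # division: 4.5
2 ^ 3        # exponent: 8
9 %/% 2      # integer division: 4
9 %% 2       # remainder (modulo): 1
(2 + 3) * 4  # brackets control the order of operations: 20
```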

The assignment Operator: <-

RStudio’s keyboard shortcut: Alt and - (the minus sign)
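
A short example of assignment in use. The object names price and demand are arbitrary choices for illustration.

```r
# Assign the value 10 to an object named price
price <- 10

# Use the object in a calculation and store the result in another object
demand <- price^2 - 14 * price + 70

# Typing an object's name prints its value
demand

# R is case sensitive: price and Price are different names
exists("price")   # TRUE
exists("Price")   # FALSE, unless an object called Price was created earlier
```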


Object Naming Convention in R

R is spelling and case sensitive

R functions

Type se and hit TAB. A popup shows you possible completions. Specify
seq() by typing more (a “q”) to disambiguate, or by using ↑/↓ arrows to
select. Notice the floating tooltip that pops up, reminding you of the
function’s arguments and purpose. If you want more help, press F1 to get all
the details in the help tab in the lower right pane.

Press TAB once more when you’ve selected the function you want. RStudio
will add matching opening (() and closing ()) parentheses for you. Type the
arguments 1, 10 and hit return.
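
The resulting call, plus two variations using other arguments of seq(), looks like this. The object names are illustrative.

```r
# The call built up in the steps above
one_to_ten <- seq(1, 10)
one_to_ten                      # 1 2 3 4 5 6 7 8 9 10

# by sets the step size
odd_numbers <- seq(1, 10, by = 2)
odd_numbers                     # 1 3 5 7 9

# length.out sets how many values to produce
quarters <- seq(0, 1, length.out = 5)
quarters                        # 0.00 0.25 0.50 0.75 1.00

# Typing ?seq in the console opens the same help page as pressing F1
```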

Text String

Create Vectors with c function

A vector is a set of values that are all of the same type.

Hint: Use the class function to check on the type of object.
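
A brief illustration of c() and class(), using made-up values and object names.

```r
# Numeric vector
prices <- c(5, 7, 9)
class(prices)     # "numeric"

# Character vector (text strings go inside quotes)
products <- c("shoes", "bag", "watch")
class(products)   # "character"

# Logical vector
on_sale <- c(TRUE, FALSE, TRUE)
class(on_sale)    # "logical"

# Mixing types forces every value to one common type
mixed <- c(1, "two", TRUE)
class(mixed)      # "character"
mixed             # "1" "two" "TRUE"
```

The last case shows what "all of the same type" means in practice: R silently converts the number and the logical value to text.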

Resources for Learning R

• BC3407 R and Python for BA
• Textbooks
• Blogs
• Online Tutorials:
▪ Quick R: https://www.statmethods.net/
Summary
• Visualization and Models have important roles.
• Fundamental Analytics concepts.
▪ More than one correct model.
▪ The importance of Trainset vs Testset.
▪ Beware of low/zero prediction error on trainset.
• R Basics
▪ Operators, Functions, Basic Plot.
▪ Import Dataset into R.
▪ Create new Data and Dataset within R.
▪ Export processed Dataset out of R.
▪ Saving and running R scripts.
▪ Errors are natural in programming. Just resolve them.
