
Econometrics & Machine Learning

Emmanuel Flachaire
Aix-Marseille University, Amse

https://egallic.fr/ECB

December 21, 2021

Emmanuel Flachaire Econometrics & Machine Learning


Econometrics & Machine Learning

1 Introduction and General Principles


The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation
2 Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning
3 Using Machine Learning methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Econometrics & Machine Learning


1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation

Emmanuel Flachaire Econometrics & Machine Learning


Statistical Modeling: The two Cultures1

There are two cultures in the use of statistical modeling to reach


conclusions from data:
Data Modeling Culture: one assumes that the data are
generated by a given stochastic data model (econometrics)

Algorithmic Modeling Culture: one uses algorithmic models


and treats the data mechanism as unknown (machine learning)

1
Léo Breiman, Statistical Science, 2001, Vol. 16, No. 3, 199-231
Emmanuel Flachaire Econometrics & Machine Learning
Statistical Modeling: The two Cultures

Léo Breiman (Statistical Science, 2001):

. . . an uncritical use of data models.

Emmanuel Flachaire Econometrics & Machine Learning


Misspecification bias
Let’s consider a quite general model:2 y = m(X ) + ε
Assume that X is fixed. The expected (squared) prediction
error, or EPE, is equal to

E(y − ŷ)² = E[m(X) + ε − m̂(X)]² = E[m(X) − m̂(X)]² + Var(ε)

where the first term is the reducible error and the second the irreducible error

The focus of Machine Learning is to estimate m with the aim
of minimizing the reducible error

Reducible error = MSE = [Bias(m̂(X))]² + Var(m̂(X))

Assuming that the data are generated by a specific model, or
that the model is correctly specified, amounts to assuming that
the (misspecification) bias is zero: Bias(m̂) = 0
2
y is a vector and X a matrix of observations, m a function, ε an error term
Emmanuel Flachaire Econometrics & Machine Learning
Misspecification bias: linear model

(Source: Berk, 2016)

Reducible error = mean function error (misspecification bias) + estimation error

Emmanuel Flachaire Econometrics & Machine Learning


Misspecification bias: quadratic model

(Source: Berk, 2016)

Reducible error = mean function error (misspecification bias) + estimation error

Emmanuel Flachaire Econometrics & Machine Learning


Econometrics and Machine Learning

Parametric econometrics: we assume that the data come from


a generating process that takes the following form

y = Xβ + ε

→ probability theory is a foundation of econometrics

Machine learning: we do not make any assumption on how


the data have been generated

y ≈ m(X )

→ probability theory is not required

Nonparametric econometrics makes the link between the two


Machine Learning: an extension of nonparametric econometrics

Emmanuel Flachaire Econometrics & Machine Learning


General Principle : optimization problem

Find the solution m̂ to the optimization problem:

Minimize over m:  Σ_{i=1}^n L(yi, m(Xi))   subject to   ‖m‖_ℓq ≤ t        (1)

which can be rewritten in Lagrangian form, for some λ ≥ 0:

Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq        (2)

where the sum is the loss function and the second term the penalization

The goal is to minimize a loss function under constraint


It is usually done by numerical optimization

Emmanuel Flachaire Econometrics & Machine Learning


General Principle : resolution by numerical optimization
Gradient Descent
Use a linear approximation at each step, from a Taylor expansion

(Source: Watt et al., 2016)

Algorithm: Gradient descent


Input: differentiable function g, fixed step length α, initial point w⁰
Repeat until a stopping condition is met: w^k = w^{k−1} − α g′(w^{k−1})
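As an illustration (not from the original slides), a minimal R sketch of this update rule on a simple convex function; the function g, its derivative, the step length and the starting point are all chosen for the example.

1 # Gradient descent sketch for g(w) = (w - 2)^2 (illustrative example)
2 g      <- function(w) (w - 2)^2
3 gprime <- function(w) 2 * (w - 2)          # derivative g'(w)
4 alpha  <- 0.1                              # fixed step length
5 w      <- 0                                # initial point w^0
6 for (k in 1:100) {
7   w_new <- w - alpha * gprime(w)           # w^k = w^{k-1} - alpha * g'(w^{k-1})
8   if (abs(w_new - w) < 1e-8) break         # stopping condition
9   w <- w_new
10 }
11 w  # close to the minimizer w = 2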
Emmanuel Flachaire Econometrics & Machine Learning
General Principle : resolution by numerical optimization

Newton’s Method
Use a quadratic approximation at each step, from a Taylor expansion3

(Source: Watt et al., 2016)


Converges in fewer steps than gradient descent for convex functions
Does not require a step length to be determined
3
The second-order Taylor series approximation centered at w^k is
h(w) = g(w^k) + g′(w^k)(w − w^k) + ½ g″(w^k)(w − w^k)²
Emmanuel Flachaire Econometrics & Machine Learning
General Principle : resolution by numerical optimization

Newton’s Method
Use quadratic approximations at each steps, from Taylor expansion

(Source: Watt et al., 2016)


It is used to find stationary points of a function: g′(w) = 0.
It can converge to a maximum where the function is concave.
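For comparison, a minimal R sketch of Newton's update on a one-dimensional quadratic (the choices of g′, g″ and starting point are illustrative, not from the original slides):

1 # Newton's method sketch for g(w) = w^2 - 4*w + 5 (illustrative example)
2 gprime  <- function(w) 2 * w - 4           # g'(w)
3 gsecond <- function(w) 2                   # g''(w)
4 w <- 10                                    # initial point
5 for (k in 1:50) {
6   w_new <- w - gprime(w) / gsecond(w)      # step from the quadratic approximation
7   if (abs(w_new - w) < 1e-10) break
8   w <- w_new
9 }
10 w  # stationary point: g'(w) = 0, here w = 2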

Emmanuel Flachaire Econometrics & Machine Learning


Regression: a simple Machine Learning method
Machine Learning (ML): solve the optimization problem
Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq   (loss function + penalization)

Let us consider:
L = ℓ2 (Euclidean distance): L(yi, m(Xi)) = (yi − m(Xi))²
m is a linear function of the parameters: yi ≈ Xiβ with β ∈ Rᵖ
no penalization: λ = 0
Thus, we have:

β̂ = argmin over β:  Σ_{i=1}^n (yi − Xiβ)²

It is the minimization of the Sum of Squared Residuals (SSR) in a
linear regression model; that is, β̂ is the OLS estimator.4
4
A Gradient Descent method can be used to solve this optimization problem.
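A small R sketch (not from the slides) of this equivalence: minimizing the sum of squared residuals numerically with optim() gives essentially the same coefficients as lm(); the simulated data are purely illustrative.

1 # OLS as a loss-minimization problem (illustrative simulated data)
2 set.seed(1)
3 n <- 100
4 x <- runif(n)
5 y <- 1 + 2 * x + rnorm(n)
6 ssr <- function(beta) sum((y - beta[1] - beta[2] * x)^2)  # loss function, lambda = 0
7 optim(c(0, 0), ssr)$par                                   # numerical minimizer, close to ...
8 coef(lm(y ~ x))                                           # ... the OLS estimates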
Emmanuel Flachaire Econometrics & Machine Learning
Linear regression from a Machine Learning perspective

Let us consider the simple linear regression model:

y = β0 + β1 x + ε (3)

From a Machine Learning perspective:


The linear regression provides the best straight line
approximation of the relationship between y and x 5
OLS estimators are obtained by minimizing prediction errors.
No probability theory is required!

Econometrics puts statistical assumptions on (3) in order to derive


properties of the OLS estimators and to make inference.6

5
In the sense that it minimizes prediction errors
6
convergence, unbiased/biased estimators, BLUE, statistical tests, etc.
Emmanuel Flachaire Econometrics & Machine Learning
Classification: a simple Machine Learning method

(Source: Watt et al., 2016)


we aim to learn a hyperplane Xβ = 0 (shown here in black)
to separate feature representations of the two classes.7
left panel: perfect linear separation
right panel: two overlapping classes → minimize the number
of misclassified points that end up in the wrong half-space.
7
X = [ι, x1, x2] is an n × 3 matrix.
Emmanuel Flachaire Econometrics & Machine Learning
Classification: the perceptron

A hyperplane placing the points on their correct side satisfies:

Xiβ > 0 if yi = +1
Xiβ < 0 if yi = −1

In other words, with y ∈ {−1, +1}:

if yi is correctly classified: yi(Xiβ) > 0
if yi is misclassified: yi(Xiβ) < 0

To minimize the aggregated distance of misclassified points to the
hyperplane, we can solve

Minimize over β:  Σ_{i=1}^n max(0, −yi(Xiβ))

where max(0, −yi(Xiβ)) is the perceptron or max loss function.
Emmanuel Flachaire Econometrics & Machine Learning
Classification: smooth version of the perceptron

(Source: Watt et al., 2016)


The perceptron loss function is non-differentiable (in green).8
The softmax loss function is a smooth approximation (black):9
g(s) = softmax(0, s) = log(1 + e^s)
8
g(s) = max(0, s)
9
g(s) = log(1 + e^s). Gradient descent and Newton's methods can be used
Emmanuel Flachaire Econometrics & Machine Learning
Classification: softmax and perceptron

(Source: Watt et al., 2016)


Minimizing the softmax loss function gives β̂, which defines:
the linear separator Xβ̂ = 0 shown in the left panel,
the surface y(x) = 2Λ(Xβ̂) − 1 shown in the right panel.
The softmax model is a smooth approximation of the perceptron
Emmanuel Flachaire Econometrics & Machine Learning
Classification: logit regression and perceptron
Minimizing the softmax loss function:10

Minimize over β:  Σ_{i=1}^n log(1 + e^{−yi(Xiβ)})

is equivalent to maximizing the log-likelihood in a logit model:

Maximize over β:  Σ_{i=1}^n [ y′i log Λ(Xiβ) + (1 − y′i) log(1 − Λ(Xiβ)) ]

with y′i ∈ {0, 1} and Λ(x) = e^x/(1 + e^x) = 1/(1 + e^{−x}) the logistic function11

→ Logit model = softmax model
→ The logit model is a smooth approximation of the perceptron
10
softmax(0, −yi(Xiβ)) = log(1 + e^{−yi(Xiβ)})
11
log Λ(Xiβ) = −log(1 + e^{−Xiβ}) and log(1 − Λ(Xiβ)) = −log(1 + e^{Xiβ})
Emmanuel Flachaire Econometrics & Machine Learning
Classification: a simple Machine Learning method
Machine Learning: solve the optimization problem
Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq   (loss function + penalization)

Let us consider:12
the softmax loss function: L = softmax(0, −yi(Xiβ))
no penalization: λ = 0.
Thus, we have:13

β̂ = argmin over β:  Σ_{i=1}^n log(1 + e^{−yi(Xiβ)})

which is equivalent to maximizing the log-likelihood in a logit regression
model; that is, β̂ is the MLE estimator.
12
yi ≈ m(Xi) = 1±(Xiβ ≥ 0) = {+1 if Xiβ ≥ 0 ; −1 if Xiβ < 0}
13
A Gradient Descent method can be used to solve this optimization problem.
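A minimal R sketch (not from the slides) of this equivalence: minimizing the softmax loss with y ∈ {−1, +1} recovers the coefficients of a logit regression fitted by glm() on y′ ∈ {0, 1}; the simulated data are illustrative.

1 # Softmax-loss minimization vs. logit MLE (illustrative simulated data)
2 set.seed(1)
3 n  <- 500
4 x  <- rnorm(n)
5 yp <- rbinom(n, 1, plogis(-0.5 + x))     # y' in {0,1}
6 y  <- 2 * yp - 1                         # y  in {-1,+1}
7 softmax_loss <- function(beta) sum(log(1 + exp(-y * (beta[1] + beta[2] * x))))
8 optim(c(0, 0), softmax_loss)$par         # close to ...
9 coef(glm(yp ~ x, family = binomial))     # ... the logit MLE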
Emmanuel Flachaire Econometrics & Machine Learning
Logit regression from a Machine Learning perspective

Let us consider the logit regression model:14

E(y′|X) = P(y′ = 1|X) = Λ(Xβ)        (4)

From a Machine Learning perspective:


The logit model is a smooth approximation of the perceptron
MLE estimator is obtained by minimizing classification errors.
No probability theory is required!

Econometrics puts statistical assumptions on (4) in order to derive


properties of the MLE estimator and to make inference.

14
Since y′ ∈ {0, 1}, E(y′|X) = 0 × P(y′ = 0|X) + 1 × P(y′ = 1|X)
Emmanuel Flachaire Econometrics & Machine Learning
Linear/Logit models from a Machine learning perspective

Optimal parameters: Minimize prediction/classification errors


The convenience of convexity:

(Source: Watt et al., 2016)

A unique solution is easily obtained numerically/analytically.


Using probability theory, properties of the optimal parameters
are derived and inference can be drawn (Econometrics)

Emmanuel Flachaire Econometrics & Machine Learning


Moving beyond linearity: Regression

(Source: Watt et al., 2016)

Non-linearity (left panel) and interaction effects (right panel).


Knowledge-driven feature design is used in Econometrics.15
Automatic feature design is used in Machine Learning
15
fixed transformed covariates: quadratic, cubic, etc. and/or cross-products
Emmanuel Flachaire Econometrics & Machine Learning
Moving beyond linearity: Classification

(Source: Watt et al., 2016)

Non-linearity (left panel) and interaction effects (right panel).


Knowledge-driven feature design is used in Econometrics.16
Automatic feature design is used in Machine Learning
16
fixed transformed covariates: quadratic, cubic, etc. and/or cross-products
Emmanuel Flachaire Econometrics & Machine Learning
Nonparametric Econometrics

Machine Learning:
High non-linearity and strong interaction effects are taken into
account with automatic feature design.
In general, a non-convex function is minimized.

Nonparametric Econometrics:
A nonparametric regression takes such effects into account.
It may work well in low dimensions, but not in high dimensions.17

Machine Learning is an extension of Nonparametric Econometrics.

17
Because of the curse of dimensionality. Note that GAM models may
capture automatically non-linearities, but not interaction effects.
Emmanuel Flachaire Econometrics & Machine Learning
1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation

Emmanuel Flachaire Econometrics & Machine Learning


General Principle

Machine Learning: solve the optimization problem


Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq   (loss function + penalization)

Choice of the loss function:


L → conditional mean, quantiles, expectiles
m → linear, logit, splines, tree-based models, neural networks

Choice of the penalization:


`q → lasso, ridge
λ → over-fitting, under-fitting, cross validation

Emmanuel Flachaire Econometrics & Machine Learning


Loss function: Tradeoff between flexibility & interpretability

(Source: James et al., 2013)


Emmanuel Flachaire Econometrics & Machine Learning
Over-fitting

A model with high flexibility may fit the observations used for
estimation perfectly, but new observations very poorly

[Figure: simulated data with the true model and a very flexible fit (λ = 0) that interpolates the observations]

→ penalization: put a price to pay for having a more flexible model

Emmanuel Flachaire Econometrics & Machine Learning


Under-fitting

If we put a huge cost for a more complex model, λ = ∞, we


obtain a linear regression model
[Figure: the same simulated data with the true model and the fit obtained with λ = ∞, a straight line]

→ if the cost is too large: low variance, but high bias

Emmanuel Flachaire Econometrics & Machine Learning


Penalization: Tradeoff between bias & variance

Penalization: put a price to pay for having a more flexible model

λ = 0: it interpolates the data → low bias, high variance
λ = ∞: linear model → high bias, low variance

→ the penalty parameter λ ≡ bias/variance tradeoff

Role of λ: avoid over-fitting and poor prediction with new data

Choice of λ: automatic selection procedures are based on model’s


performance evaluated out-sample, by cross-validation

Emmanuel Flachaire Econometrics & Machine Learning


1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation

Emmanuel Flachaire Econometrics & Machine Learning


Model assessment

The best model has the lowest prediction error. With squared
error loss, the mean squared prediction error is equal to
MSE = (1/n) Σ_{i=1}^n (yi − m̂λ(xi))² = SSR/n

Due to overfitting, we cannot use SSR and R² based on the


sample used for estimation (≡ in-sample, training sample)

We are interested in the accuracy of the predictions obtained


from previously unseen data (≡ out-sample, test sample)

The in-sample MSE (training error) can be a poor estimate of


the out-sample MSE (test error)

Emmanuel Flachaire Econometrics & Machine Learning


Model assessment

In order to select the best model with respect to test error, we


need to estimate this test error (out-sample MSE)
There are two common approaches:
We can indirectly estimate test error by making an adjustment
to the training error to account for the bias due to overfitting
→ penalization ex-post: R²adj, AIC, BIC
We can directly estimate the test error, using either a
validation set approach or a cross-validation approach
→ penalization ex-ante

CV provides a direct estimate of test error, makes fewer


assumptions about the true model and can be used widely

In the past, performing CV was computationally prohibitive.


Nowadays, the computations are hardly ever an issue
Emmanuel Flachaire Econometrics & Machine Learning
Out-sample: Validation set

(Source: James et al., 2013)


We randomly split the sample into two groups of observations: a
training set (q − 1 obs.) and a validation/test set (n − q obs.)

1 estimation, n − q values → MSE = 1/(n − q) Σ_{i=q}^n (yi − ŷi)²
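A minimal R sketch of the validation-set approach (not from the slides); the 50/50 split, the polynomial model and the simulated data are illustrative.

1 # Validation-set approach: fit on a training set, compute MSE on a test set
2 set.seed(1)
3 n     <- 200
4 x     <- runif(n)
5 y     <- sin(12 * (x + .2)) / (x + .2) + rnorm(n) / 2
6 train <- sample(1:n, size = n / 2)                 # random split
7 fit   <- lm(y ~ poly(x, 5), subset = train)        # estimate on the training set only
8 pred  <- predict(fit, newdata = data.frame(x = x[-train]))
9 mean((y[-train] - pred)^2)                         # out-sample (test) MSE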

Emmanuel Flachaire Econometrics & Machine Learning


Cross-Validation: LOOCV or n-fold CV18

(Source: James et al., 2013)

n estimations, n values → MSE = (1/n) Σ_{i=1}^n (yi − ŷi)²
18
LOOCV: leave-one-out cross-validation
Emmanuel Flachaire Econometrics & Machine Learning
Cross-Validation: K -fold CV

(Source: James et al., 2013)


K estimations, n values → MSE = (1/n) Σ_{k=1}^K Σ_{i∈Gk} (yi − ŷi)²
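A minimal R sketch of K-fold cross-validation written by hand (not from the slides), with K = 5 and illustrative simulated data:

1 # 5-fold cross-validation by hand (illustrative simulated data)
2 set.seed(1)
3 n    <- 200
4 x    <- runif(n)
5 y    <- sin(12 * (x + .2)) / (x + .2) + rnorm(n) / 2
6 K    <- 5
7 fold <- sample(rep(1:K, length.out = n))           # random fold labels G_1,...,G_K
8 pred <- numeric(n)
9 for (k in 1:K) {
10   fit <- lm(y ~ poly(x, 5), subset = (fold != k)) # estimate without fold k
11   pred[fold == k] <- predict(fit, newdata = data.frame(x = x[fold == k]))
12 }
13 mean((y - pred)^2)                                 # CV estimate of the test MSE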

Emmanuel Flachaire Econometrics & Machine Learning


Prediction error in-sample vs. out-sample
← underfit λ−1 overfit →

(Source: Hastie et al., 2009)

Underfitting: the model performs poorly on training and test samples


Overfitting: performs well on training sample, but generalizes poorly on test sample

Emmanuel Flachaire Econometrics & Machine Learning


Standardization and Normalization
Several ML methods are sensitive to the units of the covariates,
such as Ridge/Lasso regressions, SVM and Neural Networks

The results may differ substantially when multiplying a given


covariate by a constant (meters/kilometers, kilograms/grams)

It is best to standardize the data before using these methods:


x/√Var(x)   or   (x − x̄)/√Var(x)

so that they are all on the same scale

Normalization is another scaling technique where the values
end up ranging between 0 and 1:

(x − xmin)/(xmax − xmin)
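In R this can be done with scale() or by hand; a minimal sketch on illustrative data:

1 # Standardization and normalization of a covariate (illustrative data)
2 x <- c(2, 5, 9, 14, 20)
3 x / sd(x)                          # standardization without centering
4 (x - mean(x)) / sd(x)              # standardization of the centered variable
5 (x - min(x)) / (max(x) - min(x))   # normalization to [0, 1]
6 scale(x)                           # scale() centers and standardizes by default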

Emmanuel Flachaire Econometrics & Machine Learning


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Ridge and Lasso Regressions


Introduction

Linear regression

y = Xβ + ε n observations, p covariates

Least Squares
Collinearity or many irrelevant covariates → high variance
More covariates than observations, p > n → undefined

Ridge and Lasso


Collinearity, many irrelevant covariates → smaller variance
High-dimensional data analysis, p  n → feasible

Emmanuel Flachaire Ridge and Lasso Regressions


Shrinkage Methods

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|^q

It is equivalent to minimizing SSR subject to Σ_{j=1}^p |βj|^q ≤ c

No penalization corresponds to OLS unbiased estimation

The penalization restricts the magnitude of the coefficients
It shrinks the coefficients toward 0 as λ increases (or c decreases)
It introduces some bias in the coefficients

→ Add some bias if it leads to a substantial decrease in variance

Emmanuel Flachaire Ridge and Lasso Regressions


Shrinkage Methods

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|^q

It is equivalent to minimizing SSR subject to Σ_{j=1}^p |βj|^q ≤ c

Idea: biased coefficients may result in a model with smaller MSE

The penalty term λ governs a bias-variance tradeoff
λ is selected by cross-validation (MSE out-sample)

Overall, use shrinkage methods when OLS exhibits large variance


(with many irrelevant or highly correlated covariates)

Emmanuel Flachaire Ridge and Lasso Regressions


Standardization

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|^q

It is equivalent to minimizing SSR subject to Σ_{j=1}^p |βj|^q ≤ c

The results are sensitive to the scale of the covariates


If a covariate is divided by 10, its coefficient is multiplied by
10, which has an impact on the constraint
It is best to standardize covariates before using shrinkage
methods, so that they are all on the same scale:
x/√Var(x)   or   (x − x̄)/√Var(x)

Emmanuel Flachaire Ridge and Lasso Regressions


Ridge Regression

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p βj²

Ridge = shrinkage method based on the ℓ2 norm (q = 2)

The restriction is convex and makes the problem easy to solve:

β̂ = (X⊤X + λIp)⁻¹ X⊤y

where Ip is the p × p identity matrix

λ > 0: (X⊤X + λIp) is non-singular even if X is not of full rank
The Ridge method is defined in high-dimensional problems, p ≫ n
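A minimal R sketch (not from the slides) of the closed-form expression, on illustrative simulated data with standardized covariates and the intercept ignored; note that glmnet parameterizes λ differently, so the point here is only the formula itself.

1 # Ridge closed form: beta_hat = (X'X + lambda I_p)^{-1} X'y (illustrative data)
2 set.seed(1)
3 n <- 50; p <- 5
4 X <- scale(matrix(rnorm(n * p), n, p))             # standardized covariates
5 y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)
6 lambda <- 10
7 solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # ridge coefficients
8 solve(t(X) %*% X, t(X) %*% y)                      # OLS for comparison (lambda = 0)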

Emmanuel Flachaire Ridge and Lasso Regressions


Lasso Regression

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|

Lasso = shrinkage method based on the ℓ1 norm (q = 1)

The restriction is convex and makes the problem easy to solve
numerically, but there is no closed-form expression as in ridge
The nature of the constraint will cause some coefficients to be
exactly zero, for λ sufficiently large (or c sufficiently low)
Lasso performs variable selection with many irrelevant variables
Lasso is appropriate for sparse models, in which only a
relatively small number of covariates play an important role

Emmanuel Flachaire Ridge and Lasso Regressions


Lasso vs. Ridge

(Source: Hastie et al., 2015)

Unlike the Ridge constraint, the Lasso constraint has corners


If the solution occurs at a corner, it has one parameter equal to 0
Emmanuel Flachaire Ridge and Lasso Regressions
Lasso vs. Ridge

(Source: Hastie et al., 2015)


The x-axis is the factor c, from |β1|^q + |β2|^q ≤ c, normalized to 1
Lasso: many coefficients are exactly zero for low c → variable selection
Emmanuel Flachaire Ridge and Lasso Regressions
Lasso and Variable Selection

Lasso constraint: Σ_{j=1}^p |βj| ≤ c

The optimal c for prediction and variable selection are different:

For variable selection, the optimal parameter c shrinks


non-zero coefficients toward zero → bias
For prediction, the optimal parameter c is often larger than
for selection, to reduce the bias on non-zero coefficients
Lasso selects λ or c by CV, based on MSE → for prediction
Lasso often includes too many variables (c is often too large)
But the true model is very likely a subset of these variables

Emmanuel Flachaire Ridge and Lasso Regressions


The one standard error rule

Breiman et al. (1984) proposed a rule-of-thumb:19

Lasso selects λ by CV, based on MSE → for prediction


Consider values of λ whose CV error is within one standard error of the minimum
Pick the largest value of λ within this interval (smallest c)

The main idea of the 1 SE rule is to choose the most parsimonious
model whose accuracy is comparable with the best model

It is a rule-of-thumb, expected to provide a value of λ in between


the optimal one for prediction and the optimal one for selection

19
Breiman, Friedman, Stone, Olshen (1984) Classification and Regression
Trees
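With cv.glmnet, used in the application a few slides below, both values are returned directly; a minimal sketch, assuming x and y are the covariate matrix and response defined there:

1 # lambda.min vs. lambda.1se in cv.glmnet (assumes x, y as in the application below)
2 library(glmnet)
3 cv.lasso <- cv.glmnet(x, y, alpha = 1)
4 cv.lasso$lambda.min                # lambda minimizing the CV MSE (prediction)
5 cv.lasso$lambda.1se                # largest lambda within one SE (1 SE rule)
6 coef(cv.lasso, s = "lambda.1se")   # typically a sparser model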
Emmanuel Flachaire Ridge and Lasso Regressions
Simulation results with uncorrelated covariates

Linear regression model with many covariates, n = 1000


Monte Carlo simulation with 1000 replications
λ̂min is selected by CV, λ̂1se with the 1 SE rule
Potency: proportion of relevant variables selected
Gauge: proportion of irrelevant variables selected

→ Lasso with λ̂min selects 29.9% of irrelevant variables


→ Lasso with λ̂1se selects 3.2% of irrelevant variables, but the MSE increases

Emmanuel Flachaire Ridge and Lasso Regressions


Simulation results with correlated covariates

Linear regression model with many covariates, n = 1000


Monte Carlo simulation with 1000 replications
λ̂min is selected by CV, λ̂1se with the 1 SE rule
Potency: proportion of relevant variables selected
Gauge: proportion of irrelevant variables selected

→ Lasso with λ̂min selects 64.3% of irrelevant variables


→ Lasso with λ̂1se selects 45.8% of irrelevant variables, but the MSE increases

Emmanuel Flachaire Ridge and Lasso Regressions


Adaptive Lasso

The Adaptive Lasso is based on the following constraint:20


Σ_{j=1}^p wj |βj| ≤ c   where wj = 1/|β̂j|^ν

where β̂j is the OLS estimate and ν > 0.

Smaller weights are put on larger coefficients in the constraint
Large non-zero coefficients shrink more slowly to zero as c decreases
This leads to the oracle property, simultaneously achieving
Consistent variable selection
Optimal estimation/prediction

ν is often set equal to 1, but it could be selected by CV


20
Zou (2006), JASA, 101 1418-1429
Emmanuel Flachaire Ridge and Lasso Regressions
Elastic-net

The Elastic-net is based on the following constraint:21


Σ_{j=1}^p ( r βj² + (1 − r)|βj| ) ≤ c

where r = 1 corresponds to the Ridge and r = 0 to the Lasso.


Lasso may perform poorly with highly correlated covariates,
which is often encountered in high-dimensional data analysis
By combining a `2 -penalty with the `1 -penalty, we obtain a
method that deals better with such correlated groups, and
tends to select the correlated covariates (or not) together.
Like Lasso, Elastic-net often includes too many covariates

21
Zou and Hastie (2005), JRSS Series B, 67, 301-320. It corresponds to the
penalization λ1 Σ_{j=1}^p βj² + λ2 Σ_{j=1}^p |βj|, where λ1 and λ2 are selected by CV.
Emmanuel Flachaire Ridge and Lasso Regressions
Adaptive Elastic-net

The Adaptive Elastic-net is based on the following constraint:22


Σ_{j=1}^p ( r βj² + (1 − r) wj |βj| ) ≤ c

where wj = 1/(|β̂j| + 1/n)^ν and ν > 0.23

Adaptive Lasso has the oracle property (consistent variable selection),


but inherits the instability of Lasso for high-dimensional data
Elastic-net deals better in high-dimensional data analysis,
but it lacks the oracle property
Adaptive Elastic-net combines the two approaches
22
Zou and Zhang (2009), Annals of Statistics, 37, 1733-1751. It corresponds to
the penalization λ1 Σ_{j=1}^p βj² + λ2 Σ_{j=1}^p wj |βj|, where λ1, λ2 are selected by CV
23
1/n in wj is used to avoid division by 0
Emmanuel Flachaire Ridge and Lasso Regressions
Application: Predict baseball player’s Salary
What predictors are associated with baseball player’s Salary?
Salary – 1987 annual salary on opening day in thousands of dollars;
Years – Number of years in the major leagues;
Hits – Number of hits in 1986;
Atbat – Number of times at bat in 1986;
...
library(ISLR)
Hitters=na.omit(Hitters)
x=model.matrix(Salary~., Hitters)[,-1]
y=Hitters$Salary
# Ridge and Lasso
library(glmnet)
ridge.model=glmnet(x, y, alpha=0)
lasso.model=glmnet(x, y, alpha=1)
par(mfrow=c(1,2))
plot(ridge.model, main="Ridge")
plot(lasso.model, main="Lasso")

By default, the covariates are standardized, otherwise use the argument


standardize=FALSE in the function glmnet
Emmanuel Flachaire Ridge and Lasso Regressions
Application: Coefficient paths

Coefficient paths for Ridge and Lasso as c increases

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Cross Validation
cv.ridge=cv.glmnet(x, y, alpha=0)
cv.lasso=cv.glmnet(x, y, alpha=1)
plot(cv.ridge, main="Ridge")
plot(cv.lasso, main="Lasso")

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Adaptive Lasso and Adaptive Elastic-net
ols=lm(Salary~., Hitters)
w=1/abs(coef(ols)[-1])   # weights from the OLS coefficients, dropping the intercept
cv.adalasso <- cv.glmnet(x, y, alpha=1, penalty.factor=w)
cv.adaelast <- cv.glmnet(x, y, alpha=.5, penalty.factor=w)
plot(cv.adalasso, main="Adaptive Lasso")
plot(cv.adaelast, main="Adaptive Elastic-net")

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Compare the coefficients

coef1=coef(ols)
coef2=coef(cv.ridge, s="lambda.min")
coef3=coef(cv.lasso, s="lambda.min")
coef4=coef(cv.lasso, s="lambda.1se")
coef5=coef(cv.adalasso, s="lambda.min")
coef6=coef(cv.adaelast, s="lambda.min")
options(scipen=999)   # disable scientific notation
coeff=cbind(coef1, coef2, coef3, coef4, coef5, coef6)
colnames(coeff) <- c("ols", "ridge", "lasso", "lasso.1se", "adaLasso", "adaElastic")
coeff

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Compare the coefficients

Shrinkage methods: the coefficients are shrunk towards zero


Variable selection is sensitive to the method
Emmanuel Flachaire Ridge and Lasso Regressions
Summary

Ridge and Lasso can be used in high dimensions (p ≫ n)


They are based on a bias-variance tradeoff
Tradeoff selected minimizing MSE out-sample by CV
Sparse models: Lasso is a variable selection method
Ridge assigns similar coefficients to strongly correlated variables,
while Lasso selects one randomly
Extension to adaptive Lasso and Elastic-net

Emmanuel Flachaire Ridge and Lasso Regressions


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree
y ∈ {0, 1} is a qualitative variable

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree: Principle from a small sample

Find the best rule on a single variable to classify black/white balls

→ find a cutoff on x1 or x2 such that the maximum number of


observations is correctly classified 24

24
See https://freakonometrics.hypotheses.org/52776
Emmanuel Flachaire Classification and Regression Tree (CART)
Classification Tree: graphical representation
Minimizing misclassification, we find x2 < k, where k ∈ (0.56, 0.8)

This Figure represents the best split in a competition between all


possible splits of x1 and x2.
From this simple rule, two balls are misclassified . . . we can try to
find a new rule in the white-area sub-group . . .
Emmanuel Flachaire Classification and Regression Tree (CART)
Classification Tree: A sum of simple rules

The additional rule x1 ≥ c, where c ∈ (.16, .2), produces the best


subsequent split:

Using these two rules, all balls are correctly classified

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree: Extension to large sample

Interpretation is quite easy and intuitive

We use recursive binary splitting to grow a tree

A tree can grow until every observation is correctly classified

With a large sample, a tree may have many nodes, that is,
many points where a predictor is split into two leaves

Note that this principle is easy to apply, even with several


regressors and several classes

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree: Example with 100 observations

The resulting tree is quite complex and not so easy to interpret

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification: Tree pruning

An unpruned tree:
classifies correctly every observation from a training sample
may classify poorly observations from a test sample
(it is the standard problem of overfitting)
may be difficult to interpret

A pruned tree:
is a smaller tree with fewer splits
might perform better on a test sample
might have an easier interpretation

→ define a criterion for making binary splits

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification: Tree pruning

A fully grown tree fits perfectly training data, poorly test data

Tree pruning is used to control the problem of overfitting

A smaller tree with fewer splits might lead to lower variance


and better interpretation at the cost of a little bias

Poor strategy: Add new split only if it is worthwhile to do so

However, a poor split could be followed by a very good split

Good strategy: Grow a very large tree and prune it back in


order to obtain a subtree, keep a split only if it is worthwhile

→ We need to define what ”only if it is worthwhile” means

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification tree: Gini impurity index
Classification: we are not only concerned with class prediction, but also
with class proportions → minimize impurity rather than misclassification

Gini impurity index at some node N :25


G(N) = Σ_{k=1}^K pk(1 − pk) = 1 − Σ_{k=1}^K pk²

with pk the fraction of items labeled with class k in the node

Node: 100-0 or 0-100 → minimal impurity/diversity: G = 0,
Node: 50-50 → maximal impurity/diversity: G = 1/2.26

A small value means that a node contains predominantly


observations from a single class (homogeneity)
25
Another index is the Entropy "impurity" index E(N) = −Σ_{k=1}^K pk log pk
26
With 100-0: p1 = 1, p2 = 0 ; 0-100, p1 = 0, p2 = 1 ; 50-50: p1 = p2 = 1/2
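A minimal R sketch (not from the slides) computing the Gini impurity of a node from its class counts:

1 # Gini impurity of a node from class counts (illustrative counts)
2 gini <- function(counts) {
3   p <- counts / sum(counts)      # class proportions p_k
4   1 - sum(p^2)                   # G(N) = 1 - sum_k p_k^2
5 }
6 gini(c(100, 0))    # pure node: G = 0
7 gini(c(50, 50))    # maximally mixed node with two classes: G = 0.5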
Emmanuel Flachaire Classification and Regression Tree (CART)
Classification: Tree pruning

If we split the node into two leaves, NL and NR , the Gini


impurity index becomes

G(NL, NR) = pL G(NL) + pR G(NR)

where pL, pR are the proportions of observations in NL, NR

When do we split? . . . when impurity is reduced substantially:

∆ = G(N) − G(NL, NR) > ε

we can also require a minimum number of observations per node

How do we split? . . . find the cutoff on a single variable that
minimizes impurity rather than misclassification (→ max ∆)

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification tree: Limitation

Small change in the original sample ⇒ Tree may differ significantly


Emmanuel Flachaire Classification and Regression Tree (CART)
Application: Predict survival on the Titanic
Consider passenger data from the sinking of the Titanic
What predictors are associated with those who perished
compared to those who survived?
survived – 1 if true, 0 otherwise;
sex – the gender of the passenger;
age – age of the passenger in years;
pclass – the passenger’s class of passage;
sibsp – the number of siblings/spouses aboard;
parch – the number of parents/children aboard.

library(PASWR)        # get the data
data(titanic3)
library(rpart)        # CART package
library(rpart.plot)   # fancy plots
tree <- rpart(survived ~ sex+age+pclass+sibsp+parch, data=titanic3, method="class")
prp(tree, extra=1, faclen=5, box.col=c("indianred1", "aquamarine")[tree$frame$yval])

Emmanuel Flachaire Classification and Regression Tree (CART)


Application: Titanic classification tree

Emmanuel Flachaire Classification and Regression Tree (CART)


Application: Variable importance
The importance of each variable, related to the gain in Gini, is
tree$variable.importance

      sex    pclass     sibsp       age     parch
172.74924  50.78568  27.33127  20.95528  20.46938

that we can plot using

barplot(tree$variable.importance, horiz=TRUE, col="yellow3")

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree
y ∈ R is a quantitative variable

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with one covariate
Find the best split, which minimizes deviations to the mean
(variances) in each leaf:

→ find the cutoff on x that minimizes:  Σ_{xi∈NL} (yi − ȳL)² + Σ_{xi∈NR} (yi − ȳR)²

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with one covariate
Then use recursive binary splitting:

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with two covariates
With two covariates, y ≈ m(x1 , x2 ), we have:

(Source: James et al., 2013)


Find boxes R1, . . . , RJ that minimize the SSR:1  Min Σ_{j=1}^J Σ_{xi∈Rj} (yi − ȳRj)²
1
We cannot consider every possible partition → use recursive binary splitting
Emmanuel Flachaire Classification and Regression Tree (CART)
Application: Predict baseball player’s Salary

Let’s consider 3 covariates: y ≈ m(x1 , x2 , x3 )

What predictors are associated with baseball player’s Salary?


Salary – 1987 annual salary on opening day in thousands of dollars;
Years – Number of years in the major leagues;
Hits – Number of hits in 1986;
Atbat – Number of times at bat in 1986;

library(ISLR)
# remove observations that are missing Salary values
df=Hitters[complete.cases(Hitters$Salary),]
# load CART library
library(rpart)
library(rpart.plot)
# estimate the tree
tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df, cp=0)
# plot the tree
prp(tree, extra=1, faclen=5)

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with several covariates
With more covariates, we can only use the decision tree figure:

based on the same principle: find terminal nodes that min SSR
Emmanuel Flachaire Classification and Regression Tree (CART)
Regression: Tree pruning

A fully grown tree fits perfectly training data, poorly test data

Tree pruning is used to control the problem of overfitting

A smaller tree with fewer splits might lead to lower variance


and better interpretation at the cost of a little bias

Poor strategy: Build the tree only so long as the decrease in


the SSR due to each split exceed some threshold

However, a poor split could be followed by a very good split

Good strategy: Grow a very large tree and prune it back in


order to obtain a subtree, keep a split only if it is worthwhile

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression: Tree Pruning

Penalization: we put a price to pay for having a tree with
many terminal nodes J, or regions,

Min Σ_{j=1}^J Σ_{i∈Rj} (yi − ȳRj)² + λ J

For given λ, we can find the subtree minimizing this


criterion27

Cross-validation: we select λ using cross validation

A smaller tree with fewer splits might lead to lower variance


and better interpretation at the cost of a little bias

27
λ is called the complexity parameter
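In rpart, the complexity parameter cp plays the role of λ (rescaled by the root deviance) and the cross-validated error for each subtree is stored in the cptable; a minimal sketch, assuming the data frame df and the rpart/rpart.plot packages from the earlier baseball application:

1 # Cost-complexity pruning with rpart (assumes df, rpart and rpart.plot as above)
2 tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df, cp=0)   # grow a large tree
3 printcp(tree)                      # cross-validated error for each value of cp
4 best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
5 pruned  <- prune(tree, cp=best.cp) # subtree minimizing the CV error
6 prp(pruned, extra=1, faclen=5)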
Emmanuel Flachaire Classification and Regression Tree (CART)
Tree pruning: Application
tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df)   # CV
prp(tree, extra=1, faclen=5)

Unpruned tree Pruned tree

Emmanuel Flachaire Classification and Regression Tree (CART)


Tree versus linear model

Tree vs. linear model: Which model is better?

It depends on the problem at hand:


Linear regression: m(X) = β0 + Σ_{j=1}^K Xj βj

Regression tree: m(X) = Σ_{j=1}^J cj 1(X ∈ Rj)

If the relationship between y and x1 , ..., xK is linear: a linear


model should perform better

If the relationship between y and x1 , ..., xK is highly non-linear


and complex: a tree model should perform better

Emmanuel Flachaire Classification and Regression Tree (CART)


True decision boundary: linear
[Two-panel figure: logit (left) and tree (right) decision boundaries, axes Variable 1 and Variable 2]

Blue area ≡ fitted values in blue from linear (left) and tree (right) models

Emmanuel Flachaire Classification and Regression Tree (CART)


True decision boundary: nonlinear
[Two-panel figure: logit (left) and tree (right) decision boundaries, axes Variable 1 and Variable 2]

Blue area ≡ fitted values in blue from linear (left) and tree (right) models

Emmanuel Flachaire Classification and Regression Tree (CART)


True decision boundary: interactions
[Two-panel figure: logit (left) and tree (right) decision boundaries, axes Variable 1 and Variable 2]

Blue area ≡ fitted values in blue from linear (left) and tree (right) models

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification And Regression Tree (CART)

Advantages:

Trees tend to work well for problems where there are


important nonlinearities and interactions

The results are really intuitive and can be understood even by


people with no experience in the field

Disadvantage:

Trees are quite sensitive to the original sample (non-robust)

They may have poor predictive accuracy out-sample

Emmanuel Flachaire Classification and Regression Tree (CART)


2. Methods and Algorithms
Ridge and Lasso regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random Forest

How bagging and random forest work intuitively:

Based on your symptoms, suppose a doctor diagnoses an


illness that requires surgery

Instead of asking only one doctor, you may choose to ask several

If one diagnosis occurs more often than any other, you may choose
this one as the final diagnosis

→ the final diagnosis is made based on a majority vote of doctors

Bagging and Random Forest: replace doctors by bootstrap samples

Emmanuel Flachaire Bagging and Random Forest


Bagging: algorithm

Algorithm 1: Bagging
Select number of trees B, and tree depth D;
for b = 1 to B do
generate a bootstrap sample from the original data;
estimate a tree model of depth D on this sample;
end

For instance, with the titanic dataset:


library(rpart); library(rpart.plot)
library(PASWR); data(titanic3)
n = NROW(titanic3$survived)
par(mfrow=c(3,3))
for (i in 1:9) {
  idx = sample(1:n, size=n, replace=TRUE)
  cart = rpart(as.factor(survived) ~ sex+age+pclass+sibsp+parch,
               data=titanic3[idx,], cp=0)   # unpruned
  prp(cart, type=1, extra=1)}

Emmanuel Flachaire Bagging and Random Forest


Bagging: Generate several trees by bootstrapping
[Figure: nine unpruned classification trees, each grown on a different bootstrap sample of the Titanic data; the trees differ noticeably from one bootstrap sample to another]
Emmanuel Flachaire Bagging and Random Forest


Bagging: Why bootstrapping CART model?

Bagging = Bootstrap aggregating

Prediction:

Regression: average the resulting predictions


Classification: take a majority vote

Impact of bootstrapping:

Averaging a set of observations reduces variance28


It reduces variance and hence increases the prediction accuracy
Compared to CART, the results are much less sensitive to the
original sample and show an impressive improvement in accuracy
Loss of interpretability

28
The variance of the mean of the observations, X̄, is given by σ²/n
Emmanuel Flachaire Bagging and Random Forest
Random Forest: algorithm

Algorithm 2: Random Forest


Select number of trees B, subsampling parameter m, tree
depth D;
for b = 1 to B do
generate a bootstrap sample from the original data;
estimate a tree model on this sample;
for each split do
Randomly select m of the original covariates (m < P);
Split the data with the best covariate (among the m);
end
end

→ Random Forest = Bagging + subsample covariates at each node


→ Bagging is a special case of Random Forest, with m = P

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random Forest

Random Forest = Bagging + subsampling covariates at each node

Emmanuel Flachaire Bagging and Random Forest


Random forest: Why subsampling covariates?

Subsampling covariates may sound crazy, but it has a clever rationale:


Suppose there is one very strong covariate in the sample
Most or all trees will use this covariate in the top split
All of the trees will look quite similar to each other
Hence the predictions will be highly correlated

Averaging many highly correlated quantities does not lead to


a large reduction in variance

Random forests overcome this problem by forcing each split to


consider only a subset of the covariates

→ Random forests decorrelate the trees



In practice, default values: m = p/3 in regression and m = √p in classification

Emmanuel Flachaire Bagging and Random Forest


Random forest: Overfitting

There is not much overfitting in random forests . . .

as B increases: average effect over trees → no overfitting

as D increases: overfitting is argued to be minor


”Segal (2004) demonstrates small gains in performance by controlling the
depths of the individual trees grown in random forests. Our experience
is that using full-grown trees seldom costs much, and results in one less
tuning parameter. Figure 15.8 shows the modest effect of depth control
in a simple regression example.” (Hastie et al., 2009, p.596)

The goal is to grow trees with as little bias as possible. The


high variance that would result from deep trees is tolerated
because of the averaging over a large number of trees

. . . However, a simple example shows that it can be problematic

Emmanuel Flachaire Bagging and Random Forest


Random forest: Overfitting . . . a simple example
Let us consider a realistic (simulated) sample
set.seed(1)
n=200
x=runif(n)
y=sin(12*(x+.2))/(x+.2) + rnorm(n)/2

We can fit CART and random forest models:29


library(rpart); library(randomForest)            # packages needed
fit.tr  <- rpart(y ~ x)                          # CART
fit.ba1 <- randomForest(y ~ x)                   # no depth control
fit.ba2 <- randomForest(y ~ x, maxnodes=20)      # depth control

We can plot observations and predicted values:


u=seq(min(x), max(x), length.out=1000)
plot(x, y, col="gray", main="n=200")
lines(u, predict(fit.ba1, data.frame(x=u)), col="green")
lines(u, predict(fit.ba2, data.frame(x=u)), col="red")
lines(u, predict(fit.tr, data.frame(x=u)), col="blue")

We run this code for n = 200 and n = 10000


29
Note that since it is a simple regression, with 1 covariate, then RF=bagging
Emmanuel Flachaire Bagging and Random Forest
Random forest: Overfitting . . . a simple example

→ improvement of random forest over a single regression tree


→ overfitting can be very large without controlling tree depth

Emmanuel Flachaire Bagging and Random Forest


Random forest: Out-of-bag (OOB) observations

No need to perform cross-validation:

By bootstrapping, each tree uses around 2/3 of the obs. The


remaining 1/3 obs are referred to as the out-of-bag (OOB) obs

Use OOB observations for out-sample predictions

We obtain around B/3 out-sample predictions for the i th obs.

average these values (or majority vote) = OOB prediction for i

An OOB-MSE can be computed over all OOB predictions

The OOB approach for estimating the test error is particularly


convenient with large samples, for which CV would be onerous

Emmanuel Flachaire Bagging and Random Forest


Random forest: Tuning parameters

We can use OOB-MSE to tune Random Forest parameters

Depth tree, D: from our previous example, with n = 10000


> randomForest(y ~ x)$mse[500]   # OOB-MSE, no depth control
[1] 0.3252183

> maxnode=c(10, 50, 100, 500, 1000, 2000)
> for (i in 1:NROW(maxnode)) {   # OOB-MSE, depth control
>   aa=randomForest(y ~ x, maxnodes=maxnode[i])$mse[500];
>   print(c(maxnode[i], aa)) }
[1]   10.0000000  0.3747725
[1]   50.0000000  0.2553131
[1]  100.0000000  0.2508479
[1]  500.0000000  0.2570217
[1] 1000.000000   0.268357
[1] 2000.0000000  0.2921307

We can see that the OOB-MSE is smallest with maxnodes=100

Subsampling parameter, m: can be selected similarly

Emmanuel Flachaire Bagging and Random Forest


Random forest: Variable importance

Random forest improves prediction accuracy at the expense of


interpretability . . . the resulting model is difficult to interpret

One can obtain an overall summary of the importance of each


covariate using the SSR (regression) or the Gini index (classification)

Index: record the total amount that the SSR/Gini is decreased


due to splits over a given covariate, averaged over all B trees
> rf <- randomForest(as.factor(survived) ~ sex+age+pclass+sibsp+parch,
+                    data=titanic3, na.action=na.omit)
> importance(rf)
       MeanDecreaseGini
sex           133.75916
age            63.13448
pclass         52.45753
sibsp          18.74009
parch          17.49320

Emmanuel Flachaire Bagging and Random Forest


Random Forests compared to Single Trees (CART)

Source: Breiman (2001)

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random Forest

Advantages:

They tend to work well for problems where there are


important nonlinearities and interactions.

They are robust to the original sample and more efficient than
single trees

Disadvantage:

The results are not intuitive and difficult to interpret.

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random forest: Exercise
Consider the dataset used to predict baseball player’s salary:
library(ISLR)
df=Hitters[complete.cases(Hitters$Salary),]

Create a training set consisting of the first 200 observations, and a


test set consisting of the remaining observations
Perform bagging on the training set for a range of values of the tree
depth D, with B = 1000 trees. Produce a plot with D on the x-axis
and the corresponding test set MSE on the y -axis
Perform random forest on the training set with B = 1000 trees for
several values of the subsampling parameter m, and compute the
corresponding test set MSEs
Compare the test MSE of bagging and random forest to the test
MSE that results from a CART model
Which variables appear to be the most important predictors in the
random forest model?

Emmanuel Flachaire Bagging and Random Forest


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Boosting


Boosting: Principle

Like bagging, boosting involves combining a large number of


decision trees, but the trees are grown sequentially
Boosting does not involve bootstrap sampling; instead each
tree is fit on a modified version of the original data set:
Each tree is fit to the residuals from the previous tree model
Each iteration is then focused on improving previous errors 30
Each tree is shallow (low depth): ”weak” classifier/predictor 31

Boosting combines the outputs of many ”weak” learners


(classifiers, predictors) to produce a powerful ”committee”

30
Each subsequent model pays more attention to the errors from previous
models . . . it is a process that learns from past errors
31
Weak classifier: its error rate is only slightly better than random guessing
Emmanuel Flachaire Boosting
Boosting for Regression
y ∈ R is a quantitative variable

Emmanuel Flachaire Boosting


Boosting: algorithm

Algorithm 3: Boosting for regression trees


Select number of trees B, tree depth D, shrinkage parameter λ;
Set initial predicted values, m̂(x) = 0;
for b = 1 to B do
Compute the residuals, r = y − m̂(x);
Fit a regression tree m̂b (x) of depth D to the data (r , x);
Update the predicted values: m̂(x) ← m̂(x) + λ m̂b (x);
end

→ By fitting trees to the residuals, we seek to improve m̂ in areas


where it does not perform well
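A minimal R sketch of Algorithm 3 with rpart as the tree learner (the values of B, D and λ are illustrative, not the defaults of any package):

library(rpart)
boost_reg <- function(x, y, B = 500, D = 1, lambda = 0.1) {
  mhat  <- rep(0, length(y))                     # initial predicted values m(x) = 0
  trees <- vector("list", B)
  for (b in 1:B) {
    r <- y - mhat                                # residuals of the current fit
    trees[[b]] <- rpart(r ~ x, data = data.frame(r = r, x = x),
                        control = rpart.control(maxdepth = D, cp = 0))
    mhat <- mhat + lambda * predict(trees[[b]])  # shrunken update of m(x)
  }
  list(trees = trees, lambda = lambda, fitted = mhat)
}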

Emmanuel Flachaire Boosting


Boosting: Overfitting

Number of trees, B:

The role of each new (sequential) tree is to improve the fit


Unlike random forests, boosting can overfit if B is too large 32

Depth of trees, D:

In CART, fully-grown or deep trees are known to overfit


Boosting can then overfit if D is too large
Depth tree is usually very small, by default it is often D = 1

→ B and D can be selected by cross-validation

32
By averaging over a large number of trees, bagging and random forests
reduce variability. Boosting does not average over the trees
Emmanuel Flachaire Boosting
Boosting: Shrinkage

Idea behind shrinkage:

Slow down the boosting process to avoid overfitting . . . scale


the contribution of each tree by a factor 0 < λ < 1
A smaller λ typically requires more trees B. It allows more
and different shaped trees to attack the residuals33

→ Small values of D and λ: by fitting small trees to the residuals,


we slowly improve m̂ in areas where it does not perform well 34
→ The boosting approach learns slowly (λ = learning rate)
→ Statistical methods that learn slowly tend to perform well

33
Typical values are λ = 0.01, or λ = 0.001
34
By default, D = 1 and λ = 0.1 in the gbm function in R
Emmanuel Flachaire Boosting
Boosting: a simple regression example
Let us consider a realistic (simulated) sample
set.seed(1)
n = 200
x = runif(n)
y = sin(12*(x + .2))/(x + .2) + rnorm(n)/2

We can fit CART and boosting models:


library(gbm)
nb = 500
# By default: interaction.depth=1 and shrinkage=0.1
fit.bo <- gbm(y ~ x, distribution = "gaussian", n.trees = nb)
fit.tr <- rpart(y ~ x)

We can plot observations and predicted values:


u = seq(min(x), max(x), length.out = 1000)
plot(x, y, col = "gray", main = "n=200", xlab = NA, ylab = NA)
lines(u, predict(fit.tr, data.frame(x = u)), col = "blue")
lines(u, predict(fit.bo, data.frame(x = u), n.trees = nb), col = "purple")

We run this code for n = 200 and n = 10000


Emmanuel Flachaire Boosting
Boosting: a simple regression example

→ Boosting provides nice improvement over single regression tree

Emmanuel Flachaire Boosting


Boosting: Exercise

Consider the previous simple regression example:


Re-run the code with D = 4, B = 1000 and λ = 1. Do you
observe overfitting?
Perform boosting with different values of D, B and λ and
look how sensitive the results are to these choices

Consider the random forest exercise, on baseball player’s salary:


Perform boosting on the training set for a range of values of
the shrinkage parameter λ, with B = 1000 trees and D = 1.
Produce a plot with different shrinkage values on the x-axis
and the corresponding test set MSE on the y-axis.
Compare the test MSE of boosting to the test MSE that
results from bagging, random forest and CART model

Emmanuel Flachaire Boosting


Boosting for Classification
y ∈ {−1, 1} is a qualitative variable

Emmanuel Flachaire Boosting


AdaBoost algorithm
Algorithm 4: AdaBoost
Select the number of trees B, and the tree depth D;
Set initial weights, wi = 1/n;
for b = 1 to B do
Fit a classification tree m̂b (x) to the data using weights wi ;
Update the weights: increase wi if i is misclassified, decrease wi otherwise†;
end
Output: ŷi = sign( ∑_{b=1}^{B} αb m̂b (xi ) )

† If i is misclassified: wi ← wi e^{αb}, where αb = log((1 − errb)/errb) and errb is the model's
weighted misclassification error, errb = ∑_{i=1}^{n} wi 1(yi ≠ m̂b (xi )) / ∑_{i=1}^{n} wi.
If i is correctly classified: wi ← wi.

→ Observ. misclassified have more influence in the next classifier


→ In the output, the contributions from classifiers that fit the data
better are given more weight (a larger αb means a better fit)
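A minimal R sketch of the weight updates of Algorithm 4, using rpart stumps as weak classifiers (y is assumed coded in {−1, 1}; B and the guard on errb are illustrative choices, not part of the slides):

library(rpart)
adaboost <- function(x, y, B = 50) {            # y in {-1, 1}
  n <- length(y); w <- rep(1/n, n)              # initial weights
  alpha <- numeric(B); trees <- vector("list", B)
  for (b in 1:B) {
    trees[[b]] <- rpart(factor(y) ~ x, weights = w,
                        control = rpart.control(maxdepth = 1, cp = 0))
    pred <- ifelse(predict(trees[[b]], type = "class") == "1", 1, -1)
    err  <- sum(w * (pred != y)) / sum(w)       # weighted misclassification error
    err  <- max(min(err, 1 - 1e-10), 1e-10)     # guard against err = 0 or 1 in this sketch
    alpha[b] <- log((1 - err) / err)
    w <- w * exp(alpha[b] * (pred != y))        # increase weights of misclassified obs.
  }
  list(trees = trees, alpha = alpha)            # final class: sign of sum_b alpha_b * m_b(x)
}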
Emmanuel Flachaire Boosting
Schematic illustration of the boosting framework

(Source: Bishop 2006, Pattern recognition and machine Learning, Figure 14.1)

Emmanuel Flachaire Boosting


Boosting vs. bagging

(Source: Internet)

→ Bootstrap samples ≡ Original sample reweighted independently

Emmanuel Flachaire Boosting


Illustration of boosting for classification tree

(Source: Bishop (2006), Pattern recognition and machine Learning, Figure 14.2)

Emmanuel Flachaire Boosting


Generalizations into a unifying framework

Breiman referred to AdaBoost with trees as the ”best


off-the-shelf classifier in the world” (NIPS Workshop, 1996)

Friedman et al. (2000) show that Adaboost fits an additive


model in a base learner, optimizing a novel exponential loss
function, which is very similar to the binomial log-likelihood

They proposed generalizations into a unifying framework,


which includes several loss functions that can be used

They describe loss functions for regression and classification


that are more robust than squared error or exponential loss

→ Gradient boosting

Emmanuel Flachaire Boosting


Stochastic gradient boosting
Algorithm 5: Stochastic gradient boosting
Select number of trees B, tree depth D, shrinkage parameter λ;
for b = 1 to B do
Compute the gradient vector, ri = −∂L(yi , m(xi ))/∂m(xi );
Draw a subset of the original sample (r ∗ , x ∗ );
Fit a regression tree mb (x) of depth D to the data (r ∗ , x ∗ );
Update the predicted values: m(x) ← m(x) + λ mb (x);
end

→ Gradient boosting: Depending the choice of the loss function,


we consider a specific regression or classification model
→ Stochastic:
Shrinkage: slow down the boosting process to avoid overfitting
Subsampling: it reduces the computing time and, in many
cases, it produces a more accurate model (see random forest)
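As an illustration, the gbm call below (a sketch with illustrative tuning values; y and df are placeholders for a response and a data frame of covariates) activates both ingredients: shrinkage is the λ of Algorithm 5 and bag.fraction controls the subsampling of the original sample at each iteration.

library(gbm)
fit.sgb <- gbm(y ~ ., data = df, distribution = "gaussian",
               n.trees = 1000,          # B
               interaction.depth = 2,   # D
               shrinkage = 0.01,        # lambda, the learning rate
               bag.fraction = 0.5)      # fraction of the sample drawn at each iteration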
Emmanuel Flachaire Boosting
Loss functions in regression, y ∈ R
Squared error loss function:

L = ½ (yi − m(xi ))²

for which the gradient vector is the residuals ri = yi − m(xi )

Absolute error loss function, or Laplacian:35

L = |yi − m(xi )|

→ median of the conditional distribution . . . robust regression

Huber loss function: a robust alternative to absolute error loss,

L = ½ (yi − m(xi ))²            if |yi − m(xi )| ≤ δ
L = δ (|yi − m(xi )| − δ/2)     if |yi − m(xi )| > δ

35 We can also derive a quantile loss function: L = α|yi − m(xi )| if yi − m(xi ) > 0, and
L = (1 − α)|yi − m(xi )| otherwise (α: desired quantile)
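A small R sketch of these loss functions and of their negative gradients, which play the role of the working "residuals" in gradient boosting (the value of δ is illustrative; this is a way to check the formulas, not the gbm implementation):

sq_loss    <- function(y, m) 0.5 * (y - m)^2
abs_loss   <- function(y, m) abs(y - m)
huber_loss <- function(y, m, delta = 1) {
  r <- y - m
  ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - delta/2))
}
# negative gradients of each loss with respect to m(x)
grad_sq    <- function(y, m) y - m
grad_abs   <- function(y, m) sign(y - m)
grad_huber <- function(y, m, delta = 1) {
  r <- y - m
  ifelse(abs(r) <= delta, r, delta * sign(r))
}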
Emmanuel Flachaire Boosting
Loss functions in regression: A comparison

(Source: Hastie et al., 2009)

→ When robustness is a concern, squared error is not the best criterion

Emmanuel Flachaire Boosting


Loss functions in classification, y ∈ {−1, 1}
Misclassification loss function:36

L = 1(sign[m(x)] ≠ y)

AdaBoost loss function:

L = e^{−y m(x)}

Bernoulli loss function, or Binomial deviance:

L = log(1 + e^{−2y m(x)})

→ Minimizing the AdaBoost or Bernoulli loss functions leads to the
same solution at the population level . . . not in finite sample
→ The Bernoulli loss function is more robust to outliers in finite sample
36
The sign of m(xi ) implies that observations with yi m(xi ) > 0 (< 0) are
classified correctly (misclassified)
Emmanuel Flachaire Boosting
Loss functions in classification: A comparison
(horizontal axis: y·m(x) — misclassified observations to the left, correctly classified to the right)

(Source: Hastie et al., 2009)

→ More weight for obs. more clearly misclassified (large negative ym(x))
→ When robustness is a concern, exponential loss is not the best criterion
Emmanuel Flachaire Boosting
Tuning parameters

The number of trees B. Unlike bagging and random forests,


boosting can overfit if B is too large, although this overfitting
tends to occur slowly if at all.

The number of splits D in each tree, which controls the


complexity of the boosted ensemble. Often D = 1 works well,
in which case each tree is a stump, consisting of a single split.

The shrinkage parameter λ. This controls the rate at which


boosting learns. Typical values are 0.01 or 0.001, and the
right choice can depend on the problem.37

→ We use cross-validation to select B, D and λ

37
Very small λ can require very large B in order to achieve good performance
Emmanuel Flachaire Boosting
Boosting: Interpretation

Single tree are highly interpretable. Linear combinations of


trees must therefore be interpreted in a different way.

Variable importance: using the relative importance of a


variable for a single tree,38 we then average over the trees39

After the most relevant variables have been identified, the


next step is to attempt to understand the nature of the
dependence of the approximation m(X ) on their joint values

Partial dependence plot illustrate the marginal effect of the


selected variables on the response after integrating out the
other variables.
38
The squared relative importance of Xl is the sum of squared improvements
over all internal nodes for which it was chosen as the splitting variable
39
Due to the stabilizing effect of averaging, this measure turns out to be
more reliable than is its counterpart (10.42) for a single tree
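In practice, with the gbm package both quantities are available directly (a sketch, assuming a boosted model fit.bo with nb trees as in the earlier simulated example):

library(gbm)
# relative influence of each covariate, averaged over the B trees
summary(fit.bo, n.trees = nb)
# partial dependence of the fitted model on the first covariate
plot(fit.bo, i.var = 1, n.trees = nb)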
Emmanuel Flachaire Boosting
Application: spam email

The data for this example consists of information from 4601 email
messages, in a study to try to predict whether the email was spam.
The response variable is binary, with values email or spam, and
there are 57 predictors as described below:
48 quantitative predictors: the percentage of words in the email
that match a given word40
6 quantitative predictors: the percentage of characters in the email
that match a given character (; ! # ( [ $)
Uninterrupted sequences of capital letters: average length (CAPAVE),
length of the longest (CAPMAX), sum of the length (CAPTOT)

→ use gradient boosting to design an automatic spam detector that


could filter out spam before clogging the users’ mailboxes

40
Examples include business, address, internet, free, and george. The
idea was that these could be customized for individual users (Hastie et al, 2009)
Emmanuel Flachaire Boosting
Application: spam email

(Source: Hastie et al., 2009)

Emmanuel Flachaire Boosting


Application: partial dependence

(Source: Hastie et al., 2009)

→ effect of Xj on m(X ) after accounting for the average effects of the other variables

Emmanuel Flachaire Boosting


Application: joint frequencies

→ This plot displays strong interaction effects

Emmanuel Flachaire Boosting


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Support Vector Machine (SVM)


Introduction

A method developed in the computer science community in


the 1990s

It uses a basis expansion to capture non-linear class


boundaries

Well suited for classification of complex but small- or


medium-sized datasets

Often considered one of the best ”out of the box” classifiers

Emmanuel Flachaire Support Vector Machine (SVM)


Support Vector Classifier
The separable case

Emmanuel Flachaire Support Vector Machine (SVM)


Classification and hyperplane

Source: James et al. (2013)

A hyperplane separates the space in two halves:

β0 + X1 β1 + X2 β2 > 0 (blue) or < 0 (red)

An ∞ number of hyperplanes, with same classification score


What would make a difference is their capacity to generalize

Emmanuel Flachaire Support Vector Machine (SVM)


The Maximal Margin Classifier
Margins = two parallel separating hyperplanes, located at the
smallest distance from the observations of each class

Margins: the dashed lines

Support vectors: the two blue


points and the purple point
that lie on the margins

Optimal hyperplane: solid line

Source: James et al. (2013)

Principle: Maximize the distance between the two margins


The maximal margin (or optimal) hyperplane is the separating
hyperplane that is farthest from the training observations
Emmanuel Flachaire Support Vector Machine (SVM)
How to find the maximal margin?

It is an optimization problem:

max_{β : ‖β‖=1}  M    subject to   yi (Xi β) ≥ M,  ∀i = 1, . . . , n

Here, ‖β‖ = 1 ensures that the perpendicular distance from the i-th
observation to the hyperplane is given by yi (Xi β). Thus, the
restriction ensures that each observation is on the correct side of
the hyperplane and at least a distance M from the hyperplane.
It is equivalent to:41

min_β  ‖β‖²    subject to   yi (Xi β) ≥ 1,  ∀i = 1, . . . , n

41 We get rid of ‖β‖ = 1 by replacing the restriction with yi (Xi β) ≥ M‖β‖
and by setting ‖β‖ = 1/M, see Hastie et al. (2009, section 4.5.2)
Emmanuel Flachaire Support Vector Machine (SVM)
Sensitivity to individual observations
Adding one blue observation leads to a quite different hyperplane,
with a significant decrease of the distance between the two margins

→ It could be worthwhile to misclassify a few training observations


in order to obtain a better generalization (out-sample classification)

Emmanuel Flachaire Support Vector Machine (SVM)


Support Vector Classifier (SVC)

Why should we consider a classifier that is not a perfect separator?


In the interest of:
greater robustness to individual observations
better classification of the out-sample observations

Underlying principles:42
SVC: maximal margin classifier, tolerating margin violations
Logit: minimize misclassification error

42
Figures: logit ≈ SVC (left), logit=solid line & SVC=dashed line (right)
Emmanuel Flachaire Support Vector Machine (SVM)
How to tolerate margin violations?
It is a slightly modified optimization problem:

max_{β : ‖β‖=1}  M    subject to   yi (Xi β) ≥ M(1 − εi ),

and  εi ≥ 0,   ∑_{i=1}^{n} εi ≤ C,

∀i = 1, . . . , n, where C is a nonnegative tuning parameter.

ε1 , . . . , εn are slack variables that allow observations to be on
the wrong side of the margin (εi > 0) or of the hyperplane (εi > 1)
C is a budget for the amount that the margins can be violated
C = 0: no margin violation is tolerated
as C increases, we become more tolerant of margin violations
C is the maximal number of observations allowed to be on the
wrong side of the hyperplane
In practice, C is a tuning parameter chosen by cross-validation
Emmanuel Flachaire Support Vector Machine (SVM)
Support Vector Classifier vs. Logit model
SVC: the previous optimization problem can be rewritten as:43
min_β  ∑_{i=1}^{n} max[0, 1 − yi (Xi β)]  +  λ ∑_{k=1}^{K} βk²

where the first term is the hinge loss function. It's a minimization
of the hinge loss function with penalization.

Logit: minimizing misclassification, we have:

min_β  ∑_{i=1}^{n} log(1 + e^{−yi (Xi β)})

where the summand is the softmax function. It's a minimization of
the softmax function, with no penalization.

→ SVC ≈ penalized Logit model, using a hinge loss function
→ Role of penalization = tradeoff between min misclassification & max margin
43
With λ = 1/(2C ), see Hastie et al. (2009)
Emmanuel Flachaire Support Vector Machine (SVM)
SVC and Logit: loss function

Source: James et al. (2013) — horizontal axis: yi (Xi β)


Overall, the two loss functions have quite similar behavior
Hinge loss = 0 for obs on the correct side of the margin: yi (Xi β) > 1

Emmanuel Flachaire Support Vector Machine (SVM)


SVC and Logit: The separable case

Left: Logit ≈ SVC with C = 0


Right: Logit ≉ SVC with C > 0 (chosen by cross-validation)
→ SVC = tradeoff between min misclassification & max margin 44

44 max margin: pushing away the obs. as far as possible from the hyperplane;
min misclassification: smallest aggregated distance from the hyperplane of the wrongly classified obs.
Emmanuel Flachaire Support Vector Machine (SVM)
Support Vector Classifier
The non-separable case

Emmanuel Flachaire Support Vector Machine (SVM)


The non-separable case

In the non-separable case, some observations are on the wrong


side of the hyperplane

The Maximal Margin Classifier has no solution

Logit minimizes the aggregated distance from the hyperplane


of the misclassified observations, not the number of misclassifications
SVC is a tradeoff between:
minimizing the aggregated distance from the hyperplane of the
misclassified observations
pushing away as far as possible from the hyperplane the
correctly classified observations

Emmanuel Flachaire Support Vector Machine (SVM)


SVC and Logit: The non-separable case

Left: Logit ≉ SVC with C small

Right: Logit ≉ SVC with C chosen by cross-validation
SVC: 1 mistake - Logit: 3 mistakes
Emmanuel Flachaire Support Vector Machine (SVM)
Support Vector Machine
Nonlinear separability

Emmanuel Flachaire Support Vector Machine (SVM)


Support Vector Machine (SVM)

Many datasets are not linearly separable

Adding polynomial features and interactions can be used

But a low polynomial degree cannot deal with very complex


datasets

The support vector machine (SVM) is an extension of the


support vector classifier that results from enlarging the feature
space in a specific way, using kernels.

SVM works well for complex but small- or medium-sized


datasets

Emmanuel Flachaire Support Vector Machine (SVM)


Moving into higher dimension

Find a SVM classifier to identify teenagers from the height:45

 
Using the projection ϕ : x ↦ ( (x−150)/10 , ((x−150)/10)² ), we obtain:

The data are linearly separable in the 2-dimensional space


45
Source: Internet
Emmanuel Flachaire Support Vector Machine (SVM)
The kernel trick
The data are not linearly separable in the 2-dimensional space, S

The kernel trick: Source: https://freakonometrics.hypotheses.org/52775

The data are linearly separable in the 3-dimensional space, S 0


Emmanuel Flachaire Support Vector Machine (SVM)
SVM: The optimization problem

It is the SVC optimization problem, with transformed covariates:

max_{β : ‖β‖=1}  M    subject to   yi (ϕ(Xi )β) ≥ M,  ∀i

or    min_β  ∑_{i=1}^{n} max[0, 1 − yi (ϕ(Xi )β)]  +  λ ∑_{k=1}^{K} βk²

In the resolution, ϕ only appears in the form ϕ(Xi )ᵀϕ(Xj ). Thus,

we don’t need to express explicitely ϕ


we don’t need to express the higher dimension space S 0

We use a kernel function defined as K(x, x′) = ϕ(x)ᵀϕ(x′)

Emmanuel Flachaire Support Vector Machine (SVM)


Polynomial kernel

The kernel should be a symmetric positive (semi-) definite function.

The dth-degree polynomial kernel is: K(x, x′) = (1 + ⟨x, x′⟩)^d

1st-degree polynomial kernel with two covariates X1 and X2 :46

K(X, X′) = (1 + ⟨X, X′⟩) = (1 + X1 X1′ + X2 X2′)

With ϕ(X) = {1, X1, X2}, we have K(X, X′) = ϕ(X)ᵀϕ(X′).


It corresponds to the linear case, or SVC.

→ SVM with 1st-degree polynomial kernel is similar to SVC

46 With two n-vectors, the inner product is: ⟨x1, x2⟩ = x1ᵀ x2 = ∑_{i=1}^{n} x1i x2i
Emmanuel Flachaire Support Vector Machine (SVM)
Polynomial kernel

The dth-degree polynomial kernel is: K(x, x′) = (1 + ⟨x, x′⟩)^d

2nd-degree polynomial kernel with two covariates X1 and X2 :

K(X, X′) = (1 + ⟨X, X′⟩)² = (1 + X1 X1′ + X2 X2′)²
         = 1 + 2X1 X1′ + 2X2 X2′ + (X1 X1′)² + (X2 X2′)² + 2X1 X1′ X2 X2′

Here, ϕ(X) = {1, √2 X1, √2 X2, X1², X2², √2 X1 X2} defines a
6-dimensional space, with squared and interaction terms

We move from the 3-dimensional space to a 6-dimensional space

→ SVM with dth-degree polynomial kernel (d ≥ 2) is similar to


SVC with additional powers and interaction terms of the covariates
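A quick numerical check of this identity in R (the two vectors are arbitrary): the 2nd-degree polynomial kernel equals the inner product of the expanded features ϕ(X).

phi <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2])
K   <- function(x, xp) (1 + sum(x * xp))^2   # 2nd-degree polynomial kernel
x  <- c(0.3, -1.2)
xp <- c(2.0,  0.7)
K(x, xp)                 # kernel value: 0.5776
sum(phi(x) * phi(xp))    # same value via the explicit feature map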

Emmanuel Flachaire Support Vector Machine (SVM)


Radial kernel

Radial basis function (RBF) kernel: K(x, x′) = exp(−γ‖x − x′‖²)

γ > 0 accounts for the smoothness of the decision boundary47

It returns values between 0 and 1:
It returns a large value for x close to x′
It returns a small value for x far from x′

It is a similarity measure between two observations

→ The radial kernel has a local behavior

47
Bias-variance tradeoff: large value of γ leads to high variance (overfitting),
small value leads to low variance and smoother boundaries (underfitting)
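A minimal R sketch of this kernel (the value of γ is illustrative), showing its local behavior — the value decays quickly as the two points move apart:

rbf_kernel <- function(x, xp, gamma = 0.5) exp(-gamma * sum((x - xp)^2))
x <- c(0, 0)
rbf_kernel(x, c(0.1, 0.1))  # close points -> value near 1
rbf_kernel(x, c(3, 3))      # distant points -> value near 0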
Emmanuel Flachaire Support Vector Machine (SVM)
Illustration: Simulated data
set.seed(1)
x = matrix(rnorm(200*2), ncol = 2)
x[1:100,] = x[1:100,] + 2
x[101:150,] = x[101:150,] - 2
y = c(rep(1, 150), rep(2, 50))
plot(x[,2], x[,1], pch = 16, col = y*2)

Non-linear decision boundaries → SVC will perform poorly


Emmanuel Flachaire Support Vector Machine (SVM)
Illustration: Fit SVM with polynomial and radial kernels

We can fit a SVM with 2nd-degree polynomial kernel and fixed


cost of constraints violation:
library(e1071)
dat = data.frame(x = x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "polynomial", cost = 1, degree = 2)
plot(svmfit, dat, grid = 200)

Or select the cost parameter by 10-fold CV among several values:


tune.out = tune(svm, y ~ ., data = dat, kernel = "polynomial", degree = 2,
                ranges = list(cost = c(.1, 1, 10, 100)))
plot(tune.out$best.model, dat, grid = 200)
summary(tune.out)

Similarly, we can fit a SVM with radial kernel:48


tune.out = tune(svm, y ~ ., data = dat, kernel = "radial",
                ranges = list(cost = c(.1, 1, 10, 100), gamma = c(.5, 1, 2, 3, 4)))
plot(tune.out$best.model, dat, grid = 200)

48
We then have 2 tuning parameters, the cost of constraints violation and γ
Emmanuel Flachaire Support Vector Machine (SVM)
Illustration: Polynomial vs. radial kernels

Either kernel is capable of capturing the decision boundary


However, the results are different

Emmanuel Flachaire Support Vector Machine (SVM)


ROC curve

With more than 2 covariates, we can’t plot decision boundary

We can produce a ROC curve to analyze the results

SVM doesn't give probabilities of belonging to each class, as logit does

We compute scores of the form fˆ(X ) = ϕ(Xi )β̂ for each obs.
Scores = predicted values.

For any given cutoff t, we can classify observations into a


category, depending on whether

fˆ(X ) < t or fˆ(X ) ≥ t

The ROC curve is obtained by computing the false positive


and true positive rates for a range of values of t

Emmanuel Flachaire Support Vector Machine (SVM)


Illustration: ROC curves
We write a short function to plot a ROC curve:
library(ROCR)
rocplot = function(pred, truth, ...) {
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, ...) }

We can fit a SVM with radial kernel and plot a ROC curve:
set.seed(1)
train = sample(200, 100)
train = sort(train, decreasing = TRUE)  # to avoid a reversed ROC
svmfit = svm(y ~ ., data = dat[train,], kernel = "radial", cost = 1,
             gamma = 0.5)
fit = attributes(predict(svmfit, dat[-train,], decision.values = TRUE))$decision.values
rocplot(fit, dat[-train, "y"], main = "Test Data", col = "red")

We can also fit a Logit model and plot a ROC curve:


lgt = glm(y ~ ., data = dat[train,], family = binomial(link = 'logit'))
fit = predict(lgt, dat[-train,], type = "response")
par(new = TRUE)
rocplot(fit, dat[-train, "y"], col = "green")

Emmanuel Flachaire Support Vector Machine (SVM)


Illustration: ROC curves

As expected in this example, SVM outperforms Logit model

Emmanuel Flachaire Support Vector Machine (SVM)


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Neural Networks and Deep Learning


Neural networks with one covariate

Emmanuel Flachaire Neural Networks and Deep Learning


Looking for a more flexible model . . .
A linear model may be quite restrictive:

y ≈ α + βx

We can obtain a more flexible model by adding:


successive powers . . . . . . . . . . . . . . . . . . . polynomial regression49
y ≈ α + ∑_{m=1}^{M} βm x^m

nonlinear functions of linear combinations . . neural networks50


y ≈ α + ∑_{m=1}^{M} βm f(αm + δm x)

where f is an activation function – a fixed nonlinear function


49
y ≈ α + β1 x + β2 x 2 + β3 x 3 + . . .
50
y ≈ α + β1 f (α1 + δ1 x) + β2 f (α2 + δ2 x) + β3 f (α3 + δ3 x) + . . .
Emmanuel Flachaire Neural Networks and Deep Learning
Common examples of activation functions
The logistic (or sigmoid) function: f(x) = 1/(1 + e^{−x})
The hyperbolic tangent function: f(x) = tanh(x) = (e^x − e^{−x})/(e^x + e^{−x})
The Rectified Linear Unit (ReLU): f(x) = max(0, x) = (x)₊

Source: Géron (2017)
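These three activation functions are easy to define and inspect directly in R (a small sketch, not part of the slides' code):

sigmoid <- function(x) 1 / (1 + exp(-x))
tanh_f  <- function(x) tanh(x)            # built-in hyperbolic tangent
relu    <- function(x) pmax(0, x)
curve(sigmoid, -4, 4, col = "blue", ylim = c(-1, 2), ylab = "f(x)")
curve(tanh_f, -4, 4, col = "red", add = TRUE)
curve(relu, -4, 4, col = "darkgreen", add = TRUE)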

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network vs. polynomial: A simple example

Let us consider a realistic (simulated) sample


set.seed(1); n = 200
x = sort(runif(n))
y = sin(12*(x + .2))/(x + .2) + rnorm(n)/2
df = data.frame(y, x)

We can fit a polynomial regression with M = 3:


ols = lm(y ~ x + I(x^2) + I(x^3))
plot(x, y, main = "Polynomial: M=3")
lines(x, predict(ols), col = "blue")

We can fit a neural network model with M = 3:


library(neuralnet)
nn = neuralnet(y ~ x, data = df, hidden = 3, threshold = .05)
yfit = compute(nn, data.frame(x))$net.result
plot(x, y, main = "Neural Networks: M=3")
lines(x, yfit, col = "red")

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network vs. polynomial: A simple example
(Figure: observations and fitted curves on the simulated sample, n = 200 — left panel: Polynomial: M=3; right panel: Neural Networks: M=3)

→ Neural networks can capture nonlinearity


Emmanuel Flachaire Neural Networks and Deep Learning
A weighted sum of fixed/adjustable components
(Figure: top row — the fixed components x, x², x³ and their weighted sum α1 + β1x + β2x² + β3x³; bottom row — the adjustable components f(α1 + b1x), f(α2 + b2x), f(α3 + b3x) and their weighted sum α + β1f(α1 + δ1x) + β2f(α2 + δ2x) + β3f(α3 + δ3x))

Emmanuel Flachaire Neural Networks and Deep Learning


Fixed vs. adjustable components
Why neural networks perform better than polynomial regression in
the previous example?
Polynomial regression is based on fixed components, or
bases:51
x, x 2 , x 3 , . . . , x M
Neural network is based on adjustable components, or bases:52

f (α1 + δ1 x), f (α2 + δ2 x), . . . , f (αM + δM x)


Adjustable components have tunable internal parameters
They can express several shapes, not just one (fixed) shape
Each component is more flexible than a fixed component

→ Adjustable components enable to capture complex models with


fewer components (smaller M)
51 y ≈ α + ∑_{m=1}^{M} βm x^m
52 y ≈ α + ∑_{m=1}^{M} βm f(αm + δm x)
Emmanuel Flachaire Neural Networks and Deep Learning
Neural networks with several covariates

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network with several covariates

With a set of covariates X = (1, x1 , x2 , . . . , xk ), we have


y ≈ α + ∑_{m=1}^{M} βm f(αm + X δm)

The nonlinearity of the activation function f is essential,


otherwise it is a simple linear model in X
Combining several nonlinear functions f is essential to capture
interaction effects, M > 1, otherwise it is just a logit model53

By adding nonlinear functions of linear combinations of X , we


obtain a more flexible model, which is able to capture nonlinearity
and interaction effects

53
With M = 1 and the logistic activation function, it is a logit model
Emmanuel Flachaire Neural Networks and Deep Learning
Interaction effects

Adding two nonlinear functions can generate an interaction effect:


y ≈ α + ∑_{m=1}^{2} βm f(αm + x1 δm + x2 γm)

Let us consider α = 0, β1 = −β2 = 1/4, α1 = α2 = 0,
δ1 = δ2 = γ1 = −γ2 = 1 and f(z) = z², we have:

y ≈ 0 + ¼ (0 + x1 + x2)² − ¼ (0 + x1 − x2)²
  ≈ ¼ [(x1 + x2)² − (x1 − x2)²]
  ≈ x1 x2
≈ x1 x2
So the sum of two nonlinear transformations of linear functions can
give us an interaction! Here, we would always get a 2nd-degree
polynomial in X . Other activations do not have such a limitation.
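A quick numerical check of this identity in R (the two input values are arbitrary):

f  <- function(z) z^2
nn <- function(x1, x2) 0.25*f(x1 + x2) - 0.25*f(x1 - x2)   # sum of two squared linear combinations
x1 <- 1.7; x2 <- -0.4
nn(x1, x2)   # -0.68
x1 * x2      # -0.68, the interaction term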
Emmanuel Flachaire Neural Networks and Deep Learning
XOR: Exclusive or (true if its arguments differ)
Diagram of y ≈ α + ∑_{m=1}^{2} βm f(αm + x1 δm + x2 γm)

Source: Géron (2017, p.260)

With the step activation function (=1 if positive, 0 otherwise)


y ≈ −0.5 − I(x1 + x2 > 1.5) + I(x1 + x2 > 0.5)
With (0,0) or (1,1) we have -0.5, with (1,0) or (0,1) we have +0.5

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network with a single hidden layer

Source: James et al. (2021)


Diagram of y ≈ α + ∑_{m=1}^{M} βm f(αm + X δm) with M = 5 neurons
Emmanuel Flachaire Neural Networks and Deep Learning
Ridge regularization & standardization

NN tends to overfit due to large number of coefficients


A solution is to regularize similar to ridge regression:54
Minimize the SSR subject to ∑_{j=1}^{p} θj² ≤ c

The results are sensitive to the scale of the covariates


It is best to standardize covariates before using Neural
Networks, so that they are all on the same scale:
(x − x̄) / √Var(x)

54
θ is the set of coefficients α, β, δ
Emmanuel Flachaire Neural Networks and Deep Learning
Backpropagation algorithm

In 1986, Rumelhart et al. found a way to train neural networks,


with the backpropagation algorithm.55 Today, we would call it a
Gradient Descent using reverse-mode autodiff.

For each training instance:


1 the algorithm first makes a prediction (forward path)
2 measures the error
3 goes through each layer in reverse to measure the error
contribution from each connection (reverse pass)
4 slightly tweaks the connection weights to reduce the error
(Gradient Descent step)

55
Rumelhart et al.: Learning Internal Representations by Error Propagation
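To make the four steps concrete, here is a minimal R sketch of gradient descent for a single-hidden-layer network with one covariate and squared error loss (the initial values, M, the learning rate and the number of iterations are illustrative; real implementations rely on reverse-mode autodiff rather than hand-coded derivatives):

sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
n <- 200; x <- runif(n)
y <- sin(12*(x + .2))/(x + .2) + rnorm(n)/2
M <- 3; lr <- 0.05                                   # hidden units, learning rate
a <- 0; b <- rnorm(M); am <- rnorm(M); dm <- rnorm(M)
for (it in 1:5000) {
  Z <- outer(x, dm) + matrix(am, n, M, byrow = TRUE) # forward pass
  H <- sigmoid(Z)
  yhat <- a + as.vector(H %*% b)                     # prediction
  e <- yhat - y                                      # measure the error
  # reverse pass: gradients of 0.5*sum(e^2) with respect to each weight
  dH <- (e %o% b) * H * (1 - H)                      # n x M
  grad_a  <- sum(e);        grad_b  <- as.vector(t(H) %*% e)
  grad_am <- colSums(dH);   grad_dm <- colSums(dH * x)
  # gradient-descent step: slightly tweak the weights
  a  <- a  - lr*grad_a/n;   b  <- b  - lr*grad_b/n
  am <- am - lr*grad_am/n;  dm <- dm - lr*grad_dm/n
}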
Emmanuel Flachaire Neural Networks and Deep Learning
Application 1: Mincer equation
library(AER); data("CPS1985")
CPS1985$gender = as.numeric(CPS1985$gender)
library(neuralnet)
nn = neuralnet(log(wage) ~ education + experience + gender, data =
  CPS1985, hidden = 3, threshold = .05)
plot(nn)
(Figure: fitted network diagram — inputs education, experience and gender, one hidden layer with 3 neurons, output log(wage). Error: 49.601795, Steps: 1609)

Emmanuel Flachaire Neural Networks and Deep Learning


Application in classification
Logit model Neural network with 10 units

Source: Hastie, Tibshirani and Friedman (2009), based on simulated data

In classification, the softmax function is applied to the outputs


Emmanuel Flachaire Neural Networks and Deep Learning
Multilayer neural networks

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer neural networks

Even greater flexibility is achieved via composition of activation


functions:
y ≈ α + ∑_{m=1}^{M} βm f( αm^(2) + ∑_{p=1}^{P} f( αp^(1) + X δp^(1) ) δm^(2) )

where the inner layer of activations f(αp^(1) + X δp^(1)) replaces X

The composition of activation functions puts one additional


hidden layer between inputs and outputs → multi-layers NN

A NN with three hidden layers can be obtained by simply


repeating the procedure used to create the two layer basis.

Multilayer neural networks: when a NN has 2 or more hidden layers

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer neural networks

Source: James et al. (2021)

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer neural networks

From a single layer with many neurons to multiple layers with fewer neurons

James et al. (2021):

In theory a single hidden layer with a large number of


units/neurons has the ability to approximate most functions
However, the learning task of discovering a good solution is
made much easier with multiple layers each of modest size
Modern neural networks typically have more than one hidden
layer, and often many units/neurons per layer

Deep Neural Networks = Multilayer Neural Networks

Emmanuel Flachaire Neural Networks and Deep Learning


Application 1: Mincer equation
nn = neuralnet(log(wage) ~ education + experience + gender, data =
  CPS1985, hidden = c(3, 3), threshold = .05)
plot(nn)
nn = neuralnet(log(wage) ~ education + experience + gender, data =
  CPS1985, hidden = c(3, 3, 3), threshold = .05)
plot(nn)
(Figure: fitted network diagrams with hidden = c(3,3) and hidden = c(3,3,3). Left: Error: 47.665751, Steps: 23831; right: Error: 50.041345, Steps: 5234)

Emmanuel Flachaire Neural Networks and Deep Learning


Pattern recognition

Everything is just numbers:

Source: internet, link

An 18x18 pixel image can be seen as an array of 324 numbers that


represent how dark each pixel is (grayscale value in (0, 255))
A vector of these numbers can be used to feed a neural network
Emmanuel Flachaire Neural Networks and Deep Learning
MNIST handwritten digit dataset

Source: James et al. (2021)


Input vector X : p = 28 × 28 = 784 pixels
Output Y : class label Y = (Y0 , Y1 , . . . , Y9 )
60,000 training images and 10,000 test images
Emmanuel Flachaire Neural Networks and Deep Learning
Application 2: Handwritten digit recognition
# Source: section 10.9.2 in James et al. (2021)
library(keras)
# load the MNIST digit data
mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
# reshape images into matrices
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)
# rescale to the unit interval
x_train <- x_train / 255
x_test <- x_test / 255
# define the multilayer NN
modelnn <- keras_model_sequential()
modelnn %>%
  layer_dense(units = 256, activation = "relu",
              input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
summary(modelnn)
# add details to the model
modelnn %>% compile(loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
Emmanuel Flachaire Neural Networks and Deep Learning


Application 2: Handwritten digit recognition
# fit the NN with training data
system.time(
  history <- modelnn %>%
    fit(x_train, y_train, epochs = 30, batch_size = 128,
        validation_split = 0.2)
)
plot(history, smooth = FALSE)
# obtain the test error
accuracy <- function(pred, truth)
  mean(drop(pred) == drop(truth))
modelnn %>% predict(x_test) %>% max.col %>% accuracy(g_test + 1)

# fit a multinomial logit as a NN without hidden layer
modellr <- keras_model_sequential() %>%
  layer_dense(input_shape = 784, units = 10,
              activation = "softmax")
summary(modellr)
modellr %>% compile(loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
modellr %>% fit(x_train, y_train, epochs = 30, batch_size = 128,
  validation_split = 0.2)
modellr %>% predict(x_test) %>% max.col %>% accuracy(g_test + 1)

You may need to install Keras first:


install.packages("tensorflow")
install.packages("keras")
library(keras)
tensorflow::install_tensorflow()
tensorflow::tf_config()
install_keras()

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer NN for handwritten digit recognition

Source: James et al. (2021)

NN with 2 hidden layers L1 (256 units) and L2 (128 units)


235,146 coef in the NN and 7,065 in the multinomial logit56
To avoid overfitting, two forms of regularization are used

56 L1 : 785×256=200,960, L2 : 257×128=32,896 and the 10-output layer: 129×10=1,290
Emmanuel Flachaire Neural Networks and Deep Learning
Dropout regularization

Source: James et al. (2021)

New efficient form of regularization, inspired by random forest


Randomly remove a fraction of the units in a layer
In practice, randomly set the dropped out units to zero

Emmanuel Flachaire Neural Networks and Deep Learning


Limitations

Multilayer NN can model complex non-linear relationships


With very complex problems, such as detecting hundreds of types
of objects in high-resolution images, we need to train deeper NN:

perhaps 10 layers, each with hundreds of neurons, connected


by hundreds of thousands of connections
training a fully-connected DNN is very slow
severe risk of overfitting with millions of parameters
gradient problems make lower layers very hard to train

Solutions:

Convolutional Neural Networks (CNN or ConvNets)


Recurrent Neural Networks (RNN)

Emmanuel Flachaire Neural Networks and Deep Learning


Convolutional Neural Networks

Emmanuel Flachaire Neural Networks and Deep Learning


Pattern recognition

The network fails to recognize ’8’ when the letter is not centered
→ translation, scale and (small) rotation invariances are needed
The solution is convolution

Emmanuel Flachaire Neural Networks and Deep Learning


Convolutional Neural Network (CNN or ConvNet)58
Step 1: Break the image into overlapping image tiles and feed
each image tile into a small neural network with the same weights57

→ It amounts to using a sliding window over the entire picture


→ using the same small NN reduces the number of weights
→ same neural networks weights ≡ filter or convolution kernel
57
and the same activation function, ReLU=max(0,input), tanh or sigmoid
58
Source: Adam Geitgey link , Ujjwal Karn link , Andrej Karpathy link
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The convolution step 1
CNN exploit spatially local correlation: each neuron is locally-
connected (to only a small region of the input volume)

Source: Géron (2017)

→ Different values of weights will produce different feature maps


→ The convolution step plays like a filter
→ Different filters can detect different features from an image59
59
as for instances edges, curves, . . .
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The convolution step 1

Source: James et al. (2021)

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: The pooling step 2
Step 2: Reduce the size of the array, using a pooling algorithm.

2x2 pooling layer, no padding


Source: Géron (2017)

The pooling step reduces the dimensionality of each feature map


but retains the most important information60
Pooling can be of different types: Max, Average, Sum etc.
60
It is also called subsampling or downsampling step
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The pooling step 2

The function of pooling is to progressively reduce the spatial size


of the input representation. In particular, pooling:
makes the input representations smaller and more manageable
reduces the number of weights and links in the network,
therefore, controlling overfitting
makes the network invariant to small transformations,
distortions and translations in the input image61
helps us arrive at an almost scale invariant representation of
our image62

61
a small distortion in input will not change the output of Pooling – since we
take the maximum/average value in a local neighborhood
62
This is very powerful since we can detect objects in an image no matter
where they are located
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The classification step 3
Step 3: Make a final prediction with a fully-connected network

Source: James et al. (2021)

Feature extraction: use even more steps (hidden-layers) to extract


the useful features from the images. The more convolution steps
you have, the more complicated features your network will be able
to learn to recognize.
Classification: The purpose of the Fully Connected layer is to use
the high-level features for classifying the input image into classes
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: Intuitive principle
A hierarchy of representations with increasing level of abstraction:

Source: Internet

• Extract local features that depend on small subregions of the image


• Information from these features are merged to detect higher-order features

→ construction of complex objects from elementary parts


Image recognition: pixel → edge → texton → motif → part → object
Text: character → word → word group → clause → sentence → story
Speech: sample → spectral band → sound → . . . → phoneme → word

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: An example on number recognition

To understand how ConvNet works, play with this animation link

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: Performance in practice

Source: Hastie et al. (2016)


Convolutional Neural Networks outperform other methods
The number of weights in Net-5 is much less than in Net-1
ConvNet has been ”a revolution in Artificial Intelligence”
See the inaugural lesson of Yann LeCun at the College de France, in
English en or in French fr , and the review paper in Nature pdf

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: Detection in complex cases

See this animation link

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: Detection in complex cases

Source: Ren et al. (2016), https://arxiv.org/pdf/1506.01497v3.pdf


Emmanuel Flachaire Neural Networks and Deep Learning
Recurrent Neural Networks

Emmanuel Flachaire Neural Networks and Deep Learning


Nature of the data

Many data sources are sequential in nature:

In text analysis, the sequence and relative position of words


capture the narrative, theme and tone → Document
classification, sentiment analysis and language translation

Time series of temperature, rainfall, wind speed, air quality


and so on → Weather forecast

Time series of market indices, stock and bond prices and


exchange rates → Financial forecasting

In Recurrent Neural Network, the input object X is a sequence

Emmanuel Flachaire Neural Networks and Deep Learning


Recurrent Neural Networks (RNN)

Neural network with a single hidden layer, for t = 1, . . . , T :

yt ≈ α + At β

At = [f (α1 + Xt δ1 ), . . . , f (αM + Xt δM )]
A linear combination of a nonlinear fct of linear combinations of Xt
Recurrent neural network:
Each time series provides many short mini-series of input
sequences X = {X1 , . . . , XL } of L periods, and a target Y

At = [f (α1 + Xt δ1 + At−1 γ1 ), . . . , f (αM + Xt δM + At−1 γM )]

Identical weights for each sequence: α, δ, γ independent of t


A form of weight sharing similar to the use of filters in CNN

Emmanuel Flachaire Neural Networks and Deep Learning


Recurrent Neural Networks

Source: James et al. (2021)


Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Number of lags, training and test data

Recurrent neural network with one hidden layer, for t = 1, . . . , T :

yt ≈ α + At β

At = [f (α1 + Xt δ1 + At−1 γ1 ), . . . , f (αM + Xt δM + At−1 γM )]

Past values of yt and other covariates can be used in Xt


Select a number of lags L with care, perhaps using CV
Extract many short series of (y , X ) with a predefined length L
Each short series can be used to predict one value yt
The training data consists of n separate series of length L
The test data consists of the remaining series of length L
Find the set of coefficients minimizing the SSR (subject to a
constraint) based on the training and test data
Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Historical trading time series on the NYSE

Source: James et al. (2021)

→ forecast (log) trading volume over 1980–86 based on past history

Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Forecast trading volume based on past history

Source: James et al. (2021)


Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Autocorrelation function

Source: James et al. (2021)

T = 6051, L = 5, so 6046 short series (y , X ) are available


fit the model with 12 neurons and using 4281 training series
forecast 1765 values after January 2, 1980
Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Forecast of log trading volume on the NYSE

See section 10.9.6 in James et al. (2021) for details of the implementation in R
Emmanuel Flachaire Neural Networks and Deep Learning
RNN and AR models

Recurrent neural network with a single hidden layer:

yt ≈ α + At β

At = [f (α1 + Xt δ1 + At−1 γ1 ), . . . , f (αM + Xt δM + At−1 γM )]

Lag of the dependent variable yt−1 can be used in Xt


With M = 1, f linear and Xt = yt−1 , we have an AR(L):

yt ≈ β0 + yt−1 β1 + · · · + yt−L βL

RNN and AR models have much in common


By combining nonlinear functions (M > 1 and f nonlinear),
RNN add more flexibilty → nonlinear and interaction effects

Emmanuel Flachaire Neural Networks and Deep Learning


LSTM: Long and Short Term Memory model

My son is a manga fan, so our next holiday will be in . . .

RNN don’t predict Japan, since it doesn’t remember manga


RNN main limitation: short term memory
Solution: Combine 2 hidden layers, one with short memory
and the other one with longer memory
LSTM combine a long-term state c and a short-term state h

Emmanuel Flachaire Neural Networks and Deep Learning


LSTM vs. RNN

Emmanuel Flachaire Neural Networks and Deep Learning


LSTM

Géron (2017)

c: drop some memories ⊗ and add some new memories ⊕


Emmanuel Flachaire Neural Networks and Deep Learning
3. Using ML methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Using ML methods in Econometrics


General Principle

Machine Learning: solve the optimization problem


Minimize_m   ∑_{i=1}^{n} L(yi , m(Xi ))   +   λ ‖m‖ℓq
                 (loss function)              (penalization)

Choice of the loss function:


L → conditional mean, quantiles, classification
m → linear, splines, tree-based models, neural networks

Choice of the penalization:


`q → lasso, ridge
λ → over-fitting, under-fitting, cross validation

Emmanuel Flachaire Using ML methods in Econometrics


Ridge and Lasso

Minimize_β   ∑_{i=1}^{n} (yi − Xi β)²   +   λ ∑_{j=2}^{p} |βj |^q

It is equivalent to minimize the SSR subject to ∑_{j=2}^{p} |βj |^q ≤ c

The constraint restricts the magnitude of the coefficients


It shrinks the coefficients towards zero as c ↘ (or λ ↗)
Add some bias if it leads to a substantial decrease in variance
q = 2: Ridge, β̂ = (XᵀX + λIn )⁻¹ Xᵀy is defined even with p ≫ n
q = 1: Lasso sets some coef exactly to 0, variable selection

→ High-dimensional problems (p ≫ n)

Emmanuel Flachaire Using ML methods in Econometrics


Random Forest, Boosting, Deep learning

Minimize_m   ∑_{i=1}^{n} (yi − m(Xi ))²   +   λ ∫ m″(x)² dx

It is equivalent to minimize the SSR subject to ∫ m″(x)² dx ≤ c

A fully nonparametric model: y ≈ m(X1 , . . . , Xp )


The constraint restricts the flexibility of m
Choice of m: Random forest, boosting or deep learning
Similar to nonparametric econometrics (splines)
Appropriate with many covariates (no curse of dimensionality)

→ Complex functional form

Emmanuel Flachaire Using ML methods in Econometrics


Why and how to use ML methods in Econometrics?

Pros:
High-dimensional problems
Complex functional forms

However,
Black-box models
Prediction is not causation

Emmanuel Flachaire Using ML methods in Econometrics


3. Using ML methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Misspecification detection


A major criticism to econometrics

Léo Breiman (Statistical Science, 2001):

. . . an uncritical use of data models.

Emmanuel Flachaire Misspecification detection


Misspecification can lead to wrong conclusions

Let us assume that the true regression function is:

y = β0 + β1 x + β3 x 3 + ε (5)

A parametric test of the following hypotheses:

H0 : y = β0 + β1 x + ε vs. H1 : y = β0 + β1 x + β2 x 2 + ε

may not reject the null, since β2 = 0 is true in (5)

To the opposite, a test statistic based on

H0 : y = β0 + β1 x + ε vs. H1 : y = m(x) + ε

would likely reject the null

A nonparametric model is more appropriate under H1

Emmanuel Flachaire Misspecification detection


How machine learning tools may help econometrics?

Parametric model:
y = Xβ + ε
Fully-nonparametric model:

y = m(X ) + ε

Is the parametric regression model correctly specified?


If no, ML methods should outperform OLS estimation
If yes, ML methods should not outperform OLS estimation

ML can be used to detect misspecification

Emmanuel Flachaire Misspecification detection


Application 1: Boston housing prices

Boston housing dataset: 14 variables (2 dummies), 506


observations63
OLS in a linear regression model, p=13

medv = X β + ε

Lasso with squares, cubes and pairwise interactions, p=117

medv = X β1 + X 2 β2 + X 3 β3 + (X :X )β4 + ε

Random Forest and Boosting in a nonparametric model, p=13

medv = m(X ) + ε

We compute the MSE by 10-fold Cross-Validation

63
X = [chas,nox,age,tax,indus,rad,dis,lstat,crim,black,rm,zn,ptratio]
Emmanuel Flachaire Misspecification detection
Application 1: Boston housing prices
library(MASS); library(randomForest); library(gbm); library(glmnet)
data(Boston); nobs = nrow(Boston)
set.seed(12345); nfold = 10
Kfold = cut(seq(1, nobs), breaks = nfold, labels = FALSE)
mse.test = matrix(0, nfold, 4)
# generate X^2, X^3 and pairwise interactions for the Lasso
Xcol = colnames(Boston)[-14]
Xsqr = paste0("I(", Xcol, "^2)", collapse = "+")  # squared covariates
Xcub = paste0("I(", Xcol, "^3)", collapse = "+")  # cubic covariates
fmla = paste0("medv~(.)^2+", Xsqr, "+", Xcub)
X = model.matrix(as.formula(fmla), data = Boston)[, -1]
y = Boston[, 14]
mysample = sample(1:nobs)  # random sampling (permutation)
for (i in 1:nfold) {  # K-fold CV
  cat("K-fold loop:", i, "\r")
  test = mysample[which(Kfold == i)]
  train = mysample[which(Kfold != i)]
  # OLS, Lasso, Random Forest, Boosting
  fit.lm <- lm(medv ~ ., data = Boston, subset = train)
  fit.la <- cv.glmnet(X[train,], y[train], alpha = 1)
  fit.rf <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 6)
  fit.bo <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian",
                interaction.depth = 6)
  # out-sample MSE
  mse.test[i, 1] = mean((Boston$medv - predict(fit.lm, Boston))[-train]^2)
  mse.test[i, 2] = mean((y - predict(fit.la, X, s = "lambda.min"))[-train]^2)
  mse.test[i, 3] = mean((Boston$medv - predict(fit.rf, Boston))[-train]^2)
  mse.test[i, 4] = mean((Boston$medv - predict(fit.bo, Boston))[-train]^2)
}
mse = colMeans(mse.test)  # test error
round(mse, digits = 2)
[1] 23.93 14.88 10.16 10.34

Emmanuel Flachaire Misspecification detection


Application 1: Boston housing prices

Boston housing dataset:64


                  ols    lasso (x², x³, int)    r.forest    boosting
MSE (10-fold CV)  23.93         14.88             10.16       10.34

Random Forest and Boosting show impressive improvement


over OLS, in terms of predictive performance
ML models are known to capture complex functional forms
It suggests that the parametric model lacks important
nonlinear and/or interaction effects
Lasso provides substantial improvement over OLS, but still
performs worse than Random Forest and Boosting. It
suggests that some nonlinearities are still not well captured.
64
14 variables (2 dummies), 78 pairwise interactions, 506 observations
Emmanuel Flachaire Misspecification detection
GamLa: An econometric model for interpretable ML
A partially linear model:

y = g1 (X1 ) + . . . + gp (Xp ) + Z γ + ε

with Z a matrix of pairwise interactions Z = (X1 X2 , . . . , Xq−1 Xq ).


The marginal effect is:
∂y/∂Xj = g′j (Xj ) + c

where c is a constant term which depends on the other covariates.


Combine non-linearity in Xj and linear pairwise interactions
The linearity assumption on interaction effects represents the
price to pay to keep the model interpretable.

→ GamLa = GAM + variable selection (Lasso, Autometrics)65


65
Flachaire, Hacheme, Hué, Laurent (2022)
Emmanuel Flachaire Misspecification detection
GamLa: An econometric model for interpretable ML

A partially linear model:

y = g1 (X1 ) + . . . + gp (Xp ) + Z γ + ε

Estimation based on the Double Residuals (DR) method:


1 GAM of y on X1 , . . . , Xp : compute the residuals η̂y
2 GAM of Zj on X1 , . . . , Xp : compute the residuals η̂zj , ∀j
3 LASSO of η̂y on η̂z1 , . . . , η̂zl → obtain γ̂
An application of FWL to semiparametric regression models

Robinson (1988) shows that with DR γ̂ols is √n-consistent,
even if ĝ1 (X ), . . . , ĝp (X ) are consistent at slower rates
Flachaire, Hacheme, Hué and Laurent (2022) show that using
the DR approach is crucial to select correctly the
interactions66
66
So don’t use the gamlasso function in the R package plsmselect!
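A hedged sketch of the three DR steps with mgcv and glmnet (the covariate names X1, X2, X3, the interaction matrix Z and the data frame df are placeholders; this illustrates the procedure, not the authors' code, and uses a cross-validated λ for simplicity):

library(mgcv); library(glmnet)
# Step 1: GAM of y on X1,...,Xp -> residuals eta_y
g_y   <- gam(y ~ s(X1) + s(X2) + s(X3), data = df)
eta_y <- residuals(g_y)
# Step 2: GAM of each interaction Zj on X1,...,Xp -> residuals eta_zj
eta_Z <- apply(Z, 2, function(z)
  residuals(gam(z ~ s(X1) + s(X2) + s(X3), data = df)))
# Step 3: Lasso of eta_y on the eta_zj -> selected interactions and gamma_hat
cvfit     <- cv.glmnet(eta_Z, eta_y, alpha = 1)
gamma_hat <- coef(cvfit, s = "lambda.min")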
Emmanuel Flachaire Misspecification detection
Application 1: Boston housing prices

Boston housing dataset:


                  ols    lasso (x², x³, int)    r.forest    gamla
MSE (10-fold CV)  23.93         14.88             10.16      9.73

GamLa shows impressive improvement over OLS, in terms of


predictive performance
GamLa performs as well as Random Forest and Boosting67
It suggests that parametric models are outperformed by ML
models when they lack important nonlinear and/or interaction
effects only

67
Model Confidence Set (MCS) test can be used to test if the MSE are
significantly different (Hansen, Lunde and Nason 2011) pdf Pairwise AUC
can be used in classification (Candelon, Dumitrescu and Hurlin 2012) pdf
Emmanuel Flachaire Misspecification detection
Conclusion

Many results report that ML outperform parametric models in


terms of predictive performance

ML models outperform standard parametric model ... which


are not well-specified!
ML methods can help to detect and correct misspecification in
parametric regression
Parametric models can perform as well as ML models!

Emmanuel Flachaire Misspecification detection


3. Using ML methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Causal Machine learning


Prediction is not causation

Kleinberg et al. (2015) Prediction policy problems


Many policy applications where causal inference is not central
Hip or knee replacement: costly, painful, recovery takes time
Policy decision: predicting the riskiest patients68

Athey (2017) Beyond prediction: Using big data for policy problems
Pure prediction methods are not helpful for causal problems
Which patients should be given priority to receive surgery?
Estimating heterogeneity in the effect of surgery is required

68
ML is used to predict the probability that a candidate would die within a year from other causes, identifying high-risk patients who should not receive surgery
Emmanuel Flachaire Causal Machine learning
High-dimensional parametric framework

Emmanuel Flachaire Causal Machine learning


Inference on target regression coefficients

Our main concern is the estimation and inference on α in a


high-dimensional framework:

y = dα + X β + ε

d is a target regressor, such as a treatment, policy or other variable of interest


X may contain many variables, only a few of which are important
Under sparsity, a variable selection method is used in a first step
Since Lasso shrinks coefficients towards zero, the coefficients are biased
This bias is corrected using an additional (unrestricted) estimation

→ Post-selection estimation and inference

Emmanuel Flachaire Causal Machine learning


The problem of post-selection inference

Single Selection: OLS of y on d and the selected variables X ∗

y = dα + X ∗ β + ε

Unbiased α̂ ... but only if the true model is selected!


Problem: mistakes from the variable selection can introduce
omitted variable bias
a covariate Xj strongly correlated with d but without a strong effect on y may be omitted in the variable selection process

Ignoring variable selection uncertainty may be misleading

→ Naive post-selection estimation may be biased

Emmanuel Flachaire Causal Machine learning


Post-selection inference: Double selection
Our main concern is estimation and inference on α in
y = dα + X β + ε
Double Selection:69
1 Lasso of y on X : select variables important to predict y
2 Lasso of d on X : select variables important to predict d
3 OLS of y on d and the union of the selected variables

y = dα + X ∗∗ β + ε

Idea: give a second chance to variables omitted in the first Lasso


α̂ is immunized against variable selection mistakes
→ valid post-selection inference in high-dimensions

69
Belloni, Chernozhukov and Hansen (2014) pdf Uniformly valid confidence
set for α despite imperfect model selection, and full efficiency for estimating α
Emmanuel Flachaire Causal Machine learning
Post-selection inference: Partialling out

Our main concern is estimation and inference on α in


y = dα + X β + ε
Partialling out:
1 Lasso of y on X : compute the residuals η̂y
2 Lasso of d on X : compute the residuals η̂d
3 OLS of η̂y on η̂d (double residuals approach)

η̂y = η̂d α + ε

Idea: an application of the Frisch-Waugh-Lovell theorem70

→ Partialling out and double selection are quite similar71


70
But α̂ols is different in the two models due to lasso variable selections
71
From the FWL theorem, the double selection estimator of α is equal to the OLS estimator from regressing the residuals of y on X∗∗ on the residuals of d on X∗∗.
Emmanuel Flachaire Causal Machine learning
Threshold selection: Rigorous Lasso

The choice of the penalization parameter λ is crucial


The optimal λ for prediction and for estimation are different
CV targets prediction and lacks theoretical foundations
A theoretically grounded and feasible selection for estimation (illustrated in the sketch below):72

λ = 2c√n σ̂ Φ⁻¹(1 − γ/(2p))

in the case of homoskedasticity


Another selection rule is proposed for the heteroskedastic case

72
See Belloni, Chernozhukov and Hansen (2014) pdf c = 1.1 for
post-Lasso and c = 0.5 for Lasso, γ = .1 by default
Emmanuel Flachaire Causal Machine learning
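As a small numerical illustration, the plug-in rule above can be coded directly; the function name is ours and the inputs are arbitrary, with c = 1.1 and γ = 0.1 taken from the footnote.

# Hypothetical helper computing the plug-in ("rigorous") penalty level under
# homoskedasticity; defaults c = 1.1 and gamma = 0.1 follow the footnote above.
rigorous_lambda <- function(n, p, sigma_hat, c = 1.1, gamma = 0.1) {
  2 * c * sqrt(n) * sigma_hat * qnorm(1 - gamma / (2 * p))
}
# e.g. with n = 90 observations, p = 63 candidate controls and sigma_hat = 0.05:
rigorous_lambda(n = 90, p = 63, sigma_hat = 0.05)

In practice, the rlasso functions of the hdm package used below implement this penalty choice, as well as its heteroskedasticity-robust version, internally.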
Bias of naive post-selection estimation

Source: Belloni, Chernozhukov, Hansen (2014)

Emmanuel Flachaire Causal Machine learning


Application 1: Do poor countries catch up with rich countries?

We are interested in the convergence hypothesis α < 0 in

y = dα + X β + ε

where y is the growth rate of GDP, d is the initial level of GDP


and X contains many country characteristics

The parameter of interest is α


We test the null hypothesis H0 : α = 0
If H0 is rejected and α < 0: evidence of catch-up effect
Covariate selection is crucial, since p = 63 and n = 90
We use double selection and partialling out with rigorous Lasso
Implementation is done with the R package hdm73
73
see the vignette of the hdm package in R pdf

Emmanuel Flachaire Causal Machine learning


Application 1: Do poor countries catch up with rich countries?

1  library(hdm)
2  data("GrowthData")   # the 2nd column is a vector of ones
3  y = as.matrix(GrowthData)[, 1, drop=F]
4  d = as.matrix(GrowthData)[, 3, drop=F]
5  X = as.matrix(GrowthData)[, -c(1, 2, 3), drop=F]
6  # fit models
7  LS.fit = lm(y ~ d + X)
8  PO.fit = rlassoEffect(X, y, d, method="partialling out")
9  DS.fit = rlassoEffect(X, y, d, method="double selection")
10 # inference on coef of interest
11 LS = summary(LS.fit)$coefficients[2, ]
12 PO = summary(PO.fit)$coefficients[1, ]
13 DS = summary(DS.fit)$coefficients[1, ]
14 rbind(ols=LS, double.selection=DS, partialling.out=PO)

15                   Estimate      Std. Error  t value   Pr(>|t|)
16 ols               -0.009377989  0.02988773  -0.31377  0.75601
17 double.selection  -0.050005855  0.01579138  -3.16665  0.00154
18 partialling.out   -0.049811465  0.01393636  -3.57420  0.00035

Emmanuel Flachaire Causal Machine learning


Application 1: Do poor countries catch up with rich countries?

Inference on the parameter of interest α:


                   Estimate   Std.Error   t value    Pr(>|t|)
OLS                -0.00938   0.02989     -0.31377   0.75601
Double selection   -0.05001   0.01579     -3.16665   0.00154
Partialling out    -0.04981   0.01394     -3.57420   0.00035

H0 : α = 0 not rejected with OLS (large standard error)74


H0 : α = 0 rejected with double selection and partialling out
- more precise estimate (smaller standard error)
- greater magnitude of the coefficient
Poor countries tend to catch up with rich countries!
Note that Single Selection (naive post-selection) sets α̂ = 0:
15 rlasso(y ~ d + X, post=TRUE)$coefficients[2]

74
It is not surprising given that p = 63 is comparable to n = 90.
Emmanuel Flachaire Causal Machine learning
Heterogeneous treatment effects: high-dimensions
If d is a treatment, we can consider heterogeneous effects as

y = dα(X ) + g (X ) + ε

where α(X ) and g (X ) are approximated by linear combinations of


X or transformations of X , α(X ) = Z1 β and g (X ) = Z2 γ.75
The regression can be rewritten: y = dZ1 β + Z2 γ + ε
Several parameters of interest in β
Double Selection:
1 Lasso of y on Z2 : select variables important to predict y
2 Lasso of each interaction dZ1 on Z2 : select important variables
3 OLS of y on d and the union of the selected variables

→ assess heterogeneity with many determinants


75
Z1 and Z2 may include powers, b-splines, or interactions of X
Emmanuel Flachaire Causal Machine learning
Application 2: The effect of gender on wage

Several parameters of interest:

y = dα + dX β + Z γ + ε
y is the log of the wage, d is a dummy for female
dX are the interactions between d and each covariate in X
Z includes 2-way interactions of the covariates, Z = [X, X:X]
The target variable is female d, in combination with other
variables dX
Our main interest is to make inference on α and β
If β = 0: homogeneous wage gender gap given by α
If β ≠ 0: heterogeneous wage gender gap explained by X

Data: US Census in 2012, with p = 116 and n = 29,217 observations76

76
for a recent application see Bach, Chernozhukov and Spindler (2021) pdf

Emmanuel Flachaire Causal Machine learning


Application 2: The effect of gender on wage

1 library(hdm)
2 data(cps2012)
3 y <- cps2012$lnw
4 X <- model.matrix(~ -1 + female + female:(widowed + divorced +
      separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we +
      exp1 + exp2 + exp3) + (widowed + divorced + separated + nevermarried
      + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2,
      data = cps2012)
5 X <- X[, which(apply(X, 2, var) != 0)]   # exclude constant variables
6 index.gender <- grep("female", colnames(X))
7 effects.female <- rlassoEffects(x=X, y=y, index=index.gender)
8 summary(effects.female)

Generic approach to generate all covariates:


9  Xcol = colnames(cps2012)[4:18]
10 dcol = colnames(cps2012)[3]
11 Xvar = paste(Xcol, collapse = "+")
12 Xint = paste("(", paste(Xcol, collapse="+"), ")^2", sep="")
13 fmla = paste("~-1+", dcol, "+", dcol, ":(", Xvar, ")+", Xint, sep="")
14 X <- model.matrix(as.formula(fmla), data=cps2012)

Emmanuel Flachaire Causal Machine learning


Application 2: The effect of gender on wage
15 > summary(effects.female)
16 [1] "Estimates and significance testing of the effect of target variables"
17                       Estimate.  Std. Error  t value  Pr(>|t|)
18 female               -0.154923    0.050162   -3.088   0.002012 **
19 female:widowed        0.136095    0.090663    1.501   0.133325
20 female:divorced       0.136939    0.022182    6.174   6.68e-10 ***
21 female:separated      0.023303    0.053212    0.438   0.661441
22 female:nevermarried   0.186853    0.019942    9.370   < 2e-16 ***
23 female:hsd08          0.027810    0.120914    0.230   0.818092
24 female:hsd911        -0.119335    0.051880   -2.300   0.021435 *
25 female:hsg           -0.012890    0.019223   -0.671   0.502518
26 female:cg             0.010139    0.018327    0.553   0.580114
27 female:ad            -0.030464    0.021806   -1.397   0.162405
28 female:mw            -0.001063    0.019192   -0.055   0.955811
29 female:so            -0.008183    0.019357   -0.423   0.672468
30 female:we            -0.004226    0.021168   -0.200   0.841760
31 female:exp1           0.004935    0.007804    0.632   0.527139
32 female:exp2          -0.159519    0.045300   -3.521   0.000429 ***
33 female:exp3           0.038451    0.007861    4.891   1.00e-06 ***

→ smaller gender wage gap for nevermarried or divorced women

Emmanuel Flachaire Causal Machine learning


Non-parametric framework

Emmanuel Flachaire Causal Machine learning


Homogeneous treatment effects: Partially linear model

Partially Linear Regression (PLR) model


y = dα + g (X ) + ε
d = h(X ) + η
α is the target parameter, g and h are nuisance functions77
Naive ML approach:
1 ML of y − d α̂ on X → obtain ĝ (X )
2 OLS of y − ĝ (X ) on d → obtain α̂
Initialize with α̂ = 0 and iterate until convergence
However, α̂ is biased, because ĝ is not a good estimate of g 78

77
h may seem redundant; it is the propensity score in the TE literature
78
Since E(y|X) ≠ g(X), an ML fit of y on X is not a good estimate of g
Emmanuel Flachaire Causal Machine learning
Homogeneous treatment effects: Partially linear model
Partially Linear Regression (PLR) model
y = dα + g (X ) + ε
d = h(X ) + η
α is the target parameter, g and h are nuisance functions
Double Residuals (DR):
1 ML of y on X : compute residuals η̂y = y − ĝ (X )
2 ML of d on X : compute residuals η̂d = d − ĥ(X )
3 OLS of η̂y on η̂d → α̂
An application of FWL, or partialling out, with ML methods

Robinson (1988) shows that with DR α̂ is √n-consistent, even if ĝ(X) and ĥ(X) are consistent at slower rates79
The role of DR is to immunize α̂ against errors in the ML estimates: α̂ is based on the residuals η̂y and η̂d, which are ⊥ to ĝ(X) and ĥ(X) (a hand-rolled sketch follows below)
79
Robinson considers kernel regression. Chernozhukov et al. (2018) pdf
establish that any ML method can be used, so long as it is n^(1/4)-consistent
Emmanuel Flachaire Causal Machine learning
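A hand-rolled sketch of the DR idea with a random forest learner is given below; the simulated data, the choice of learner (ranger) and the use of out-of-bag predictions are illustrative assumptions, not the procedure of any particular paper.

# Minimal Double Residuals (DR) sketch with random forests (ranger) on
# simulated data; out-of-bag predictions are used as a crude guard against
# overfitting (proper Double ML adds cross-fitting, see next slide).
library(ranger)
set.seed(42)
n <- 2000
X <- matrix(runif(n * 5), n, 5)
d <- sin(2 * pi * X[, 1]) + rnorm(n, sd = 0.5)       # d = h(X) + eta
y <- 0.5 * d + X[, 2]^2 + exp(X[, 3]) + rnorm(n)     # true alpha = 0.5
dX <- data.frame(X)                                  # columns X1,...,X5

fit_y <- ranger(y ~ ., data = data.frame(y = y, dX)) # Step 1: ML of y on X
eta_y <- y - fit_y$predictions                       # residuals (OOB)
fit_d <- ranger(d ~ ., data = data.frame(d = d, dX)) # Step 2: ML of d on X
eta_d <- d - fit_d$predictions                       # residuals (OOB)
summary(lm(eta_y ~ eta_d))                           # Step 3: OLS -> alpha_hat

The coefficient on η̂d should be close to the true α = 0.5 here; a regression of y on d and X by OLS, or the naive ML approach above, would not enjoy the same protection.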
The role of double residuals (orthogonalization)
Distribution of α̂ − α0

Source: Chernozhukov et al. (2018)

Non-orthogonal ≡ naive ML Orthogonal ≡ Double Residuals

Emmanuel Flachaire Causal Machine learning


Homogeneous treatment effects: Partially linear model
Partially Linear Regression (PLR) model
y = dα + g (X ) + ε
d = h(X ) + η
α is the parameter of interest, g and h are nuisance functions
Cross-fitting: split the sample into an auxiliary and a main sample
1 ML estimation of g (X ), h(X ) on auxiliary sample
2 Double Residuals estimation of α by OLS on main sample
3 Flip the roles of both samples and average the results: α̂ = (α̂1 + α̂2)/2
Estimate nuisance functions and target parameter on different samples
Chernozhukov et al. (2018) show that cross-fitting is crucial to avoid overfitting

→ PLR: Double ML = Double Residuals + Cross-fitting (see the two-fold sketch below)

Emmanuel Flachaire Causal Machine learning
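A two-fold version of the previous sketch, adding the sample-splitting step described above; again, the data-generating process and learner choices are purely illustrative.

# Two-fold cross-fitting on top of Double Residuals (illustrative sketch).
library(ranger)
set.seed(7)
n <- 2000
X <- matrix(runif(n * 5), n, 5)
d <- sin(2 * pi * X[, 1]) + rnorm(n, sd = 0.5)
y <- 0.5 * d + X[, 2]^2 + rnorm(n)                    # true alpha = 0.5
dX <- data.frame(X)

fold  <- sample(rep(1:2, length.out = n))             # split the sample in two
alpha <- numeric(2)
for (k in 1:2) {
  aux  <- fold != k                                   # estimate nuisances here
  main <- fold == k                                   # estimate alpha here
  g_hat <- ranger(y ~ ., data = data.frame(y = y, dX)[aux, ])
  h_hat <- ranger(d ~ ., data = data.frame(d = d, dX)[aux, ])
  eta_y <- y[main] - predict(g_hat, dX[main, ])$predictions
  eta_d <- d[main] - predict(h_hat, dX[main, ])$predictions
  alpha[k] <- coef(lm(eta_y ~ eta_d))["eta_d"]
}
mean(alpha)                                           # cross-fitted estimate of alpha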


Heterogeneous treatment effects: Fully nonparametric
Interactive Regression Model (IRM)
y = m(d, X ) + ε
d = h(X ) + η
d not additively separable → very general heterogeneity in TE
Parameter of interest: ATE = E[y1 − y0 ] 80

The estimator needs to satisfy a Neyman-orthogonality condition with respect to the nuisance functions (≡ DR in the PLR)
So the estimator and inference are robust to small mistakes in the nuisance functions
The AIPW estimator turns out to satisfy this ⊥ condition:
 
ATE = E[ m(1, X) − m(0, X) + D(Y − m(1, X))/h(X) − (1 − D)(Y − m(0, X))/(1 − h(X)) ]

This estimator is doubly robust to small mistakes in m̂ and ĥ (a short R sketch of this score follows below)


80
The observed outcome is with or without treatment: y = y1 d + y0 (1 − d)
Emmanuel Flachaire Causal Machine learning
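The AIPW score above can be written as a short R function; m1, m0 and h are assumed to be (cross-fitted) predictions of m(1, X), m(0, X) and of the propensity score obtained beforehand, and the standard error uses the usual influence-function approximation.

# Sketch of the AIPW (doubly robust) score; m1, m0 and h are assumed to be
# cross-fitted predictions of m(1,X), m(0,X) and of the propensity score,
# D and Y are the treatment and outcome vectors; trimming is omitted.
aipw_ate <- function(Y, D, m1, m0, h) {
  psi <- m1 - m0 + D * (Y - m1) / h - (1 - D) * (Y - m0) / (1 - h)
  c(ATE = mean(psi), se = sd(psi) / sqrt(length(psi)))
}

This is essentially the orthogonal score that the IRM model of the DoubleML package, used in Application 3 below, averages over the cross-fitting folds.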
Heterogeneous treatment effects: Fully nonparametric

Interactive Regression Model (IRM)


y = m(d, X ) + ε
d = h(X ) + η
d not additively separable → very general heterogeneity in TE
Double Machine Learning:81
1 Neyman orthogonal condition → AIPW estimator
2 Cross-fitting → ATE and m, h estimated from different samples

ATE estimation and inference with good properties


However, no detection and analysis of heterogeneity
→ IRM: Double ML = AIPW + Cross-fitting

81
Chernozhukov et al. (2018) pdf and Chernozhukov et al. (2017) pdf

Emmanuel Flachaire Causal Machine learning


Application 3: Insurance bonus on employment duration

RCT to investigate the incentive effect of unemployment


insurance (UI) bonus on unemployment duration:
Individuals in the treatment groups were offered a cash bonus if they
found a job within some pre-specified period of time (qualification
period), provided that the job was retained for a specified duration

y is the log of duration of unemployment for the UI claimants


ATE estimation and inference in PLR and IRM models
Pennsylvania Reemployment Bonus data set
Implementation is done with the R package DoubleML83

83
See vignette and Bach, Chernozhukov, Kurz, Spindler (2021) pdf
Emmanuel Flachaire Causal Machine learning
Application 3: ATE in a PLR model
1  library(DoubleML)
2  library(mlr3)
3  # Initialization of the Data-Backend
4  data = fetch_bonus(return_type="data.table")
5  y = "inuidur1"
6  d = "tg"
7  x = c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4",
        "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd")
8  dml_data = DoubleMLData$new(data, y_col=y, d_cols=d, x_cols=x)
9  # Initialization of the PLR Model
10 set.seed(31415)   # required to replicate sample split
11 learner_g = lrn("regr.ranger", num.trees=500, min.node.size=2,
        max.depth=5)   # Random Forest from the ranger package
12 learner_m = lrn("regr.ranger", num.trees=500, min.node.size=2,
        max.depth=5)
13 dml_plr = DoubleMLPLR$new(dml_data,
14                           ml_m = learner_m,
15                           ml_g = learner_g,
16                           score = "partialling out",
17                           n_folds = 5, n_rep = 1)
18 # Perform the ATE estimation and print the results
19 dml_plr$fit()
20 dml_plr$summary()

Emmanuel Flachaire Causal Machine learning


Application 3: ATE in a PLR model

20 > dml_plr$summary()
21 Estimates and significance testing of the effect of target variables
22    Estimate.  Std. Error  t value  Pr(>|t|)
23 tg -0.07396   0.03540     -2.089   0.0367 *
24 ---
25 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hence, we can reject H0 : α = 0 at the 5% significance level


It is consistent with the findings of previous studies that have
analysed the Pennsylvania Bonus Experiment
The ATE on unemployment duration is negative and
significant

Emmanuel Flachaire Causal Machine learning


Application 3: ATE in an IRM model
26 ## Initialization of the IRM Model
27 # Classifier for propensity score
28 learner_classif_m = lrn("classif.ranger", num.trees = 500,
        min.node.size = 2, max.depth = 5)
29 dml_irm = DoubleMLIRM$new(dml_data,
30                           ml_m = learner_classif_m,
31                           ml_g = learner_g,
32                           score = "ATE",   # or "ATTE"
33                           n_folds = 10, n_rep = 1)
34 # Perform the estimation and print the results
35 dml_irm$fit()
36 dml_irm$summary()

37 Estimates and significance testing of the effect of target variables
38    Estimate.  Std. Error  t value  Pr(>|t|)
39 tg -0.07345   0.03549     -2.069   0.0385 *
40 ---
41 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated coefficient is very similar to the estimate of the ATE


in a PLR model and the conclusions remain unchanged.
Emmanuel Flachaire Causal Machine learning
Estimation of heterogeneity: Causal Forest

Causal Random Forest:84


Random Forest is modified to estimate the CATE directly
Grow a tree and evaluate its performance based on TE
heterogeneity rather than predictive accuracy
The idea is to find leaves where the treatment effect is
constant but different from other leaves
Split criterion: maximize heterogeneity in TE between leaves
Honest tree: build the tree and estimate the CATE from different samples

→ valid estimation and confidence intervals for CATE85 (see the grf sketch below)

84
Wager and Athey (2018) pdf , Athey, Tibshirani and Wager (2019) pdf
85
RF predictions are asymptotically unbiased and Gaussian, but convergence rates are below √n and they do not account for the uncertainty due to sample splitting
Emmanuel Flachaire Causal Machine learning
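A minimal causal forest sketch with the grf package on simulated data; the data-generating process and sample sizes are illustrative.

# Illustrative causal forest with grf on simulated data.
library(grf)
set.seed(1)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)                           # randomized treatment
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)     # heterogeneous effect in X1
cf <- causal_forest(X, Y, W)                     # honest trees by default
average_treatment_effect(cf)                     # ATE with standard error
head(predict(cf, estimate.variance = TRUE))      # CATE and variance estimates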
Detection and analysis of heterogeneity: Generic ML

Generic Machine Learning:86


Do not attempt to get valid estimation and inference on the
CATE itself, but on features of the CATE
Obtain ML proxy predictor of CATE (auxiliary set) and target
features of CATE based on this proxy predictor (main set)

Main interests:
Test if there is evidence of heterogeneity (BLP)

ATE for the 20% most (least) affected individuals? (GATES)

Which covariates are associated with TE heterogeneity? (CLAN)

→ valid estimation and inference on features of CATE

86
Chernozhukov, Demirer, Duflo and Fernández-Val (2020) pdf

Emmanuel Flachaire Causal Machine learning


Generic ML: Proxies of CATE
The main idea is to compute imperfect predictions of CATE and to
use them as proxies to make inferences on features of CATE:
Split the sample into a main set and auxiliary set (50/50 split)
Fit y ≈ m(1, X ) with treated group from the auxiliary sample
Fit y ≈ m(0, X ) with control group from the auxiliary sample
Compute Ŝ(Xi) = m̂(1, Xi) − m̂(0, Xi) from the main sample
Ŝ(X) is used to learn about treatment effect heterogeneity (a sketch of this construction follows below)

To control the uncertainty due to data splitting, this process is


done many times → cross-fitting87
The Ŝ(Xi) are imperfect predictions of CATEi → proxies88

87
We randomly split the sample M times. The parameter estimates,
confidence bounds, and p-values reported are the medians across M splits.
88
CATEi = E[y1 − y0 |Xi ] = m(1, Xi ) − m(0, Xi )
Emmanuel Flachaire Causal Machine learning
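A sketch of this proxy construction for a single split, using a random forest as the ML learner; the data and names are illustrative (the GenericML package mentioned in the conclusion automates the whole procedure, including the repeated splits).

# Proxy-CATE construction for one split (illustrative).
library(ranger)
set.seed(123)
n <- 4000
X <- matrix(runif(n * 5), n, 5)
D <- rbinom(n, 1, 0.5)
Y <- X[, 1] * D + X[, 2] + rnorm(n)
dat  <- data.frame(Y = Y, X)
aux  <- sample(n, n / 2)                         # auxiliary set
main <- setdiff(seq_len(n), aux)                 # main set

fit1 <- ranger(Y ~ ., data = dat[intersect(aux, which(D == 1)), ])  # m(1, X)
fit0 <- ranger(Y ~ ., data = dat[intersect(aux, which(D == 0)), ])  # m(0, X)
S <- predict(fit1, dat[main, ])$predictions -
     predict(fit0, dat[main, ])$predictions      # proxy CATE S_hat(X_i)
summary(S)   # used on the main set for the BLP, GATES and CLAN analyses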
Causal Machine Learning: A brief roadmap

Source: Gaillac and L’Hour (2021)

Emmanuel Flachaire Causal Machine learning


Underlying assumptions
Standard hypotheses: SUTVA, CIA and CSC
Common support condition (CSC): 0 < P(di = 1|Xi = x) < 1
ML estimation often provides better predictions
Adding covariates makes matching more difficult

Strittmatter and Wunsch (2021) The gender pay gap revisited with
big data: Do methodological choices matter?
Trimming in experiments vs. decomposition methods
→ Beware of CSC when moving away from RCT framework
Emmanuel Flachaire Causal Machine learning
Conclusion

The impact of ML for public policy evaluation:

Dealing with many covariates (p ≫ n)


Relying less on a priori specification
Taking care of heterogeneity
However, do not forget underlying assumptions! (CSC)

A technical literature, but implementation is becoming easier:


- Double Lasso: R package hdm
- Double Machine Learning: R package DoubleML
- Generic Machine Learning: R package GenericML
- Generalized Random Forest: R package grf

An effervescent empirical and theoretical literature

Emmanuel Flachaire Causal Machine learning


Selected references in Causal ML
Athey (2017) Beyond prediction: Using big data for policy problems, Science
Athey (2018) The impact of machine learning on economics
Athey, Tibshirani and Wager (2019) Generalized random forest, Ann. Statis.
Bach, Chernozhukov and Spindler (2021) Closing the U.S. gender wage gap
requires understanding its heterogeneity, arXiv:1812.04345
Belloni, Chernozhukov and Hansen (2014) Inference on treatment effects after
selection amongst high-dimensional controls, REStud
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey and Robins (2018)
Double/debiased ML for treatment and structural parameters. Econometrics J.
Chernozhukov, Demirer, Duflo and Fernández-Val (2020) Generic ML inference
on heterogeneous treatment effects in randomized experiments, arXiv:1712.04802
Gaillac and L’Hour (2020) Machine Learning for Econometrics, Lecture notes
Kleinberg, Ludwig, Mullainathan and Obermeyer (2015) Prediction Policy
Problems, AER P&P
L’Hour (2020), L’économétrie en grande dimension. INSEE M2020-01
Strittmatter (2020) What is the value added by using causal machine learning
methods in a welfare experiment evaluation.
Strittmatter and Wunsch (2021) The gender pay gap revisited with big data:
Do methodological choices matter? arXiv:2102.09207
Wager and Athey (2018) Estimation and inference of heterogeneous treatment
effects using random forests. JASA

Emmanuel Flachaire Causal Machine learning


References
Berk (2016) Statistical Learning from a Regression Perspective.
Springer Texts in Statistics, ch.3-7
Charpentier (2018), Classification from scratch website

Charpentier, Flachaire and Ly (2018), Econometrics and Machine


Learning, Economics and Statistics, 505 english french

Efron and Hastie (2016) Computer Age Statistical Inference,


Cambridge University Press, ch.17-19 pdf
Hastie, Tibshirani and Friedman (2009) The Elements of Statistical
Learning. Springer, ch.7, 9-12, 15-16 website pdf

Hastie, Tibshirani and Wainwright (2015) Statistical Learning with


Sparsity: The Lasso and Generalizations. CRC Press, website pdf

James, Witten, Hastie and Tibshirani (2021) An Introduction to


Statistical Learning. Springer, ch.2,5,6,8,9 website pdf

Watt, Borhani and Katsaggelos (2016) Machine Learning Refined.


Cambridge University Press, ch.1-6
Emmanuel Flachaire Causal Machine learning
