
Econometrics & Machine Learning

Emmanuel Flachaire
Aix-Marseille University, Amse

https://egallic.fr/ECB

December 21, 2021

Emmanuel Flachaire Econometrics & Machine Learning


Econometrics & Machine Learning

1 Introduction and General Principles


The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation
2 Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning
3 Using Machine Learning methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Econometrics & Machine Learning


1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation

Emmanuel Flachaire Econometrics & Machine Learning


Statistical Modeling: The two Cultures1

There are two cultures in the use of statistical modeling to reach


conclusions from data:
Data Modeling Culture: one assumes that the data are
generated by a given stochastic data model (econometrics)

Algorithmic Modeling Culture: one uses algorithmic models


and treats the data mechanism as unknown (machine learning)

1
Léo Breiman, Statistical Science, 2001, Vol. 16, No. 3, 199-231
Emmanuel Flachaire Econometrics & Machine Learning
Statistical Modeling: The two Cultures

Léo Breiman (Statistical Science, 2001):

. . . an uncritical use of data models.

Emmanuel Flachaire Econometrics & Machine Learning


Misspecification bias
Let’s consider a quite general model:2 y = m(X ) + ε
Assume that X is fixed. The expected (squared) prediction
error, or EPE, is equal to

E(y − ŷ)² = E[m(X) + ε − m̂(X)]² = E[m(X) − m̂(X)]² + Var(ε)

where the first term is the reducible error and the second the irreducible error

The focus of Machine Learning is to estimate m with the aim
of minimizing the reducible error

Reducible error = MSE = [Bias(m̂(X))]² + Var(m̂(X))

Assuming that the data are generated by a specific model, or
that the model is correctly specified, amounts to assuming that
the (misspecification) bias is zero: Bias(m̂) = 0
2
y is a vector and X a matrix of observations, m a function, ε an error term
Emmanuel Flachaire Econometrics & Machine Learning
Misspecification bias: linear model

(Source: Berk, 2016)

Reducible error = mean function error (misspecification bias) + estimation error

Emmanuel Flachaire Econometrics & Machine Learning


Misspecification bias: quadratic model

(Source: Berk, 2016)

Reducible error = mean function error (misspecification bias) + estimation error

Emmanuel Flachaire Econometrics & Machine Learning


Econometrics and Machine Learning

Parametric econometrics: we assume that the data come from


a generating process that takes the following form

y = Xβ + ε

→ probability theory is a foundation of econometrics

Machine learning: we do not make any assumption on how


the data have been generated

y ≈ m(X )

→ probability theory is not required

Nonparametric econometrics makes the link between the two


Machine Learning: an extension of nonparametric econometrics

Emmanuel Flachaire Econometrics & Machine Learning


General Principle : optimization problem

Find the solution m̂ to the optimization problem:

Minimize over m:  Σ_{i=1}^n L(yi, m(Xi))   subject to   ‖m‖_ℓq ≤ t        (1)

which can be rewritten in Lagrangian form, for some λ ≥ 0:

Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq        (2)

where the sum is the loss function and the second term the penalization

The goal is to minimize a loss function under constraint


It is usually done by numerical optimization

Emmanuel Flachaire Econometrics & Machine Learning


General Principle : resolution by numerical optimization
Gradient Descent
Use a linear approximation at each step, from a Taylor expansion

(Source: Watt et al., 2016)

Algorithm: Gradient descent


Input: differentiable function g, fixed step length α, initial point w⁰
Repeat until a stopping condition is met: w^k = w^{k−1} − α g′(w^{k−1})
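As an illustration (not from the original slides), a minimal R sketch of this update rule on a simple convex function; the function g, its derivative, the step length and the starting point are all chosen for the example.

1 # Gradient descent sketch for g(w) = (w - 2)^2 (illustrative example)
2 g      <- function(w) (w - 2)^2
3 gprime <- function(w) 2 * (w - 2)          # derivative g'(w)
4 alpha  <- 0.1                              # fixed step length
5 w      <- 0                                # initial point w^0
6 for (k in 1:100) {
7   w_new <- w - alpha * gprime(w)           # w^k = w^{k-1} - alpha * g'(w^{k-1})
8   if (abs(w_new - w) < 1e-8) break         # stopping condition
9   w <- w_new
10 }
11 w  # close to the minimizer w = 2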
Emmanuel Flachaire Econometrics & Machine Learning
General Principle : resolution by numerical optimization

Newton’s Method
Use a quadratic approximation at each step, from a Taylor expansion3

(Source: Watt et al., 2016)


Converges in fewer steps than gradient descent for convex functions
Does not require a step length to be determined
3
The second-order Taylor series approximation centered at w^k is
h(w) = g(w^k) + g′(w^k)(w − w^k) + ½ g″(w^k)(w − w^k)²
Emmanuel Flachaire Econometrics & Machine Learning
General Principle : resolution by numerical optimization

Newton’s Method
Use quadratic approximations at each steps, from Taylor expansion

(Source: Watt et al., 2016)


It is used to find stationary points of a function: g′(w) = 0.
It can converge to a maximum where the function is concave.
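For comparison, a minimal R sketch of Newton's update on a one-dimensional quadratic (the choices of g′, g″ and starting point are illustrative, not from the original slides):

1 # Newton's method sketch for g(w) = w^2 - 4*w + 5 (illustrative example)
2 gprime  <- function(w) 2 * w - 4           # g'(w)
3 gsecond <- function(w) 2                   # g''(w)
4 w <- 10                                    # initial point
5 for (k in 1:50) {
6   w_new <- w - gprime(w) / gsecond(w)      # step from the quadratic approximation
7   if (abs(w_new - w) < 1e-10) break
8   w <- w_new
9 }
10 w  # stationary point: g'(w) = 0, here w = 2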

Emmanuel Flachaire Econometrics & Machine Learning


Regression: a simple Machine Learning method
Machine Learning (ML): solve the optimization problem
Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq   (loss function + penalization)

Let us consider:
L = ℓ2 (Euclidean distance): L(yi, m(Xi)) = (yi − m(Xi))²
m is a linear function of the parameters: yi ≈ Xiβ with β ∈ Rᵖ
no penalization: λ = 0
Thus, we have:

β̂ = argmin over β:  Σ_{i=1}^n (yi − Xiβ)²

It is the minimization of the Sum of Squared Residuals (SSR) in a
linear regression model; that is, β̂ is the OLS estimator.4
4
A Gradient Descent method can be used to solve this optimization problem.
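A small R sketch (not from the slides) of this equivalence: minimizing the sum of squared residuals numerically with optim() gives essentially the same coefficients as lm(); the simulated data are purely illustrative.

1 # OLS as a loss-minimization problem (illustrative simulated data)
2 set.seed(1)
3 n <- 100
4 x <- runif(n)
5 y <- 1 + 2 * x + rnorm(n)
6 ssr <- function(beta) sum((y - beta[1] - beta[2] * x)^2)  # loss function, lambda = 0
7 optim(c(0, 0), ssr)$par                                   # numerical minimizer, close to ...
8 coef(lm(y ~ x))                                           # ... the OLS estimates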
Emmanuel Flachaire Econometrics & Machine Learning
Linear regression from a Machine Learning perspective

Let us consider the simple linear regression model:

y = β0 + β1 x + ε (3)

From a Machine Learning perspective:


The linear regression provides the best straight line
approximation of the relationship between y and x 5
OLS estimators are obtained by minimizing prediction errors.
No probability theory is required!

Econometrics puts statistical assumptions on (3) in order to derive


properties of the OLS estimators and to make inference.6

5
In the sense that it minimizes prediction errors
6
convergence, unbiased/biased estimators, BLUE, statistical tests, etc.
Emmanuel Flachaire Econometrics & Machine Learning
Classification: a simple Machine Learning method

(Source: Watt et al., 2016)


we aim to learn a hyperplane Xβ = 0 (shown here in black)
to separate feature representations of the two classes.7
left panel: perfect linear separation
right panel: two overlapping classes → minimize the number
of misclassified points that end up in the wrong half-space.
7
X = [ι, x1, x2] is an n × 3 matrix.
Emmanuel Flachaire Econometrics & Machine Learning
Classification: the perceptron

A hyperplane placing the points on their correct side satisfies:

Xiβ > 0 if yi = +1
Xiβ < 0 if yi = −1

In other words, with y ∈ {−1, +1}:

if yi is correctly classified: yi(Xiβ) > 0
if yi is misclassified: yi(Xiβ) < 0

To minimize the aggregated distance of misclassified points to the
hyperplane, we can solve

Minimize over β:  Σ_{i=1}^n max(0, −yi(Xiβ))

where max(0, −yi(Xiβ)) is the perceptron or max loss function.
Emmanuel Flachaire Econometrics & Machine Learning
Classification: smooth version of the perceptron

(Source: Watt et al., 2016)


The perceptron loss function is non-differentiable (in green).8
The softmax loss function is a smooth approximation (black):9
g(s) = softmax(0, s) = log(1 + e^s)
8
g(s) = max(0, s)
9
g(s) = log(1 + e^s). Gradient descent and Newton's methods can be used
Emmanuel Flachaire Econometrics & Machine Learning
Classification: softmax and perceptron

(Source: Watt et al., 2016)


Minimizing the softmax loss function gives β̂, which defines:
the linear separator Xβ̂ = 0 shown in the left panel,
the surface y(x) = 2Λ(Xβ̂) − 1 shown in the right panel.
The softmax model is a smooth approximation of the perceptron
Emmanuel Flachaire Econometrics & Machine Learning
Classification: logit regression and perceptron
Minimizing the softmax loss function:10

Minimize over β:  Σ_{i=1}^n log(1 + e^{−yi(Xiβ)})

is equivalent to maximizing the log-likelihood in a logit model:

Maximize over β:  Σ_{i=1}^n [ y′i log Λ(Xiβ) + (1 − y′i) log(1 − Λ(Xiβ)) ]

with y′i ∈ {0, 1} and Λ(x) = e^x/(1 + e^x) = 1/(1 + e^{−x}) the logistic function11

→ Logit model = softmax model
→ The logit model is a smooth approximation of the perceptron
10
softmax(0, −yi(Xiβ)) = log(1 + e^{−yi(Xiβ)})
11
log Λ(Xiβ) = −log(1 + e^{−Xiβ}) and log(1 − Λ(Xiβ)) = −log(1 + e^{Xiβ})
Emmanuel Flachaire Econometrics & Machine Learning
Classification: a simple Machine Learning method
Machine Learning: solve the optimization problem
Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq   (loss function + penalization)

Let us consider:12
the softmax loss function: L = softmax(0, −yi(Xiβ))
no penalization: λ = 0.
Thus, we have:13

β̂ = argmin over β:  Σ_{i=1}^n log(1 + e^{−yi(Xiβ)})

which is equivalent to maximizing the log-likelihood in a logit regression
model; that is, β̂ is the MLE estimator.
12
yi ≈ m(Xi) = 1±(Xiβ ≥ 0) = {+1 if Xiβ ≥ 0 ; −1 if Xiβ < 0}
13
A Gradient Descent method can be used to solve this optimization problem.
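A minimal R sketch (not from the slides) of this equivalence: minimizing the softmax loss with y ∈ {−1, +1} recovers the coefficients of a logit regression fitted by glm() on y′ ∈ {0, 1}; the simulated data are illustrative.

1 # Softmax-loss minimization vs. logit MLE (illustrative simulated data)
2 set.seed(1)
3 n  <- 500
4 x  <- rnorm(n)
5 yp <- rbinom(n, 1, plogis(-0.5 + x))     # y' in {0,1}
6 y  <- 2 * yp - 1                         # y  in {-1,+1}
7 softmax_loss <- function(beta) sum(log(1 + exp(-y * (beta[1] + beta[2] * x))))
8 optim(c(0, 0), softmax_loss)$par         # close to ...
9 coef(glm(yp ~ x, family = binomial))     # ... the logit MLE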
Emmanuel Flachaire Econometrics & Machine Learning
Logit regression from a Machine Learning perspective

Let us consider the logit regression model:14

E(y′|X) = P(y′ = 1|X) = Λ(Xβ)        (4)

From a Machine Learning perspective:


The logit model is a smooth approximation of the perceptron
MLE estimator is obtained by minimizing classification errors.
No probability theory is required!

Econometrics puts statistical assumptions on (4) in order to derive


properties of the MLE estimator and to make inference.

14
Since y′ ∈ {0, 1}, E(y′|X) = 0 × P(y′ = 0|X) + 1 × P(y′ = 1|X)
Emmanuel Flachaire Econometrics & Machine Learning
Linear/Logit models from a Machine learning perspective

Optimal parameters: Minimize prediction/classification errors


The convenience of convexity:

(Source: Watt et al., 2016)

A unique solution is easily obtained numerically/analytically.


Using probability theory, properties of the optimal parameters
are derived and inference can be drawn (Econometrics)

Emmanuel Flachaire Econometrics & Machine Learning


Moving beyond linearity: Regression

(Source: Watt et al., 2016)

Non-linearity (left panel) and interaction effects (right panel).


Knowledge-driven feature design is used in Econometrics.15
Automatic feature design is used in Machine Learning
15
fixed transformed covariates: quadratic, cubic, etc. and/or cross-products
Emmanuel Flachaire Econometrics & Machine Learning
Moving beyond linearity: Classification

(Source: Watt et al., 2016)

Non-linearity (left panel) and interaction effects (right panel).


Knowledge-driven feature design is used in Econometrics.16
Automatic feature design is used in Machine Learning
16
fixed transformed covariates: quadratic, cubic, etc. and/or cross-products
Emmanuel Flachaire Econometrics & Machine Learning
Nonparametric Econometrics

Machine Learning:
High non-linearity and strong interaction effects are taken into
account with automatic feature design.
In general, a non-convex function is minimized.

Nonparametric Econometrics:
A nonparametric regression takes such effects into account.
It may work well in low dimensions, but not in high dimensions.17

Machine Learning is an extension of Nonparametric Econometrics.

17
Because of the curse of dimensionality. Note that GAM models may
capture automatically non-linearities, but not interaction effects.
Emmanuel Flachaire Econometrics & Machine Learning
1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation

Emmanuel Flachaire Econometrics & Machine Learning


General Principle

Machine Learning: solve the optimization problem


Minimize over m:  Σ_{i=1}^n L(yi, m(Xi)) + λ ‖m‖_ℓq   (loss function + penalization)

Choice of the loss function:


L → conditional mean, quantiles, expectiles
m → linear, logit, splines, tree-based models, neural networks

Choice of the penalization:


`q → lasso, ridge
λ → over-fitting, under-fitting, cross validation

Emmanuel Flachaire Econometrics & Machine Learning


Loss function: Tradeoff between flexibility & interpretability

(Source: James et al., 2013)


Emmanuel Flachaire Econometrics & Machine Learning
Over-fitting

A model with high flexibility may fit the observations used for
estimation perfectly, but new observations very poorly

[Figure: simulated data with the true model and a very flexible fit (λ = 0) that interpolates the observations]

→ penalization: put a price to pay for having a more flexible model

Emmanuel Flachaire Econometrics & Machine Learning


Under-fitting

If we put a huge cost for a more complex model, λ = ∞, we


obtain a linear regression model
[Figure: the same simulated data with the true model and the fit obtained with λ = ∞, a straight line]

→ if the cost is too large: low variance, but high bias

Emmanuel Flachaire Econometrics & Machine Learning


Penalization: Tradeoff between bias & variance

Penalization: put a price to pay for having a more flexible model

λ = 0: it interpolates the data → low bias, high variance
λ = ∞: linear model → high bias, low variance

→ the penalty parameter λ ≡ bias/variance tradeoff

Role of λ: avoid over-fitting and poor prediction with new data

Choice of λ: automatic selection procedures are based on model’s


performance evaluated out-sample, by cross-validation

Emmanuel Flachaire Econometrics & Machine Learning


1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation

Emmanuel Flachaire Econometrics & Machine Learning


Model assessment

The best model has the lowest prediction error. With squared
error loss, the mean squared prediction error is equal to
MSE = (1/n) Σ_{i=1}^n (yi − m̂λ(xi))² = SSR/n

Due to overfitting, we cannot use SSR and R² based on the


sample used for estimation (≡ in-sample, training sample)

We are interested in the accuracy of the predictions obtained


from previously unseen data (≡ out-sample, test sample)

The in-sample MSE (training error) can be a poor estimate of


the out-sample MSE (test error)

Emmanuel Flachaire Econometrics & Machine Learning


Model assessment

In order to select the best model with respect to test error, we


need to estimate this test error (out-sample MSE)
There are two common approaches:
We can indirectly estimate test error by making an adjustment
to the training error to account for the bias due to overfitting
→ penalization ex-post: R²adj, AIC, BIC
We can directly estimate the test error, using either a
validation set approach or a cross-validation approach
→ penalization ex-ante

CV provides a direct estimate of test error, makes fewer


assumptions about the true model and can be used widely

In the past, performing CV was computationally prohibitive.


Nowadays, the computations are hardly ever an issue
Emmanuel Flachaire Econometrics & Machine Learning
Out-sample: Validation set

(Source: James et al., 2013)


We randomly split the sample into two groups of observations: a
training set (q − 1 obs.) and a validation/test set (n − q obs.)

1 estimation, n − q values → MSE = 1/(n − q) Σ_{i=q}^n (yi − ŷi)²
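A minimal R sketch of the validation-set approach (not from the slides); the 50/50 split, the polynomial model and the simulated data are illustrative.

1 # Validation-set approach: fit on a training set, compute MSE on a test set
2 set.seed(1)
3 n     <- 200
4 x     <- runif(n)
5 y     <- sin(12 * (x + .2)) / (x + .2) + rnorm(n) / 2
6 train <- sample(1:n, size = n / 2)                 # random split
7 fit   <- lm(y ~ poly(x, 5), subset = train)        # estimate on the training set only
8 pred  <- predict(fit, newdata = data.frame(x = x[-train]))
9 mean((y[-train] - pred)^2)                         # out-sample (test) MSE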

Emmanuel Flachaire Econometrics & Machine Learning


Cross-Validation: LOOCV or n-fold CV18

(Source: James et al., 2013)

n estimations, n values → MSE = (1/n) Σ_{i=1}^n (yi − ŷi)²
18
LOOCV: leave-one-out cross-validation
Emmanuel Flachaire Econometrics & Machine Learning
Cross-Validation: K -fold CV

(Source: James et al., 2013)


K estimations, n values → MSE = (1/n) Σ_{k=1}^K Σ_{i∈Gk} (yi − ŷi)²
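A minimal R sketch of K-fold cross-validation written by hand (not from the slides), with K = 5 and illustrative simulated data:

1 # 5-fold cross-validation by hand (illustrative simulated data)
2 set.seed(1)
3 n    <- 200
4 x    <- runif(n)
5 y    <- sin(12 * (x + .2)) / (x + .2) + rnorm(n) / 2
6 K    <- 5
7 fold <- sample(rep(1:K, length.out = n))           # random fold labels G_1,...,G_K
8 pred <- numeric(n)
9 for (k in 1:K) {
10   fit <- lm(y ~ poly(x, 5), subset = (fold != k)) # estimate without fold k
11   pred[fold == k] <- predict(fit, newdata = data.frame(x = x[fold == k]))
12 }
13 mean((y - pred)^2)                                 # CV estimate of the test MSE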

Emmanuel Flachaire Econometrics & Machine Learning


Prediction error in-sample vs. out-sample
← underfit λ−1 overfit →

(Source: Hastie et al., 2009)

Underfitting: the model performs poorly on training and test samples


Overfitting: performs well on training sample, but generalizes poorly on test sample

Emmanuel Flachaire Econometrics & Machine Learning


Standardization and Normalization
Several ML methods are sensitive to the units of the covariates,
such as Ridge/Lasso regressions, SVM and Neural Networks

The results may differ substantially when multiplying a given


covariate by a constant (meters/kilometers, kilograms/grams)

It is best to standardize the data before using these methods:


x/√Var(x)   or   (x − x̄)/√Var(x)

so that they are all on the same scale

Normalization is another scaling technique where the values
end up ranging between 0 and 1:

(x − xmin)/(xmax − xmin)
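In R this can be done with scale() or by hand; a minimal sketch on illustrative data:

1 # Standardization and normalization of a covariate (illustrative data)
2 x <- c(2, 5, 9, 14, 20)
3 x / sd(x)                          # standardization without centering
4 (x - mean(x)) / sd(x)              # standardization of the centered variable
5 (x - min(x)) / (max(x) - min(x))   # normalization to [0, 1]
6 scale(x)                           # scale() centers and standardizes by default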

Emmanuel Flachaire Econometrics & Machine Learning


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Ridge and Lasso Regressions


Introduction

Linear regression

y = Xβ + ε n observations, p covariates

Least Squares
Collinearity or many irrelevant covariates → high variance
More covariates than observations, p > n → undefined

Ridge and Lasso


Collinearity, many irrelevant covariates → smaller variance
High-dimensional data analysis, p  n → feasible

Emmanuel Flachaire Ridge and Lasso Regressions


Shrinkage Methods

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|^q

It is equivalent to minimizing SSR subject to Σ_{j=1}^p |βj|^q ≤ c

No penalization corresponds to OLS unbiased estimation

The penalization restricts the magnitude of the coefficients
It shrinks the coefficients toward 0 as λ increases (or c decreases)
It introduces some bias in the coefficients

→ Add some bias if it leads to a substantial decrease in variance

Emmanuel Flachaire Ridge and Lasso Regressions


Shrinkage Methods

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|^q

It is equivalent to minimizing SSR subject to Σ_{j=1}^p |βj|^q ≤ c

Idea: biased coefficients may result in a model with smaller MSE

The penalty term λ governs a bias-variance tradeoff
λ is selected by cross-validation (MSE out-sample)

Overall, use shrinkage methods when OLS exhibits large variance


(with many irrelevant or highly correlated covariates)

Emmanuel Flachaire Ridge and Lasso Regressions


Standardization

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|^q

It is equivalent to minimizing SSR subject to Σ_{j=1}^p |βj|^q ≤ c

The results are sensitive to the scale of the covariates


If a covariate is divided by 10, its coefficient is multiplied by
10, which has an impact on the constraint
It is best to standardize covariates before using shrinkage
methods, so that they are all on the same scale:
x/√Var(x)   or   (x − x̄)/√Var(x)

Emmanuel Flachaire Ridge and Lasso Regressions


Ridge Regression

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p βj²

Ridge = shrinkage method based on the ℓ2 norm (q = 2)

The restriction is convex and makes the problem easy to solve:

β̂ = (X⊤X + λIp)⁻¹ X⊤y

where Ip is the p × p identity matrix

λ > 0: (X⊤X + λIp) is non-singular even if X is not of full rank
The Ridge method is defined in high-dimensional problems, p ≫ n
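A minimal R sketch (not from the slides) of the closed-form expression, on illustrative simulated data with standardized covariates and the intercept ignored; note that glmnet parameterizes λ differently, so the point here is only the formula itself.

1 # Ridge closed form: beta_hat = (X'X + lambda I_p)^{-1} X'y (illustrative data)
2 set.seed(1)
3 n <- 50; p <- 5
4 X <- scale(matrix(rnorm(n * p), n, p))             # standardized covariates
5 y <- X %*% c(2, -1, 0, 0, 1) + rnorm(n)
6 lambda <- 10
7 solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # ridge coefficients
8 solve(t(X) %*% X, t(X) %*% y)                      # OLS for comparison (lambda = 0)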

Emmanuel Flachaire Ridge and Lasso Regressions


Lasso Regression

Minimize over α, β:  Σ_{i=1}^n (yi − α − Xiβ)² + λ Σ_{j=1}^p |βj|

Lasso = shrinkage method based on the ℓ1 norm (q = 1)

The restriction is convex and makes the problem easy to solve
numerically, but there is no closed-form expression as in ridge
The nature of the constraint will cause some coefficients to be
exactly zero, for λ sufficiently large (or c sufficiently low)
Lasso performs variable selection with many irrelevant variables
Lasso is appropriate for sparse models, in which only a
relatively small number of covariates play an important role

Emmanuel Flachaire Ridge and Lasso Regressions


Lasso vs. Ridge

(Source: Hastie et al., 2015)

Unlike the Ridge constraint, the Lasso constraint has corners


If the solution occurs at a corner, it has one parameter equal to 0
Emmanuel Flachaire Ridge and Lasso Regressions
Lasso vs. Ridge

(Source: Hastie et al., 2015)


The x-axis is the factor c, from |β1|^q + |β2|^q ≤ c, normalized to 1
Lasso: many coefficients are exactly zero for low c → variable selection
Emmanuel Flachaire Ridge and Lasso Regressions
Lasso and Variable Selection

Lasso constraint: Σ_{j=1}^p |βj| ≤ c

The optimal c for prediction and variable selection are different:

For variable selection, the optimal parameter c shrinks


non-zero coefficients toward zero → bias
For prediction, the optimal parameter c is often larger than
for selection, to reduce the bias on non-zero coefficients
Lasso selects λ or c by CV, based on MSE → for prediction
Lasso often includes too many variables (c is often too large)
But the true model is very likely a subset of these variables

Emmanuel Flachaire Ridge and Lasso Regressions


The one standard error rule

Breiman et al. (1984) proposed a rule-of-thumb:19

Lasso selects λ by CV, based on MSE → for prediction


Consider values of λ whose CV error is within one standard error of the minimum
Pick the largest value of λ within this interval (smallest c)

The main idea of the 1 SE rule is to choose the most parsimonious
model whose accuracy is comparable with the best model

It is a rule-of-thumb, expected to provide a value of λ in between


the optimal one for prediction and the optimal one for selection

19
Breiman, Friedman, Stone, Olshen (1984) Classification and Regression
Trees
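With cv.glmnet, used in the application a few slides below, both values are returned directly; a minimal sketch, assuming x and y are the covariate matrix and response defined there:

1 # lambda.min vs. lambda.1se in cv.glmnet (assumes x, y as in the application below)
2 library(glmnet)
3 cv.lasso <- cv.glmnet(x, y, alpha = 1)
4 cv.lasso$lambda.min                # lambda minimizing the CV MSE (prediction)
5 cv.lasso$lambda.1se                # largest lambda within one SE (1 SE rule)
6 coef(cv.lasso, s = "lambda.1se")   # typically a sparser model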
Emmanuel Flachaire Ridge and Lasso Regressions
Simulation results with uncorrelated covariates

Linear regression model with many covariates, n = 1000


Monte Carlo simulation with 1000 replications
λ̂min is selected by CV, λ̂1se with the 1 SE rule
Potency: proportion of relevant variables selected
Gauge: proportion of irrelevant variables selected

→ Lasso with λ̂min selects 29.9% of irrelevant variables


→ Lasso with λ̂1se selects 3.2% of irrelevant variables, but the MSE increases

Emmanuel Flachaire Ridge and Lasso Regressions


Simulation results with correlated covariates

Linear regression model with many covariates, n = 1000


Monte Carlo simulation with 1000 replications
λ̂min is selected by CV, λ̂1se with the 1 SE rule
Potency: proportion of relevant variables selected
Gauge: proportion of irrelevant variables selected

→ Lasso with λ̂min selects 64.3% of irrelevant variables


→ Lasso with λ̂1se selects 45.8% of irrelevant variables, but the MSE increases

Emmanuel Flachaire Ridge and Lasso Regressions


Adaptive Lasso

The Adaptive Lasso is based on the following constraint:20


Σ_{j=1}^p wj |βj| ≤ c   where wj = 1/|β̂j|^ν

where β̂j is the OLS estimate and ν > 0.

Smaller weights are put on larger coefficients in the constraint
Large non-zero coefficients shrink more slowly to zero as c decreases
This leads to the oracle property, simultaneously achieving
Consistent variable selection
Optimal estimation/prediction

ν is often set equal to 1, but it could be selected by CV


20
Zou (2006), JASA, 101 1418-1429
Emmanuel Flachaire Ridge and Lasso Regressions
Elastic-net

The Elastic-net is based on the following constraint:21


Σ_{j=1}^p ( r βj² + (1 − r)|βj| ) ≤ c

where r = 1 corresponds to the Ridge and r = 0 to the Lasso.


Lasso may perform poorly with highly correlated covariates,
which is often encountered in high-dimensional data analysis
By combining a `2 -penalty with the `1 -penalty, we obtain a
method that deals better with such correlated groups, and
tends to select the correlated covariates (or not) together.
Like Lasso, Elastic-net often includes too many covariates

21
Zou and Hastie (2005), JRSS Series B, 67, 301-320. It corresponds to the
penalization λ1 Σ_{j=1}^p βj² + λ2 Σ_{j=1}^p |βj|, where λ1 and λ2 are selected by CV.
Emmanuel Flachaire Ridge and Lasso Regressions
Adaptive Elastic-net

The Adaptive Elastic-net is based on the following constraint:22


Σ_{j=1}^p ( r βj² + (1 − r) wj |βj| ) ≤ c

where wj = 1/(|β̂j| + 1/n)^ν and ν > 0.23

Adaptive Lasso has the oracle property (consistent variable selection),


but inherits the instability of Lasso for high-dimensional data
Elastic-net deals better in high-dimensional data analysis,
but it lacks the oracle property
Adaptive Elastic-net combines the two approaches
22
Zou and Zhang (2009), Annals of Statistics, 37, 1733-1751. It corresponds to
the penalization λ1 Σ_{j=1}^p βj² + λ2 Σ_{j=1}^p wj |βj|, where λ1, λ2 are selected by CV
23
1/n in wj is used to avoid division by 0
Emmanuel Flachaire Ridge and Lasso Regressions
Application: Predict baseball player’s Salary
What predictors are associated with baseball player’s Salary?
Salary – 1987 annual salary on opening day in thousands of dollars;
Years – Number of years in the major leagues;
Hits – Number of hits in 1986;
Atbat – Number of times at bat in 1986;
...
library(ISLR)
Hitters=na.omit(Hitters)
x=model.matrix(Salary~., Hitters)[,-1]
y=Hitters$Salary
# Ridge and Lasso
library(glmnet)
ridge.model=glmnet(x, y, alpha=0)
lasso.model=glmnet(x, y, alpha=1)
par(mfrow=c(1,2))
plot(ridge.model, main="Ridge")
plot(lasso.model, main="Lasso")

By default, the covariates are standardized, otherwise use the argument


standardize=FALSE in the function glmnet
Emmanuel Flachaire Ridge and Lasso Regressions
Application: Coefficient paths

Coefficient paths for Ridge and Lasso as c increases

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Cross Validation
cv.ridge=cv.glmnet(x, y, alpha=0)
cv.lasso=cv.glmnet(x, y, alpha=1)
plot(cv.ridge, main="Ridge")
plot(cv.lasso, main="Lasso")

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Adaptive Lasso and Adaptive Elastic-net
ols=lm(Salary~., Hitters)
w=1/abs(coef(ols)[-1])   # weights from the OLS coefficients, dropping the intercept
cv.adalasso <- cv.glmnet(x, y, alpha=1, penalty.factor=w)
cv.adaelast <- cv.glmnet(x, y, alpha=.5, penalty.factor=w)
plot(cv.adalasso, main="Adaptive Lasso")
plot(cv.adaelast, main="Adaptive Elastic-net")

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Compare the coefficients

coef1=coef(ols)
coef2=coef(cv.ridge, s="lambda.min")
coef3=coef(cv.lasso, s="lambda.min")
coef4=coef(cv.lasso, s="lambda.1se")
coef5=coef(cv.adalasso, s="lambda.min")
coef6=coef(cv.adaelast, s="lambda.min")
options(scipen=999)   # disable scientific notation
coeff=cbind(coef1, coef2, coef3, coef4, coef5, coef6)
colnames(coeff) <- c("ols", "ridge", "lasso", "lasso.1se", "adaLasso", "adaElastic")
coeff

Emmanuel Flachaire Ridge and Lasso Regressions


Application: Compare the coefficients

Shrinkage methods: the coefficients are shrunk towards zero


Variable selection is sensitive to the method
Emmanuel Flachaire Ridge and Lasso Regressions
Summary

Ridge and Lasso can be used in high dimensions (p ≫ n)


They are based on a bias-variance tradeoff
Tradeoff selected minimizing MSE out-sample by CV
Sparse models: Lasso is a variable selection method
Ridge assigns similar coefficients to strongly correlated variables,
while Lasso selects one randomly
Extension to adaptive Lasso and Elastic-net

Emmanuel Flachaire Ridge and Lasso Regressions


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree
y ∈ {0, 1} is a qualitative variable

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree: Principle from a small sample

Find the best rule on a single variable to classify black/white balls

→ find a cutoff on x1 or x2 such that the maximum number of


observations is correctly classified 24

24
See https://freakonometrics.hypotheses.org/52776
Emmanuel Flachaire Classification and Regression Tree (CART)
Classification Tree: graphical representation
Minimizing misclassification, we find x2 < k, where k ∈ (0.56, 0.8)

This Figure represents the best split in a competition between all


possible splits of x1 and x2.
From this simple rule, two balls are misclassified . . . we can try to
find a new rule in the white-area sub-group . . .
Emmanuel Flachaire Classification and Regression Tree (CART)
Classification Tree: A sum of simple rules

The additional rule x1 ≥ c, where c ∈ (.16, .2), produces the best


subsequent split:

Using these two rules, all balls are correctly classified

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree: Extension to large sample

Interpretation is quite easy and intuitive

We use recursive binary splitting to grow a tree

A tree can grow until every observation is correctly classified

With a large sample, a tree may have many nodes, that is,
many points where a predictor is split into two leaves

Note that this principle is easy to apply, even with several


regressors and several classes

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification Tree: Example with 100 observations

The resulting tree is quite complex and not so easy to interpret

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification: Tree pruning

An unpruned tree:
classifies correctly every observation from a training sample
may classify poorly observations from a test sample
(it is the standard problem of overfitting)
may be difficult to interpret

A pruned tree:
is a smaller tree with fewer splits
might perform better on a test sample
might have an easier interpretation

→ define a criterion for making binary splits

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification: Tree pruning

A fully grown tree fits perfectly training data, poorly test data

Tree pruning is used to control the problem of overfitting

A smaller tree with fewer splits might lead to lower variance


and better interpretation at the cost of a little bias

Poor strategy: Add new split only if it is worthwhile to do so

However, a poor split could be followed by a very good split

Good strategy: Grow a very large tree and prune it back in


order to obtain a subtree, keep a split only if it is worthwhile

→ We need to define what ”only if it is worthwhile” means

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification tree: Gini impurity index
Classification: we are not only concerned with class prediction, but also
with class proportions → minimize impurity rather than misclassification

Gini impurity index at some node N :25


G(N) = Σ_{k=1}^K pk(1 − pk) = 1 − Σ_{k=1}^K pk²

with pk the fraction of items labeled with class k in the node

Node: 100-0 or 0-100 → minimal impurity/diversity: G = 0,
Node: 50-50 → maximal impurity/diversity: G = 1/2.26

A small value means that a node contains predominantly


observations from a single class (homogeneity)
25
Another index is the Entropy "impurity" index E(N) = −Σ_{k=1}^K pk log pk
26
With 100-0: p1 = 1, p2 = 0 ; 0-100, p1 = 0, p2 = 1 ; 50-50: p1 = p2 = 1/2
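A minimal R sketch (not from the slides) computing the Gini impurity of a node from its class counts:

1 # Gini impurity of a node from class counts (illustrative counts)
2 gini <- function(counts) {
3   p <- counts / sum(counts)      # class proportions p_k
4   1 - sum(p^2)                   # G(N) = 1 - sum_k p_k^2
5 }
6 gini(c(100, 0))    # pure node: G = 0
7 gini(c(50, 50))    # maximally mixed node with two classes: G = 0.5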
Emmanuel Flachaire Classification and Regression Tree (CART)
Classification: Tree pruning

If we split the node into two leaves, NL and NR , the Gini


impurity index becomes

G(NL, NR) = pL G(NL) + pR G(NR)

where pL, pR are the proportions of observations in NL, NR

When do we split? . . . when impurity is reduced substantially:

∆ = G(N) − G(NL, NR) > ε

we can also require a minimum number of observations per node

How do we split? . . . find the cutoff on a single variable that
minimizes impurity rather than misclassification (→ max ∆)

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification tree: Limitation

Small change in the original sample ⇒ Tree may differ significantly


Emmanuel Flachaire Classification and Regression Tree (CART)
Application: Predict survival on the Titanic
Consider passenger data from the sinking of the Titanic
What predictors are associated with those who perished
compared to those who survived?
survived – 1 if true, 0 otherwise;
sex – the gender of the passenger;
age – age of the passenger in years;
pclass – the passenger’s class of passage;
sibsp – the number of siblings/spouses aboard;
parch – the number of parents/children aboard.

library(PASWR)        # get the data
data(titanic3)
library(rpart)        # CART package
library(rpart.plot)   # fancy plots
tree <- rpart(survived ~ sex+age+pclass+sibsp+parch, data=titanic3, method="class")
prp(tree, extra=1, faclen=5, box.col=c("indianred1", "aquamarine")[tree$frame$yval])

Emmanuel Flachaire Classification and Regression Tree (CART)


Application: Titanic classification tree

Emmanuel Flachaire Classification and Regression Tree (CART)


Application: Variable importance
The importance of each variable, related to the gain in Gini, is
tree$variable.importance

      sex    pclass     sibsp       age     parch
172.74924  50.78568  27.33127  20.95528  20.46938

that we can plot using

barplot(tree$variable.importance, horiz=TRUE, col="yellow3")

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree
y ∈ R is a quantitative variable

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with one covariate
Find the best split, which minimizes deviations to the mean
(variances) in each leaf:

→ find the cutoff on x that minimizes:  Σ_{xi∈NL} (yi − ȳL)² + Σ_{xi∈NR} (yi − ȳR)²

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with one covariate
Then use recursive binary splitting:

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with two covariates
With two covariates, y ≈ m(x1 , x2 ), we have:

(Source: James et al., 2013)


Find boxes R1, . . . , RJ that minimize the SSR:1  Min Σ_{j=1}^J Σ_{xi∈Rj} (yi − ȳRj)²
1
We cannot consider every possible partition → use recursive binary splitting
Emmanuel Flachaire Classification and Regression Tree (CART)
Application: Predict baseball player’s Salary

Let’s consider 3 covariates: y ≈ m(x1 , x2 , x3 )

What predictors are associated with baseball player’s Salary?


Salary – 1987 annual salary on opening day in thousands of dollars;
Years – Number of years in the major leagues;
Hits – Number of hits in 1986;
Atbat – Number of times at bat in 1986;

library(ISLR)
# remove observations that are missing Salary values
df=Hitters[complete.cases(Hitters$Salary),]
# load CART library
library(rpart)
library(rpart.plot)
# estimate the tree
tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df, cp=0)
# plot the tree
prp(tree, extra=1, faclen=5)

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression Tree: Principle with several covariates
With more covariates, we can only use the decision tree figure:

based on the same principle: find terminal nodes that min SSR
Emmanuel Flachaire Classification and Regression Tree (CART)
Regression: Tree pruning

A fully grown tree fits perfectly training data, poorly test data

Tree pruning is used to control the problem of overfitting

A smaller tree with fewer splits might lead to lower variance


and better interpretation at the cost of a little bias

Poor strategy: Build the tree only so long as the decrease in


the SSR due to each split exceed some threshold

However, a poor split could be followed by a very good split

Good strategy: Grow a very large tree and prune it back in


order to obtain a subtree, keep a split only if it is worthwhile

Emmanuel Flachaire Classification and Regression Tree (CART)


Regression: Tree Pruning

Penalization: we put a price to pay for having a tree with
many terminal nodes J, or regions,

Min Σ_{j=1}^J Σ_{i∈Rj} (yi − ȳRj)² + λ J

For given λ, we can find the subtree minimizing this


criterion27

Cross-validation: we select λ using cross validation

A smaller tree with fewer splits might lead to lower variance


and better interpretation at the cost of a little bias

27
λ is called the complexity parameter
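In rpart, the complexity parameter cp plays the role of λ (rescaled by the root deviance) and the cross-validated error for each subtree is stored in the cptable; a minimal sketch, assuming the data frame df and the rpart/rpart.plot packages from the earlier baseball application:

1 # Cost-complexity pruning with rpart (assumes df, rpart and rpart.plot as above)
2 tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df, cp=0)   # grow a large tree
3 printcp(tree)                      # cross-validated error for each value of cp
4 best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
5 pruned  <- prune(tree, cp=best.cp) # subtree minimizing the CV error
6 prp(pruned, extra=1, faclen=5)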
Emmanuel Flachaire Classification and Regression Tree (CART)
Tree pruning: Application
tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df)   # CV
prp(tree, extra=1, faclen=5)

Unpruned tree Pruned tree

Emmanuel Flachaire Classification and Regression Tree (CART)


Tree versus linear model

Tree vs. linear model: Which model is better?

It depends on the problem at hand:


Linear regression: m(X) = β0 + Σ_{j=1}^K Xj βj

Regression tree: m(X) = Σ_{j=1}^J cj 1(X ∈ Rj)

If the relationship between y and x1 , ..., xK is linear: a linear


model should perform better

If the relationship between y and x1 , ..., xK is highly non-linear


and complex: a tree model should perform better

Emmanuel Flachaire Classification and Regression Tree (CART)


True decision boundary: linear
[Two-panel figure: logit (left) and tree (right) decision boundaries, axes Variable 1 and Variable 2]

Blue area ≡ fitted values in blue from linear (left) and tree (right) models

Emmanuel Flachaire Classification and Regression Tree (CART)


True decision boundary: nonlinear
[Two-panel figure: logit (left) and tree (right) decision boundaries, axes Variable 1 and Variable 2]

Blue area ≡ fitted values in blue from linear (left) and tree (right) models

Emmanuel Flachaire Classification and Regression Tree (CART)


True decision boundary: interactions
[Two-panel figure: logit (left) and tree (right) decision boundaries, axes Variable 1 and Variable 2]

Blue area ≡ fitted values in blue from linear (left) and tree (right) models

Emmanuel Flachaire Classification and Regression Tree (CART)


Classification And Regression Tree (CART)

Advantages:

Trees tend to work well for problems where there are


important nonlinearities and interactions

The results are really intuitive and can be understood even by


people with no experience in the field

Disadvantage:

Trees are quite sensitive to the original sample (non-robust)

They may have poor predictive accuracy out-sample

Emmanuel Flachaire Classification and Regression Tree (CART)


2. Methods and Algorithms
Ridge and Lasso regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random Forest

How bagging and random forest work intuitively:

Based on your symptoms, suppose a doctor diagnoses an


illness that requires surgery

Instead of asking only one doctor, you may choose to ask several

If one diagnosis occurs more often than any other, you may choose
this one as the final diagnosis

→ the final diagnosis is made based on a majority vote of doctors

Bagging and Random Forest: replace doctors by bootstrap samples

Emmanuel Flachaire Bagging and Random Forest


Bagging: algorithm

Algorithm 1: Bagging
Select number of trees B, and tree depth D;
for b = 1 to B do
generate a bootstrap sample from the original data;
estimate a tree model of depth D on this sample;
end

For instance, with the titanic dataset:


library(rpart); library(rpart.plot)
library(PASWR); data(titanic3)
n = NROW(titanic3$survived)
par(mfrow=c(3,3))
for (i in 1:9) {
  idx = sample(1:n, size=n, replace=TRUE)
  cart = rpart(as.factor(survived) ~ sex+age+pclass+sibsp+parch,
               data=titanic3[idx,], cp=0)   # unpruned
  prp(cart, type=1, extra=1)}

Emmanuel Flachaire Bagging and Random Forest


Bagging: Generate several trees by bootstrapping
[Figure: nine unpruned classification trees, each grown on a different bootstrap sample of the Titanic data; the trees differ noticeably from one bootstrap sample to another]
Emmanuel Flachaire Bagging and Random Forest


Bagging: Why bootstrapping CART model?

Bagging = Bootstrap aggregating

Prediction:

Regression: average the resulting predictions


Classification: take a majority vote

Impact of bootstrapping:

Averaging a set of observations reduces variance28


It reduces variance and hence increases the prediction accuracy
Compared to CART, the results are much less sensitive to the
original sample and show an impressive improvement in accuracy
Loss of interpretability

28
The variance of the mean of the observations, X̄, is given by σ²/n
Emmanuel Flachaire Bagging and Random Forest
Random Forest: algorithm

Algorithm 2: Random Forest


Select number of trees B, subsampling parameter m, tree
depth D;
for b = 1 to B do
generate a bootstrap sample from the original data;
estimate a tree model on this sample;
for each split do
Randomly select m of the original covariates (m < P);
Split the data with the best covariate (among the m);
end
end

→ Random Forest = Bagging + subsample covariates at each node


→ Bagging is a special case of Random Forest, with m = P

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random Forest

Random Forest = Bagging + subsampling covariates at each node

Emmanuel Flachaire Bagging and Random Forest


Random forest: Why subsampling covariates?

Subsampling covariates may sound crazy, but it has a clever rationale:


Suppose there is one very strong covariate in the sample
Most or all trees will use this covariate in the top split
All of the trees will look quite similar to each other
Hence the predictions will be highly correlated

Averaging many highly correlated quantities does not lead to


a large reduction in variance

Random forests overcome this problem by forcing each split to


consider only a subset of the covariates

→ Random forests decorrelate the trees



In practice, default values: m = p/3 in regression and m = √p in classification

Emmanuel Flachaire Bagging and Random Forest


Random forest: Overfitting

There is not much overfitting in random forests . . .

as B increases: average effect over trees → no overfitting

as D increases: overfitting is argued to be minor


”Segal (2004) demonstrates small gains in performance by controlling the
depths of the individual trees grown in random forests. Our experience
is that using full-grown trees seldom costs much, and results in one less
tuning parameter. Figure 15.8 shows the modest effect of depth control
in a simple regression example.” (Hastie et al., 2009, p.596)

The goal is to grow trees with as little bias as possible. The


high variance that would result from deep trees is tolerated
because of the averaging over a large number of trees

. . . However, a simple example shows that it can be problematic

Emmanuel Flachaire Bagging and Random Forest


Random forest: Overfitting . . . a simple example
Let us consider a realistic (simulated) sample
set.seed(1)
n=200
x=runif(n)
y=sin(12*(x+.2))/(x+.2) + rnorm(n)/2

We can fit CART and random forest models:29


library(rpart); library(randomForest)            # packages needed
fit.tr  <- rpart(y ~ x)                          # CART
fit.ba1 <- randomForest(y ~ x)                   # no depth control
fit.ba2 <- randomForest(y ~ x, maxnodes=20)      # depth control

We can plot observations and predicted values:


u=seq(min(x), max(x), length.out=1000)
plot(x, y, col="gray", main="n=200")
lines(u, predict(fit.ba1, data.frame(x=u)), col="green")
lines(u, predict(fit.ba2, data.frame(x=u)), col="red")
lines(u, predict(fit.tr, data.frame(x=u)), col="blue")

We run this code for n = 200 and n = 10000


29
Note that since it is a simple regression, with 1 covariate, then RF=bagging
Emmanuel Flachaire Bagging and Random Forest
Random forest: Overfitting . . . a simple example

→ improvement of random forest over a single regression tree


→ overfitting can be very large without controlling tree depth

Emmanuel Flachaire Bagging and Random Forest


Random forest: Out-of-bag (OOB) observations

No need to perform cross-validation:

By bootstrapping, each tree uses around 2/3 of the obs. The


remaining 1/3 obs are referred to as the out-of-bag (OOB) obs

Use OOB observations for out-sample predictions

We obtain around B/3 out-sample predictions for the i th obs.

average these values (or majority vote) = OOB prediction for i

An OOB-MSE can be computed over all OOB predictions

The OOB approach for estimating the test error is particularly


convenient with large samples, for which CV would be onerous

Emmanuel Flachaire Bagging and Random Forest


Random forest: Tuning parameters

We can use OOB-MSE to tune Random Forest parameters

Depth tree, D: from our previous example, with n = 10000


> randomForest(y ~ x)$mse[500]   # OOB-MSE, no depth control
[1] 0.3252183

> maxnode=c(10, 50, 100, 500, 1000, 2000)
> for (i in 1:NROW(maxnode)) {   # OOB-MSE, depth control
>   aa=randomForest(y ~ x, maxnodes=maxnode[i])$mse[500];
>   print(c(maxnode[i], aa)) }
[1]   10.0000000  0.3747725
[1]   50.0000000  0.2553131
[1]  100.0000000  0.2508479
[1]  500.0000000  0.2570217
[1] 1000.000000   0.268357
[1] 2000.0000000  0.2921307

We can see that the OOB-MSE is smallest with maxnodes=100

Subsampling parameter, m: can be selected similarly

Emmanuel Flachaire Bagging and Random Forest


Random forest: Variable importance

Random forest improves prediction accuracy at the expense of


interpretability . . . the resulting model is difficult to interpret

One can obtain an overall summary of the importance of each


covariate using the SSR (regression) or the Gini index (classification)

Index: record the total amount that the SSR/Gini is decreased


due to splits over a given covariate, averaged over all B trees
> rf <- randomForest(as.factor(survived) ~ sex+age+pclass+sibsp+parch,
+                    data=titanic3, na.action=na.omit)
> importance(rf)
       MeanDecreaseGini
sex           133.75916
age            63.13448
pclass         52.45753
sibsp          18.74009
parch          17.49320

Emmanuel Flachaire Bagging and Random Forest


Random Forests compared to Single Trees (CART)

Source: Breiman (2001)

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random Forest

Advantages:

They tend to work well for problems where there are


important nonlinearities and interactions.

They are robust to the original sample and more efficient than
single trees

Disadvantage:

The results are not intuitive and difficult to interpret.

Emmanuel Flachaire Bagging and Random Forest


Bagging and Random forest: Exercise
Consider the dataset used to predict baseball player’s salary:
library(ISLR)
df=Hitters[complete.cases(Hitters$Salary),]

Create a training set consisting of the first 200 observations, and a


test set consisting of the remaining observations
Perform bagging on the training set for a range of values of the tree
depth D, with B = 1000 trees. Produce a plot with D on the x-axis
and the corresponding test set MSE on the y -axis
Perform random forest on the training set with B = 1000 trees for
several values of the subsampling parameter m, and compute the
corresponding test set MSEs
Compare the test MSE of bagging and random forest to the test
MSE that results from a CART model
Which variables appear to be the most important predictors in the
random forest model?

Emmanuel Flachaire Bagging and Random Forest


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Boosting


Boosting: Principle

Like bagging, boosting involves combining a large number of


decision trees, but the trees are grown sequentially
Boosting does not involve bootstrap sampling; instead each
tree is fit on a modified version of the original data set:
Each tree is fit to the residuals from the previous tree model
Each iteration is then focused on improving previous errors 30
Each tree is shallow (low depth): ”weak” classifier/predictor 31

Boosting combines the outputs of many ”weak” learners


(classifiers, predictors) to produce a powerful ”committee”

30
Each subsequent model pays more attention to the errors from previous
models . . . it is a process that learns from past errors
31
Weak classifier: its error rate is only slightly better than random guessing
Emmanuel Flachaire Boosting
Boosting for Regression
y ∈ R is a quantitative variable

Emmanuel Flachaire Boosting


Boosting: algorithm

Algorithm 3: Boosting for regression trees


Select number of trees B, tree depth D, shrinkage parameter λ;
Set initial predicted values, m̂(x) = 0;
for b = 1 to B do
Compute the residuals, r = y − m̂(x);
Fit a regression tree m̂b (x) of depth D to the data (r , x);
Update the predicted values: m̂(x) ← m̂(x) + λ m̂b (x);
end

→ By fitting trees to the residuals, we seek to improve m̂ in areas


where it does not perform well
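A minimal R sketch of Algorithm 3 with rpart as the tree learner (the values of B, D and λ are illustrative, not the defaults of any package):

library(rpart)
boost_reg <- function(x, y, B = 500, D = 1, lambda = 0.1) {
  mhat  <- rep(0, length(y))                     # initial predicted values m(x) = 0
  trees <- vector("list", B)
  for (b in 1:B) {
    r <- y - mhat                                # residuals of the current fit
    trees[[b]] <- rpart(r ~ x, data = data.frame(r = r, x = x),
                        control = rpart.control(maxdepth = D, cp = 0))
    mhat <- mhat + lambda * predict(trees[[b]])  # shrunken update of m(x)
  }
  list(trees = trees, lambda = lambda, fitted = mhat)
}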

Emmanuel Flachaire Boosting


Boosting: Overfitting

Number of trees, B:

The role of each new (sequential) tree is to improve the fit


Unlike random forests, boosting can overfit if B is too large 32

Depth of trees, D:

In CART, fully-grown or deep trees are known to overfit


Boosting can then overfit if D is too large
Depth tree is usually very small, by default it is often D = 1

→ B and D can be selected by cross-validation

32
By averaging over a large number of trees, bagging and random forests
reduce variability. Boosting does not average over the trees
Emmanuel Flachaire Boosting
Boosting: Shrinkage

Idea behind shrinkage:

Slow down the boosting process to avoid overfitting . . . scale


the contribution of each tree by a factor 0 < λ < 1
A smaller λ typically requires more trees B. It allows more
and different shaped trees to attack the residuals33

→ Small values of D and λ: by fitting small trees to the residuals,


we slowly improve m̂ in areas where it does not perform well 34
→ The boosting approach learns slowly (λ = learning rate)
→ Statistical methods that learn slowly tend to perform well

33
Typical values are λ = 0.01, or λ = 0.001
34
By default, D = 1 and λ = 0.1 in the gbm function in R
Emmanuel Flachaire Boosting
Boosting: a simple regression example
Let us consider a realistic (simulated) sample
set.seed(1)
n = 200
x = runif(n)
y = sin(12*(x + .2))/(x + .2) + rnorm(n)/2

We can fit CART and boosting models:


library(gbm)
nb = 500
# By default: interaction.depth=1 and shrinkage=0.1
fit.bo <- gbm(y ~ x, distribution = "gaussian", n.trees = nb)
fit.tr <- rpart(y ~ x)

We can plot observations and predicted values:


u = seq(min(x), max(x), length.out = 1000)
plot(x, y, col = "gray", main = "n=200", xlab = NA, ylab = NA)
lines(u, predict(fit.tr, data.frame(x = u)), col = "blue")
lines(u, predict(fit.bo, data.frame(x = u), n.trees = nb), col = "purple")

We run this code for n = 200 and n = 10000


Emmanuel Flachaire Boosting
Boosting: a simple regression example

→ Boosting provides nice improvement over single regression tree

Emmanuel Flachaire Boosting


Boosting: Exercise

Consider the previous simple regression example:


Re-run the code with D = 4, B = 1000 and λ = 1. Do you
observe overfitting?
Perform boosting with different values of D, B and λ and
look how sensitive the results are to these choices

Consider the random forest exercise, on baseball player’s salary:


Perform boosting on the training set for a range of values of
the shrinkage parameter λ, with B = 1000 trees and D = 1.
Produce a plot with different shrinkage values on the x-axis
and the corresponding test set MSE on the y-axis.
Compare the test MSE of boosting to the test MSE that
results from bagging, random forest and CART model

Emmanuel Flachaire Boosting


Boosting for Classification
y ∈ {−1, 1} is a qualitative variable

Emmanuel Flachaire Boosting


AdaBoost algorithm
Algorithm 4: AdaBoost
Select the number of trees B, and the tree depth D;
Set initial weights, wi = 1/n;
for b = 1 to B do
Fit a classification tree m̂b (x) to the data using weights wi ;
Update the weights: increase wi if i is misclassified, decrease wi otherwise†;
end
Output: ŷi = sign( ∑_{b=1}^{B} αb m̂b (xi ) )

† If i is misclassified: wi ← wi e^{αb}, where αb = log((1 − errb)/errb) and errb is the model's
weighted misclassification error, errb = ∑_{i=1}^{n} wi 1(yi ≠ m̂b (xi )) / ∑_{i=1}^{n} wi.
If i is correctly classified: wi ← wi.

→ Observ. misclassified have more influence in the next classifier


→ In the output, the contributions from classifiers that fit the data
better are given more weight (a larger αb means a better fit)
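A minimal R sketch of the weight updates of Algorithm 4, using rpart stumps as weak classifiers (y is assumed coded in {−1, 1}; B and the guard on errb are illustrative choices, not part of the slides):

library(rpart)
adaboost <- function(x, y, B = 50) {            # y in {-1, 1}
  n <- length(y); w <- rep(1/n, n)              # initial weights
  alpha <- numeric(B); trees <- vector("list", B)
  for (b in 1:B) {
    trees[[b]] <- rpart(factor(y) ~ x, weights = w,
                        control = rpart.control(maxdepth = 1, cp = 0))
    pred <- ifelse(predict(trees[[b]], type = "class") == "1", 1, -1)
    err  <- sum(w * (pred != y)) / sum(w)       # weighted misclassification error
    err  <- max(min(err, 1 - 1e-10), 1e-10)     # guard against err = 0 or 1 in this sketch
    alpha[b] <- log((1 - err) / err)
    w <- w * exp(alpha[b] * (pred != y))        # increase weights of misclassified obs.
  }
  list(trees = trees, alpha = alpha)            # final class: sign of sum_b alpha_b * m_b(x)
}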
Emmanuel Flachaire Boosting
Schematic illustration of the boosting framework

(Source: Bishop 2006, Pattern recognition and machine Learning, Figure 14.1)

Emmanuel Flachaire Boosting


Boosting vs. bagging

(Source: Internet)

→ Bootstrap samples ≡ Original sample reweighted independently

Emmanuel Flachaire Boosting


Illustration of boosting for classification tree

(Source: Bishop (2006), Pattern recognition and machine Learning, Figure 14.2)

Emmanuel Flachaire Boosting


Generalizations into a unifying framework

Breiman referred to AdaBoost with trees as the ”best


off-the-shelf classifier in the world” (NIPS Workshop, 1996)

Friedman et al. (2000) show that Adaboost fits an additive


model in a base learner, optimizing a novel exponential loss
function, which is very similar to the binomial log-likelihood

They proposed generalizations into a unifying framework,


which includes several loss functions that can be used

They describe loss functions for regression and classification


that are more robust than squared error or exponential loss

→ Gradient boosting

Emmanuel Flachaire Boosting


Stochastic gradient boosting
Algorithm 5: Stochastic gradient boosting
Select number of trees B, tree depth D, shrinkage parameter λ;
for b = 1 to B do
Compute the gradient vector, ri = −∂L(yi , m(xi ))/∂m(xi );
Draw a subset of the original sample (r ∗ , x ∗ );
Fit a regression tree mb (x) of depth D to the data (r ∗ , x ∗ );
Update the predicted values: m(x) ← m(x) + λ mb (x);
end

→ Gradient boosting: Depending the choice of the loss function,


we consider a specific regression or classification model
→ Stochastic:
Shrinkage: slow down the boosting process to avoid overfitting
Subsampling: it reduces the computing time and, in many
cases, it produces a more accurate model (see random forest)
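As an illustration, the gbm call below (a sketch with illustrative tuning values; y and df are placeholders for a response and a data frame of covariates) activates both ingredients: shrinkage is the λ of Algorithm 5 and bag.fraction controls the subsampling of the original sample at each iteration.

library(gbm)
fit.sgb <- gbm(y ~ ., data = df, distribution = "gaussian",
               n.trees = 1000,          # B
               interaction.depth = 2,   # D
               shrinkage = 0.01,        # lambda, the learning rate
               bag.fraction = 0.5)      # fraction of the sample drawn at each iteration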
Emmanuel Flachaire Boosting
Loss functions in regression, y ∈ R
Squared error loss function:

L = ½ (yi − m(xi ))²

for which the gradient vector is the residuals ri = yi − m(xi )

Absolute error loss function, or Laplacian:35

L = |yi − m(xi )|

→ median of the conditional distribution . . . robust regression

Huber loss function: a robust alternative to absolute error loss,

L = ½ (yi − m(xi ))²            if |yi − m(xi )| ≤ δ
L = δ (|yi − m(xi )| − δ/2)     if |yi − m(xi )| > δ

35 We can also derive a quantile loss function: L = α|yi − m(xi )| if yi − m(xi ) > 0, and
L = (1 − α)|yi − m(xi )| otherwise (α: desired quantile)
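A small R sketch of these loss functions and of their negative gradients, which play the role of the working "residuals" in gradient boosting (the value of δ is illustrative; this is a way to check the formulas, not the gbm implementation):

sq_loss    <- function(y, m) 0.5 * (y - m)^2
abs_loss   <- function(y, m) abs(y - m)
huber_loss <- function(y, m, delta = 1) {
  r <- y - m
  ifelse(abs(r) <= delta, 0.5 * r^2, delta * (abs(r) - delta/2))
}
# negative gradients of each loss with respect to m(x)
grad_sq    <- function(y, m) y - m
grad_abs   <- function(y, m) sign(y - m)
grad_huber <- function(y, m, delta = 1) {
  r <- y - m
  ifelse(abs(r) <= delta, r, delta * sign(r))
}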
Emmanuel Flachaire Boosting
Loss functions in regression: A comparison

(Source: Hastie et al., 2009)

→ When robustness is a concern, squared error is not the best criterion

Emmanuel Flachaire Boosting


Loss functions in classification, y ∈ {−1, 1}
Misclassification loss function:36

L = 1(sign[m(x)] ≠ y)

AdaBoost loss function:

L = e^{−y m(x)}

Bernoulli loss function, or Binomial deviance:

L = log(1 + e^{−2y m(x)})

→ Minimizing the AdaBoost or Bernoulli loss functions leads to the
same solution at the population level . . . not in finite sample
→ The Bernoulli loss function is more robust to outliers in finite sample
36
The sign of m(xi ) implies that observations with yi m(xi ) > 0 (< 0) are
classified correctly (misclassified)
Emmanuel Flachaire Boosting
Loss functions in classification: A comparison
(horizontal axis: y·m(x) — misclassified observations to the left, correctly classified to the right)

(Source: Hastie et al., 2009)

→ More weight for obs. more clearly misclassified (large negative ym(x))
→ When robustness is a concern, exponential loss is not the best criterion
Emmanuel Flachaire Boosting
Tuning parameters

The number of trees B. Unlike bagging and random forests,


boosting can overfit if B is too large, although this overfitting
tends to occur slowly if at all.

The number of splits D in each tree, which controls the


complexity of the boosted ensemble. Often D = 1 works well,
in which case each tree is a stump, consisting of a single split.

The shrinkage parameter λ. This controls the rate at which


boosting learns. Typical values are 0.01 or 0.001, and the
right choice can depend on the problem.37

→ We use cross-validation to select B, D and λ

37
Very small λ can require very large B in order to achieve good performance
Emmanuel Flachaire Boosting
Boosting: Interpretation

Single tree are highly interpretable. Linear combinations of


trees must therefore be interpreted in a different way.

Variable importance: using the relative importance of a


variable for a single tree,38 we then average over the trees39

After the most relevant variables have been identified, the


next step is to attempt to understand the nature of the
dependence of the approximation m(X ) on their joint values

Partial dependence plot illustrate the marginal effect of the


selected variables on the response after integrating out the
other variables.
38
The squared relative importance of Xl is the sum of squared improvements
over all internal nodes for which it was chosen as the splitting variable
39
Due to the stabilizing effect of averaging, this measure turns out to be
more reliable than is its counterpart (10.42) for a single tree
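In practice, with the gbm package both quantities are available directly (a sketch, assuming a boosted model fit.bo with nb trees as in the earlier simulated example):

library(gbm)
# relative influence of each covariate, averaged over the B trees
summary(fit.bo, n.trees = nb)
# partial dependence of the fitted model on the first covariate
plot(fit.bo, i.var = 1, n.trees = nb)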
Emmanuel Flachaire Boosting
Application: spam email

The data for this example consists of information from 4601 email
messages, in a study to try to predict whether the email was spam.
The response variable is binary, with values email or spam, and
there are 57 predictors as described below:
48 quantitative predictors: the percentage of words in the email
that match a given word40
6 quantitative predictors: the percentage of characters in the email
that match a given character (; ! # ( [ $)
Uninterrupted sequences of capital letters: average length (CAPAVE),
length of the longest (CAPMAX), sum of the length (CAPTOT)

→ use gradient boosting to design an automatic spam detector that


could filter out spam before clogging the users’ mailboxes

40
Examples include business, address, internet, free, and george. The
idea was that these could be customized for individual users (Hastie et al, 2009)
Emmanuel Flachaire Boosting
Application: spam email

(Source: Hastie et al., 2009)

Emmanuel Flachaire Boosting


Application: partial dependence

(Source: Hastie et al., 2009)

→ effect of Xj on m(X ) after accounting for the average effects of the other variables

Emmanuel Flachaire Boosting


Application: joint frequencies

→ This plot displays strong interaction effects

Emmanuel Flachaire Boosting


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Support Vector Machine (SVM)


Introduction

A method developed in the computer science community in


the 1990s

It uses a basis expansion to capture non-linear class


boundaries

Well suited for classification of complex but small- or


medium-sized datasets

Often considered one of the best ”out of the box” classifiers

Emmanuel Flachaire Support Vector Machine (SVM)


Support Vector Classifier
The separable case

Emmanuel Flachaire Support Vector Machine (SVM)


Classification and hyperplane

Source: James et al. (2013)

A hyperplane separates the space in two halves:

β0 + X1 β1 + X2 β2 > 0 (blue) or < 0 (red)

An ∞ number of hyperplanes, with same classification score


What would make a difference is their capacity to generalize

Emmanuel Flachaire Support Vector Machine (SVM)


The Maximal Margin Classifier
Margins = two parallel separating hyperplanes, located at the
smallest distance from the observations of each class

Margins: the dashed lines

Support vectors: the two blue


points and the purple point
that lie on the margins

Optimal hyperplane: solid line

Source: James et al. (2013)

Principle: Maximize the distance between the two margins


The maximal margin (or optimal) hyperplane is the separating
hyperplane that is farthest from the training observations
Emmanuel Flachaire Support Vector Machine (SVM)
How to find the maximal margin?

It is an optimization problem:

max_{β : ‖β‖=1}  M    subject to   yi (Xi β) ≥ M,  ∀i = 1, . . . , n

Here, ‖β‖ = 1 ensures that the perpendicular distance from the i-th
observation to the hyperplane is given by yi (Xi β). Thus, the
restriction ensures that each observation is on the correct side of
the hyperplane and at least a distance M from the hyperplane.
It is equivalent to:41

min_β  ‖β‖²    subject to   yi (Xi β) ≥ 1,  ∀i = 1, . . . , n

41 We get rid of ‖β‖ = 1 by replacing the restriction with yi (Xi β) ≥ M‖β‖
and by setting ‖β‖ = 1/M, see Hastie et al. (2009, section 4.5.2)
Emmanuel Flachaire Support Vector Machine (SVM)
Sensitivity to individual observations
Adding one blue observation leads to a quite different hyperplane,
with a significant decrease of the distance between the two margins

→ It could be worthwhile to misclassify a few training observations


in order to obtain a better generalization (out-sample classification)

Emmanuel Flachaire Support Vector Machine (SVM)


Support Vector Classifier (SVC)

Why should we consider a classifier that is not a perfect separator?


In the interest of:
greater robustness to individual observations
better classification of the out-sample observations

Underlying principles:42
SVC: maximal margin classifier, tolerating margin violations
Logit: minimize misclassification error

42
Figures: logit ≈ SVC (left), logit=solid line & SVC=dashed line (right)
Emmanuel Flachaire Support Vector Machine (SVM)
How to tolerate margin violations?
It is a slightly modified optimization problem:

max_{β : ‖β‖=1}  M    subject to   yi (Xi β) ≥ M(1 − εi ),

and  εi ≥ 0,   ∑_{i=1}^{n} εi ≤ C,

∀i = 1, . . . , n, where C is a nonnegative tuning parameter.

ε1 , . . . , εn are slack variables that allow observations to be on
the wrong side of the margin (εi > 0) or of the hyperplane (εi > 1)
C is a budget for the amount that the margins can be violated
C = 0: no margin violation is tolerated
as C increases, we become more tolerant of margin violations
C is the maximal number of observations allowed to be on the
wrong side of the hyperplane
In practice, C is a tuning parameter chosen by cross-validation
Emmanuel Flachaire Support Vector Machine (SVM)
Support Vector Classifier vs. Logit model
SVC: the previous optimization problem can be rewritten as:43
min_β  ∑_{i=1}^{n} max[0, 1 − yi (Xi β)]  +  λ ∑_{k=1}^{K} βk²

where the first term is the hinge loss function. It's a minimization
of the hinge loss function with penalization.

Logit: minimizing misclassification, we have:

min_β  ∑_{i=1}^{n} log(1 + e^{−yi (Xi β)})

where the summand is the softmax function. It's a minimization of
the softmax function, with no penalization.

→ SVC ≈ penalized Logit model, using a hinge loss function
→ Role of penalization = tradeoff between min misclassification & max margin
43
With λ = 1/(2C ), see Hastie et al. (2009)
Emmanuel Flachaire Support Vector Machine (SVM)
SVC and Logit: loss function

Source: James et al. (2013) — horizontal axis: yi (Xi β)


Overall, the two loss functions have quite similar behavior
Hinge loss = 0 for obs on the correct side of the margin: yi (Xi β) > 1

Emmanuel Flachaire Support Vector Machine (SVM)


SVC and Logit: The separable case

Left: Logit ≈ SVC with C = 0


Right: Logit ≉ SVC with C > 0 (chosen by cross-validation)
→ SVC = tradeoff between min misclassification & max margin 44

44 max margin: pushing away the obs. as far as possible from the hyperplane;
min misclassification: smallest aggregated distance from the hyperplane of the wrongly classified obs.
Emmanuel Flachaire Support Vector Machine (SVM)
Support Vector Classifier
The non-separable case

Emmanuel Flachaire Support Vector Machine (SVM)


The non-separable case

In the non-separable case, some observations are on the wrong


side of the hyperplane

The Maximal Margin Classifier has no solution

Logit minimizes the aggregated distance from the hyperplane


of the misclassified observations, not the number of misclassifications
SVC is a tradeoff between:
minimizing the aggregated distance from the hyperplane of the
misclassified observations
pushing away as far as possible from the hyperplane the
correctly classified observations

Emmanuel Flachaire Support Vector Machine (SVM)


SVC and Logit: The non-separable case

Left: Logit ≉ SVC with C small

Right: Logit ≉ SVC with C chosen by cross-validation
SVC: 1 mistake - Logit: 3 mistakes
Emmanuel Flachaire Support Vector Machine (SVM)
Support Vector Machine
Nonlinear separability

Emmanuel Flachaire Support Vector Machine (SVM)


Support Vector Machine (SVM)

Many datasets are not linearly separable

Adding polynomial features and interactions can be used

But a low polynomial degree cannot deal with very complex


datasets

The support vector machine (SVM) is an extension of the


support vector classifier that results from enlarging the feature
space in a specific way, using kernels.

SVM works well for complex but small- or medium-sized


datasets

Emmanuel Flachaire Support Vector Machine (SVM)


Moving into higher dimension

Find a SVM classifier to identify teenagers from the height:45

 
Using the projection ϕ : x ↦ ( (x−150)/10 , ((x−150)/10)² ), we obtain:

The data are linearly separable in the 2-dimensional space


45
Source: Internet
Emmanuel Flachaire Support Vector Machine (SVM)
The kernel trick
The data are not linearly separable in the 2-dimensional space, S

The kernel trick: Source: https://freakonometrics.hypotheses.org/52775

The data are linearly separable in the 3-dimensional space, S 0


Emmanuel Flachaire Support Vector Machine (SVM)
SVM: The optimization problem

It is the SVC optimization problem, with transformed covariates:

max_{β : ‖β‖=1}  M    subject to   yi (ϕ(Xi )β) ≥ M,  ∀i

or    min_β  ∑_{i=1}^{n} max[0, 1 − yi (ϕ(Xi )β)]  +  λ ∑_{k=1}^{K} βk²

In the resolution, ϕ only appears in the form ϕ(Xi )ᵀϕ(Xj ). Thus,

we don’t need to express explicitely ϕ


we don’t need to express the higher dimension space S 0

We use a kernel function defined as K(x, x′) = ϕ(x)ᵀϕ(x′)

Emmanuel Flachaire Support Vector Machine (SVM)


Polynomial kernel

The kernel should be a symmetric positive (semi-) definite function.

The dth-degree polynomial kernel is: K(x, x′) = (1 + ⟨x, x′⟩)^d

1st-degree polynomial kernel with two covariates X1 and X2 :46

K(X, X′) = (1 + ⟨X, X′⟩) = (1 + X1 X1′ + X2 X2′)

With ϕ(X) = {1, X1, X2}, we have K(X, X′) = ϕ(X)ᵀϕ(X′).


It corresponds to the linear case, or SVC.

→ SVM with 1st-degree polynomial kernel is similar to SVC

46 With two n-vectors, the inner product is: ⟨x1, x2⟩ = x1ᵀ x2 = ∑_{i=1}^{n} x1i x2i
Emmanuel Flachaire Support Vector Machine (SVM)
Polynomial kernel

The dth-degree polynomial kernel is: K(x, x′) = (1 + ⟨x, x′⟩)^d

2nd-degree polynomial kernel with two covariates X1 and X2 :

K(X, X′) = (1 + ⟨X, X′⟩)² = (1 + X1 X1′ + X2 X2′)²
         = 1 + 2X1 X1′ + 2X2 X2′ + (X1 X1′)² + (X2 X2′)² + 2X1 X1′ X2 X2′

Here, ϕ(X) = {1, √2 X1, √2 X2, X1², X2², √2 X1 X2} defines a
6-dimensional space, with squared and interaction terms

We move from the 3-dimensional space to a 6-dimensional space

→ SVM with dth-degree polynomial kernel (d ≥ 2) is similar to


SVC with additional powers and interaction terms of the covariates
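A quick numerical check of this identity in R (the two vectors are arbitrary): the 2nd-degree polynomial kernel equals the inner product of the expanded features ϕ(X).

phi <- function(x) c(1, sqrt(2)*x[1], sqrt(2)*x[2], x[1]^2, x[2]^2, sqrt(2)*x[1]*x[2])
K   <- function(x, xp) (1 + sum(x * xp))^2   # 2nd-degree polynomial kernel
x  <- c(0.3, -1.2)
xp <- c(2.0,  0.7)
K(x, xp)                 # kernel value: 0.5776
sum(phi(x) * phi(xp))    # same value via the explicit feature map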

Emmanuel Flachaire Support Vector Machine (SVM)


Radial kernel

Radial basis function (RBF) kernel: K(x, x′) = exp(−γ‖x − x′‖²)

γ > 0 accounts for the smoothness of the decision boundary47

It returns values between 0 and 1:
It returns a large value for x close to x′
It returns a small value for x far from x′

It is a similarity measure between two observations

→ The radial kernel has a local behavior

47
Bias-variance tradeoff: large value of γ leads to high variance (overfitting),
small value leads to low variance and smoother boundaries (underfitting)
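A minimal R sketch of this kernel (the value of γ is illustrative), showing its local behavior — the value decays quickly as the two points move apart:

rbf_kernel <- function(x, xp, gamma = 0.5) exp(-gamma * sum((x - xp)^2))
x <- c(0, 0)
rbf_kernel(x, c(0.1, 0.1))  # close points -> value near 1
rbf_kernel(x, c(3, 3))      # distant points -> value near 0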
Emmanuel Flachaire Support Vector Machine (SVM)
Illustration: Simulated data
set.seed(1)
x = matrix(rnorm(200*2), ncol = 2)
x[1:100,] = x[1:100,] + 2
x[101:150,] = x[101:150,] - 2
y = c(rep(1, 150), rep(2, 50))
plot(x[,2], x[,1], pch = 16, col = y*2)

Non-linear decision boundaries → SVC will perform poorly


Emmanuel Flachaire Support Vector Machine (SVM)
Illustration: Fit SVM with polynomial and radial kernels

We can fit a SVM with 2nd-degree polynomial kernel and fixed


cost of constraints violation:
library(e1071)
dat = data.frame(x = x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "polynomial", cost = 1, degree = 2)
plot(svmfit, dat, grid = 200)

Or select the cost parameter by 10-fold CV among several values:


tune.out = tune(svm, y ~ ., data = dat, kernel = "polynomial", degree = 2,
                ranges = list(cost = c(.1, 1, 10, 100)))
plot(tune.out$best.model, dat, grid = 200)
summary(tune.out)

Similarly, we can fit a SVM with radial kernel:48


tune.out = tune(svm, y ~ ., data = dat, kernel = "radial",
                ranges = list(cost = c(.1, 1, 10, 100), gamma = c(.5, 1, 2, 3, 4)))
plot(tune.out$best.model, dat, grid = 200)

48
We then have 2 tuning parameters, the cost of constraints violation and γ
Emmanuel Flachaire Support Vector Machine (SVM)
Illustration: Polynomial vs. radial kernels

Either kernel is capable of capturing the decision boundary


However, the results are different

Emmanuel Flachaire Support Vector Machine (SVM)


ROC curve

With more than 2 covariates, we can’t plot decision boundary

We can produce a ROC curve to analyze the results

SVM doesn't give probabilities of belonging to each class, as logit does

We compute scores of the form fˆ(X ) = ϕ(Xi )β̂ for each obs.
Scores = predicted values.

For any given cutoff t, we can classify observations into a


category, depending on whether

fˆ(X ) < t or fˆ(X ) ≥ t

The ROC curve is obtained by computing the false positive


and true positive rates for a range of values of t

Emmanuel Flachaire Support Vector Machine (SVM)


Illustration: ROC curves
We write a short function to plot a ROC curve:
library(ROCR)
rocplot = function(pred, truth, ...) {
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, ...) }

We can fit a SVM with radial kernel and plot a ROC curve:
set.seed(1)
train = sample(200, 100)
train = sort(train, decreasing = TRUE)  # to avoid a reversed ROC
svmfit = svm(y ~ ., data = dat[train,], kernel = "radial", cost = 1,
             gamma = 0.5)
fit = attributes(predict(svmfit, dat[-train,], decision.values = TRUE))$decision.values
rocplot(fit, dat[-train, "y"], main = "Test Data", col = "red")

We can also fit a Logit model and plot a ROC curve:


lgt = glm(y ~ ., data = dat[train,], family = binomial(link = 'logit'))
fit = predict(lgt, dat[-train,], type = "response")
par(new = TRUE)
rocplot(fit, dat[-train, "y"], col = "green")

Emmanuel Flachaire Support Vector Machine (SVM)


Illustration: ROC curves

As expected in this example, SVM outperforms Logit model

Emmanuel Flachaire Support Vector Machine (SVM)


2. Methods and Algorithms
Ridge and Lasso Regression
Classification and Regression Tree
Bagging and Random Forests
Boosting
Support Vector Machine
Neural Networks and Deep Learning

Emmanuel Flachaire Neural Networks and Deep Learning


Neural networks with one covariate

Emmanuel Flachaire Neural Networks and Deep Learning


Looking for a more flexible model . . .
A linear model may be quite restrictive:

y ≈ α + βx

We can obtain a more flexible model by adding:


successive powers . . . . . . . . . . . . . . . . . . . polynomial regression49
y ≈ α + ∑_{m=1}^{M} βm x^m

nonlinear functions of linear combinations . . neural networks50


y ≈ α + ∑_{m=1}^{M} βm f(αm + δm x)

where f is an activation function – a fixed nonlinear function


49
y ≈ α + β1 x + β2 x 2 + β3 x 3 + . . .
50
y ≈ α + β1 f (α1 + δ1 x) + β2 f (α2 + δ2 x) + β3 f (α3 + δ3 x) + . . .
Emmanuel Flachaire Neural Networks and Deep Learning
Common examples of activation functions
The logistic (or sigmoid) function: f(x) = 1/(1 + e^{−x})
The hyperbolic tangent function: f(x) = tanh(x) = (e^x − e^{−x})/(e^x + e^{−x})
The Rectified Linear Unit (ReLU): f(x) = max(0, x) = (x)₊

Source: Géron (2017)
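These three activation functions are easy to define and inspect directly in R (a small sketch, not part of the slides' code):

sigmoid <- function(x) 1 / (1 + exp(-x))
tanh_f  <- function(x) tanh(x)            # built-in hyperbolic tangent
relu    <- function(x) pmax(0, x)
curve(sigmoid, -4, 4, col = "blue", ylim = c(-1, 2), ylab = "f(x)")
curve(tanh_f, -4, 4, col = "red", add = TRUE)
curve(relu, -4, 4, col = "darkgreen", add = TRUE)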

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network vs. polynomial: A simple example

Let us consider a realistic (simulated) sample


set.seed(1); n = 200
x = sort(runif(n))
y = sin(12*(x + .2))/(x + .2) + rnorm(n)/2
df = data.frame(y, x)

We can fit a polynomial regression with M = 3:


ols = lm(y ~ x + I(x^2) + I(x^3))
plot(x, y, main = "Polynomial: M=3")
lines(x, predict(ols), col = "blue")

We can fit a neural network model with M = 3:


library(neuralnet)
nn = neuralnet(y ~ x, data = df, hidden = 3, threshold = .05)
yfit = compute(nn, data.frame(x))$net.result
plot(x, y, main = "Neural Networks: M=3")
lines(x, yfit, col = "red")

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network vs. polynomial: A simple example
(Figure: observations and fitted curves on the simulated sample, n = 200 — left panel: Polynomial: M=3; right panel: Neural Networks: M=3)

→ Neural networks can capture nonlinearity


Emmanuel Flachaire Neural Networks and Deep Learning
A weighted sum of fixed/adjustable components
(Figure: top row — the fixed components x, x², x³ and their weighted sum α1 + β1x + β2x² + β3x³; bottom row — the adjustable components f(α1 + b1x), f(α2 + b2x), f(α3 + b3x) and their weighted sum α + β1f(α1 + δ1x) + β2f(α2 + δ2x) + β3f(α3 + δ3x))

Emmanuel Flachaire Neural Networks and Deep Learning


Fixed vs. adjustable components
Why neural networks perform better than polynomial regression in
the previous example?
Polynomial regression is based on fixed components, or
bases:51
x, x 2 , x 3 , . . . , x M
Neural network is based on adjustable components, or bases:52

f (α1 + δ1 x), f (α2 + δ2 x), . . . , f (αM + δM x)


Adjustable components have tunable internal parameters
They can express several shapes, not just one (fixed) shape
Each component is more flexible than a fixed component

→ Adjustable components enable to capture complex models with


fewer components (smaller M)
51 y ≈ α + ∑_{m=1}^{M} βm x^m
52 y ≈ α + ∑_{m=1}^{M} βm f(αm + δm x)
Emmanuel Flachaire Neural Networks and Deep Learning
Neural networks with several covariates

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network with several covariates

With a set of covariates X = (1, x1 , x2 , . . . , xk ), we have


y ≈ α + ∑_{m=1}^{M} βm f(αm + X δm)

The nonlinearity of the activation function f is essential,


otherwise it is a simple linear model in X
Combining several nonlinear functions f is essential to capture
interaction effects, M > 1, otherwise it is just a logit model53

By adding nonlinear functions of linear combinations of X , we


obtain a more flexible model, which is able to capture nonlinearity
and interaction effects

53
With M = 1 and the logistic activation function, it is a logit model
Emmanuel Flachaire Neural Networks and Deep Learning
Interaction effects

Adding two nonlinear functions can generate an interaction effect:


y ≈ α + ∑_{m=1}^{2} βm f(αm + x1 δm + x2 γm)

Let us consider α = 0, β1 = −β2 = 1/4, α1 = α2 = 0,
δ1 = δ2 = γ1 = −γ2 = 1 and f(z) = z², we have:

y ≈ 0 + ¼ (0 + x1 + x2)² − ¼ (0 + x1 − x2)²
  ≈ ¼ [(x1 + x2)² − (x1 − x2)²]
  ≈ x1 x2
≈ x1 x2
So the sum of two nonlinear transformations of linear functions can
give us an interaction! Here, we would always get a 2nd-degree
polynomial in X . Other activations do not have such a limitation.
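A quick numerical check of this identity in R (the two input values are arbitrary):

f  <- function(z) z^2
nn <- function(x1, x2) 0.25*f(x1 + x2) - 0.25*f(x1 - x2)   # sum of two squared linear combinations
x1 <- 1.7; x2 <- -0.4
nn(x1, x2)   # -0.68
x1 * x2      # -0.68, the interaction term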
Emmanuel Flachaire Neural Networks and Deep Learning
XOR: Exclusive or (true if its arguments differ)
Diagram of y ≈ α + ∑_{m=1}^{2} βm f(αm + x1 δm + x2 γm)

Source: Géron (2017, p.260)

With the step activation function (=1 if positive, 0 otherwise)


y ≈ −0.5 − I(x1 + x2 > 1.5) + I(x1 + x2 > 0.5)
With (0,0) or (1,1) we have -0.5, with (1,0) or (0,1) we have +0.5

Emmanuel Flachaire Neural Networks and Deep Learning


Neural network with a single hidden layer

Source: James et al. (2021)


Diagram of y ≈ α + ∑_{m=1}^{M} βm f(αm + X δm) with M = 5 neurons
Emmanuel Flachaire Neural Networks and Deep Learning
Ridge regularization & standardization

NN tends to overfit due to large number of coefficients


A solution is to regularize similar to ridge regression:54
Minimize the SSR subject to ∑_{j=1}^{p} θj² ≤ c

The results are sensitive to the scale of the covariates


It is best to standardize covariates before using Neural
Networks, so that they are all on the same scale:
(x − x̄) / √Var(x)

54
θ is the set of coefficients α, β, δ
Emmanuel Flachaire Neural Networks and Deep Learning
Backpropagation algorithm

In 1986, Rumelhart et al. found a way to train neural networks,


with the backpropagation algorithm.55 Today, we would call it a
Gradient Descent using reverse-mode autodiff.

For each training instance:


1 the algorithm first makes a prediction (forward path)
2 measures the error
3 goes through each layer in reverse to measure the error
contribution from each connection (reverse pass)
4 slightly tweaks the connection weights to reduce the error
(Gradient Descent step)

55
Rumelhart et al.: Learning Internal Representations by Error Propagation
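To make the four steps concrete, here is a minimal R sketch of gradient descent for a single-hidden-layer network with one covariate and squared error loss (the initial values, M, the learning rate and the number of iterations are illustrative; real implementations rely on reverse-mode autodiff rather than hand-coded derivatives):

sigmoid <- function(z) 1 / (1 + exp(-z))
set.seed(1)
n <- 200; x <- runif(n)
y <- sin(12*(x + .2))/(x + .2) + rnorm(n)/2
M <- 3; lr <- 0.05                                   # hidden units, learning rate
a <- 0; b <- rnorm(M); am <- rnorm(M); dm <- rnorm(M)
for (it in 1:5000) {
  Z <- outer(x, dm) + matrix(am, n, M, byrow = TRUE) # forward pass
  H <- sigmoid(Z)
  yhat <- a + as.vector(H %*% b)                     # prediction
  e <- yhat - y                                      # measure the error
  # reverse pass: gradients of 0.5*sum(e^2) with respect to each weight
  dH <- (e %o% b) * H * (1 - H)                      # n x M
  grad_a  <- sum(e);        grad_b  <- as.vector(t(H) %*% e)
  grad_am <- colSums(dH);   grad_dm <- colSums(dH * x)
  # gradient-descent step: slightly tweak the weights
  a  <- a  - lr*grad_a/n;   b  <- b  - lr*grad_b/n
  am <- am - lr*grad_am/n;  dm <- dm - lr*grad_dm/n
}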
Emmanuel Flachaire Neural Networks and Deep Learning
Application 1: Mincer equation
library(AER); data("CPS1985")
CPS1985$gender = as.numeric(CPS1985$gender)
library(neuralnet)
nn = neuralnet(log(wage) ~ education + experience + gender, data =
  CPS1985, hidden = 3, threshold = .05)
plot(nn)
(Figure: fitted network diagram — inputs education, experience and gender, one hidden layer with 3 neurons, output log(wage). Error: 49.601795, Steps: 1609)

Emmanuel Flachaire Neural Networks and Deep Learning


Application in classification
Logit model Neural network with 10 units

Source: Hastie, Tibshirani and Friedman (2009), based on simulated data

In classification, the softmax function is applied to the outputs


Emmanuel Flachaire Neural Networks and Deep Learning
Multilayer neural networks

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer neural networks

Even greater flexibility is achieved via composition of activation


functions:
y ≈ α + ∑_{m=1}^{M} βm f( αm^(2) + ∑_{p=1}^{P} f( αp^(1) + X δp^(1) ) δm^(2) )

where the inner layer of activations f(αp^(1) + X δp^(1)) replaces X

The composition of activation functions puts one additional


hidden layer between inputs and outputs → multi-layers NN

A NN with three hidden layers can be obtained by simply


repeating the procedure used to create the two layer basis.

Multilayer neural networks: when a NN has 2 or more hidden layers

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer neural networks

Source: James et al. (2021)

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer neural networks

From a single layer with many neurons to multiple layers with fewer neurons

James et al. (2021):

In theory a single hidden layer with a large number of


units/neurons has the ability to approximate most functions
However, the learning task of discovering a good solution is
made much easier with multiple layers each of modest size
Modern neural networks typically have more than one hidden
layer, and often many units/neurons per layer

Deep Neural Networks = Multilayer Neural Networks

Emmanuel Flachaire Neural Networks and Deep Learning


Application 1: Mincer equation
nn = neuralnet(log(wage) ~ education + experience + gender, data =
  CPS1985, hidden = c(3, 3), threshold = .05)
plot(nn)
nn = neuralnet(log(wage) ~ education + experience + gender, data =
  CPS1985, hidden = c(3, 3, 3), threshold = .05)
plot(nn)
(Figure: fitted network diagrams with hidden = c(3,3) and hidden = c(3,3,3). Left: Error: 47.665751, Steps: 23831; right: Error: 50.041345, Steps: 5234)

Emmanuel Flachaire Neural Networks and Deep Learning


Pattern recognition

Everything is just numbers:

Source: internet, link

An 18x18 pixel image can be seen as an array of 324 numbers that


represent how dark each pixel is (grayscale value in (0, 255))
A vector of these numbers can be used to feed a neural network
Emmanuel Flachaire Neural Networks and Deep Learning
MNIST handwritten digit dataset

Source: James et al. (2021)


Input vector X : p = 28 × 28 = 784 pixels
Output Y : class label Y = (Y0 , Y1 , . . . , Y9 )
60,000 training images and 10,000 test images
Emmanuel Flachaire Neural Networks and Deep Learning
Application 2: Handwritten digit recognition
# Source: section 10.9.2 in James et al. (2021)
library(keras)
# load the MNIST digit data
mnist <- dataset_mnist()
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
# reshape images into matrices
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
y_train <- to_categorical(g_train, 10)
y_test <- to_categorical(g_test, 10)
# rescale to the unit interval
x_train <- x_train / 255
x_test <- x_test / 255
# define the multilayer NN
modelnn <- keras_model_sequential()
modelnn %>%
  layer_dense(units = 256, activation = "relu",
              input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
summary(modelnn)
# add details to the model
modelnn %>% compile(loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
Emmanuel Flachaire Neural Networks and Deep Learning


Application 2: Handwritten digit recognition
# fit the NN with training data
system.time(
  history <- modelnn %>%
    fit(x_train, y_train, epochs = 30, batch_size = 128,
        validation_split = 0.2)
)
plot(history, smooth = FALSE)
# obtain the test error
accuracy <- function(pred, truth)
  mean(drop(pred) == drop(truth))
modelnn %>% predict(x_test) %>% max.col %>% accuracy(g_test + 1)

# fit a multinomial logit as a NN without hidden layer
modellr <- keras_model_sequential() %>%
  layer_dense(input_shape = 784, units = 10,
              activation = "softmax")
summary(modellr)
modellr %>% compile(loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(), metrics = c("accuracy"))
modellr %>% fit(x_train, y_train, epochs = 30, batch_size = 128,
  validation_split = 0.2)
modellr %>% predict(x_test) %>% max.col %>% accuracy(g_test + 1)

You may need to install Keras first:


install.packages("tensorflow")
install.packages("keras")
library(keras)
tensorflow::install_tensorflow()
tensorflow::tf_config()
install_keras()

Emmanuel Flachaire Neural Networks and Deep Learning


Multilayer NN for handwritten digit recognition

Source: James et al. (2021)

NN with 2 hidden layers L1 (256 units) and L2 (128 units)


235,146 coef in the NN and 7,065 in the multinomial logit56
To avoid overfitting, two forms of regularization are used

56 L1 : 785×256=200,960, L2 : 257×128=32,896 and the 10-output layer: 129×10=1,290
Emmanuel Flachaire Neural Networks and Deep Learning
Dropout regularization

Source: James et al. (2021)

New efficient form of regularization, inspired by random forest


Randomly remove a fraction of the units in a layer
In practice, randomly set the dropped out units to zero

Emmanuel Flachaire Neural Networks and Deep Learning


Limitations

Multilayer NN can model complex non-linear relationships


With very complex problems, such as detecting hundreds of types
of objects in high-resolution images, we need to train deeper NN:

perhaps 10 layers, each with hundreds of neurons, connected


by hundreds of thousands of connections
training a fully-connected DNN is very slow
severe risk of overfitting with millions of parameters
gradient problems make lower layers very hard to train

Solutions:

Convolutional Neural Networks (CNN or ConvNets)


Recurrent Neural Networks (RNN)

Emmanuel Flachaire Neural Networks and Deep Learning


Convolutional Neural Networks

Emmanuel Flachaire Neural Networks and Deep Learning


Pattern recognition

The network fails to recognize ’8’ when the letter is not centered
→ translation, scale and (small) rotation invariances are needed
The solution is convolution

Emmanuel Flachaire Neural Networks and Deep Learning


Convolutional Neural Network (CNN or ConvNet)58
Step 1: Break the image into overlapping image tiles and feed
each image tile into a small neural network with the same weights57

→ It amounts to using a sliding window over the entire picture


→ using the same small NN reduces the number of weights
→ same neural networks weights ≡ filter or convolution kernel
57
and the same activation function, ReLU=max(0,input), tanh or sigmoid
58
Source: Adam Geitgey link , Ujjwal Karn link , Andrej Karpathy link
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The convolution step 1
CNN exploit spatially local correlation: each neuron is locally-
connected (to only a small region of the input volume)

Source: Géron (2017)

→ Different values of weights will produce different feature maps


→ The convolution step plays like a filter
→ Different filters can detect different features from an image59
59
as for instances edges, curves, . . .
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The convolution step 1

Source: James et al. (2021)

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: The pooling step 2
Step 2: Reduce the size of the array, using a pooling algorithm.

2x2 pooling layer, no padding


Source: Géron (2017)

The pooling step reduces the dimensionality of each feature map


but retains the most important information60
Pooling can be of different types: Max, Average, Sum etc.
60
It is also called subsampling or downsampling step
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The pooling step 2

The function of pooling is to progressively reduce the spatial size


of the input representation. In particular, pooling:
makes the input representations smaller and more manageable
reduces the number of weights and links in the network,
therefore, controlling overfitting
makes the network invariant to small transformations,
distortions and translations in the input image61
helps us arrive at an almost scale invariant representation of
our image62

61
a small distortion in input will not change the output of Pooling – since we
take the maximum/average value in a local neighborhood
62
This is very powerful since we can detect objects in an image no matter
where they are located
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: The classification step 3
Step 3: Make a final prediction with a fully-connected network

Source: James et al. (2021)

Feature extraction: use even more steps (hidden-layers) to extract


the useful features from the images. The more convolution steps
you have, the more complicated features your network will be able
to learn to recognize.
Classification: The purpose of the Fully Connected layer is to use
the high-level features for classifying the input image into classes
Emmanuel Flachaire Neural Networks and Deep Learning
CNN: Intuitive principle
A hierarchy of representations with increasing level of abstraction:

Source: Internet

• Extract local features that depend on small subregions of the image


• Information from these features are merged to detect higher-order features

→ construction of complex objects from elementary parts


Image recognition: pixel → edge → texton → motif → part → object
Text: character → word → word group → clause → sentence → story
Speech: sample → spectral band → sound → . . . → phoneme → word

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: An example on number recognition

To understand how ConvNet works, play with this animation link

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: Performance in practice

Source: Hastie et al. (2016)


Convolutional Neural Networks outperform other methods
The number of weights in Net-5 is much less than in Net-1
ConvNet has been ”a revolution in Artificial Intelligence”
See the inaugural lesson of Yann LeCun at the College de France, in
English en or in French fr , and the review paper in Nature pdf

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: Detection in complex cases

See this animation link

Emmanuel Flachaire Neural Networks and Deep Learning


CNN: Detection in complex cases

Source: Ren et al. (2016), https://arxiv.org/pdf/1506.01497v3.pdf


Emmanuel Flachaire Neural Networks and Deep Learning
Recurrent Neural Networks

Emmanuel Flachaire Neural Networks and Deep Learning


Nature of the data

Many data sources are sequential in nature:

In text analysis, the sequence and relative position of words


capture the narrative, theme and tone → Document
classification, sentiment analysis and language translation

Time series of temperature, rainfall, wind speed, air quality


and so on → Weather forecast

Time series of market indices, stock and bond prices and


exchange rates → Financial forecasting

In Recurrent Neural Network, the input object X is a sequence

Emmanuel Flachaire Neural Networks and Deep Learning


Recurrent Neural Networks (RNN)

Neural network with a single hidden layer, for t = 1, . . . , T :

yt ≈ α + At β

At = [f (α1 + Xt δ1 ), . . . , f (αM + Xt δM )]
A linear combination of a nonlinear fct of linear combinations of Xt
Recurrent neural network:
Each time series provides many short mini-series of input
sequences X = {X1 , . . . , XL } of L periods, and a target Y

At = [f (α1 + Xt δ1 + At−1 γ1 ), . . . , f (αM + Xt δM + At−1 γM )]

Identical weights for each sequence: α, δ, γ independent of t


A form of weight sharing similar to the use of filters in CNN

Emmanuel Flachaire Neural Networks and Deep Learning


Recurrent Neural Networks

Source: James et al. (2021)


Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Number of lags, training and test data

Recurrent neural network with one hidden layer, for t = 1, . . . , T :

yt ≈ α + At β

At = [f (α1 + Xt δ1 + At−1 γ1 ), . . . , f (αM + Xt δM + At−1 γM )]

Past values of yt and other covariates can be used in Xt


Select a number of lags L with care, perhaps using CV
Extract many short series of (y , X ) with a predefined length L
Each short series can be used to predict one value yt
The training data consists of n separate series of length L
The test data consists of the remaining series of length L
Find the set of coefficients minimizing the SSR (subject to a
constraint) based on the training and test data
Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Historical trading time series on the NYSE

Source: James et al. (2021)

→ forecast (log) trading volume over 1980–86 based on past history

Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Forecast trading volume based on past history

Source: James et al. (2021)


Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Autocorrelation function

Source: James et al. (2021)

T = 6051, L = 5, so 6046 short series (y , X ) are available


fit the model with 12 neurons and using 4281 training series
forecast 1765 values after January 2, 1980
Emmanuel Flachaire Neural Networks and Deep Learning
RNN: Forecast of log trading volume on the NYSE

See section 10.9.6 in James et al. (2021) for details of the implementation in R
Emmanuel Flachaire Neural Networks and Deep Learning
RNN and AR models

Recurrent neural network with a single hidden layer:

yt ≈ α + At β

At = [f (α1 + Xt δ1 + At−1 γ1 ), . . . , f (αM + Xt δM + At−1 γM )]

Lag of the dependent variable yt−1 can be used in Xt


With M = 1, f linear and Xt = yt−1 , we have an AR(L):

yt ≈ β0 + yt−1 β1 + · · · + yt−L βL

RNN and AR models have much in common


By combining nonlinear functions (M > 1 and f nonlinear),
RNN add more flexibilty → nonlinear and interaction effects

Emmanuel Flachaire Neural Networks and Deep Learning


LSTM: Long and Short Term Memory model

My son is a manga fan, so our next holiday will be in . . .

RNN don’t predict Japan, since it doesn’t remember manga


RNN main limitation: short term memory
Solution: Combine 2 hidden layers, one with short memory
and the other one with longer memory
LSTM combine a long-term state c and a short-term state h

Emmanuel Flachaire Neural Networks and Deep Learning


LSTM vs. RNN

Emmanuel Flachaire Neural Networks and Deep Learning


LSTM

Géron (2017)

c: drop some memories ⊗ and add some new memories ⊕


Emmanuel Flachaire Neural Networks and Deep Learning
3. Using ML methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Using ML methods in Econometrics


General Principle

Machine Learning: solve the optimization problem


Minimize_m   ∑_{i=1}^{n} L(yi , m(Xi ))   +   λ ‖m‖ℓq
                 (loss function)              (penalization)

Choice of the loss function:


L → conditional mean, quantiles, classification
m → linear, splines, tree-based models, neural networks

Choice of the penalization:


`q → lasso, ridge
λ → over-fitting, under-fitting, cross validation

Emmanuel Flachaire Using ML methods in Econometrics


Ridge and Lasso

Minimize_β   ∑_{i=1}^{n} (yi − Xi β)²   +   λ ∑_{j=2}^{p} |βj |^q

It is equivalent to minimize the SSR subject to ∑_{j=2}^{p} |βj |^q ≤ c

The constraint restricts the magnitude of the coefficients


It shrinks the coefficients towards zero as c ↘ (or λ ↗)
Add some bias if it leads to a substantial decrease in variance
q = 2: Ridge, β̂ = (XᵀX + λIn )⁻¹ Xᵀy is defined even with p ≫ n
q = 1: Lasso sets some coef exactly to 0, variable selection

→ High-dimensional problems (p ≫ n)

Emmanuel Flachaire Using ML methods in Econometrics


Random Forest, Boosting, Deep learning

Minimize_m   ∑_{i=1}^{n} (yi − m(Xi ))²   +   λ ∫ m″(x)² dx

It is equivalent to minimize the SSR subject to ∫ m″(x)² dx ≤ c

A fully nonparametric model: y ≈ m(X1 , . . . , Xp )


The constraint restricts the flexibility of m
Choice of m: Random forest, boosting or deep learning
Similar to nonparametric econometrics (splines)
Appropriate with many covariates (no curse of dimensionality)

→ Complex functional form

Emmanuel Flachaire Using ML methods in Econometrics


Why and how to use ML methods in Econometrics?

Pros:
High-dimensional problems
Complex functional forms

However,
Black-box models
Prediction is not causation

Emmanuel Flachaire Using ML methods in Econometrics


3. Using ML methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Misspecification detection


A major criticism to econometrics

Léo Breiman (Statistical Science, 2001):

. . . an uncritical use of data models.

Emmanuel Flachaire Misspecification detection


Misspecification can lead to wrong conclusions

Let us assume that the true regression function is:

y = β0 + β1 x + β3 x 3 + ε (5)

A parametric test of the following hypotheses:

H0 : y = β0 + β1 x + ε vs. H1 : y = β0 + β1 x + β2 x 2 + ε

may not reject the null, since β2 = 0 is true in (5)

To the opposite, a test statistic based on

H0 : y = β0 + β1 x + ε vs. H1 : y = m(x) + ε

would likely reject the null

A nonparametric model is more appropriate under H1

Emmanuel Flachaire Misspecification detection


How machine learning tools may help econometrics?

Parametric model:
y = Xβ + ε
Fully-nonparametric model:

y = m(X ) + ε

Is the parametric regression model correctly specified?


If no, ML methods should outperform OLS estimation
If yes, ML methods should not outperform OLS estimation

ML can be used to detect misspecification

Emmanuel Flachaire Misspecification detection


Application 1: Boston housing prices

Boston housing dataset: 14 variables (2 dummies), 506


observations63
OLS in a linear regression model, p=13

medv = X β + ε

Lasso with squares, cubes and pairwise interactions, p=117

medv = X β1 + X 2 β2 + X 3 β3 + (X :X )β4 + ε

Random Forest and Boosting in a nonparametric model, p=13

medv = m(X ) + ε

We compute the MSE by 10-fold Cross-Validation

63
X = [chas,nox,age,tax,indus,rad,dis,lstat,crim,black,rm,zn,ptratio]
Emmanuel Flachaire Misspecification detection
Application 1: Boston housing prices
library(MASS); library(randomForest); library(gbm); library(glmnet)
data(Boston); nobs = nrow(Boston)
set.seed(12345); nfold = 10
Kfold = cut(seq(1, nobs), breaks = nfold, labels = FALSE)
mse.test = matrix(0, nfold, 4)
# generate X^2, X^3 and pairwise interactions for the Lasso
Xcol = colnames(Boston)[-14]
Xsqr = paste0("I(", Xcol, "^2)", collapse = "+")  # squared covariates
Xcub = paste0("I(", Xcol, "^3)", collapse = "+")  # cubic covariates
fmla = paste0("medv~(.)^2+", Xsqr, "+", Xcub)
X = model.matrix(as.formula(fmla), data = Boston)[, -1]
y = Boston[, 14]
mysample = sample(1:nobs)  # random sampling (permutation)
for (i in 1:nfold) {  # K-fold CV
  cat("K-fold loop:", i, "\r")
  test = mysample[which(Kfold == i)]
  train = mysample[which(Kfold != i)]
  # OLS, Lasso, Random Forest, Boosting
  fit.lm <- lm(medv ~ ., data = Boston, subset = train)
  fit.la <- cv.glmnet(X[train,], y[train], alpha = 1)
  fit.rf <- randomForest(medv ~ ., data = Boston, subset = train, mtry = 6)
  fit.bo <- gbm(medv ~ ., data = Boston[train,], distribution = "gaussian",
                interaction.depth = 6)
  # out-sample MSE
  mse.test[i, 1] = mean((Boston$medv - predict(fit.lm, Boston))[-train]^2)
  mse.test[i, 2] = mean((y - predict(fit.la, X, s = "lambda.min"))[-train]^2)
  mse.test[i, 3] = mean((Boston$medv - predict(fit.rf, Boston))[-train]^2)
  mse.test[i, 4] = mean((Boston$medv - predict(fit.bo, Boston))[-train]^2)
}
mse = colMeans(mse.test)  # test error
round(mse, digits = 2)
[1] 23.93 14.88 10.16 10.34

Emmanuel Flachaire Misspecification detection


Application 1: Boston housing prices

Boston housing dataset:64


                  ols    lasso (x², x³, int)    r.forest    boosting
MSE (10-fold CV)  23.93         14.88             10.16       10.34

Random Forest and Boosting show impressive improvement


over OLS, in terms of predictive performance
ML models are known to capture complex functional forms
It suggests that the parametric model lacks important
nonlinear and/or interaction effects
Lasso provides substantial improvement over OLS, but still
performs worse than Random Forest and Boosting. It
suggests that some nonlinearities are still not well captured.
64
14 variables (2 dummies), 78 pairwise interactions, 506 observations
Emmanuel Flachaire Misspecification detection
GamLa: An econometric model for interpretable ML
A partially linear model:

y = g1 (X1 ) + . . . + gp (Xp ) + Z γ + ε

with Z a matrix of pairwise interactions Z = (X1 X2 , . . . , Xq−1 Xq ).


The marginal effect is:
∂y/∂Xj = g′j (Xj ) + c

where c is a constant term which depends on the other covariates.


Combine non-linearity in Xj and linear pairwise interactions
The linearity assumption on interaction effects represents the
price to pay to keep the model interpretable.

→ GamLa = GAM + variable selection (Lasso, Autometrics)65


65
Flachaire, Hacheme, Hué, Laurent (2022)
Emmanuel Flachaire Misspecification detection
GamLa: An econometric model for interpretable ML

A partially linear model:

y = g1 (X1 ) + . . . + gp (Xp ) + Z γ + ε

Estimation based on the Double Residuals (DR) method:


1 GAM of y on X1 , . . . , Xp : compute the residuals η̂y
2 GAM of Zj on X1 , . . . , Xp : compute the residuals η̂zj , ∀j
3 LASSO of η̂y on η̂z1 , . . . , η̂zl → obtain γ̂
An application of FWL to semiparametric regression models

Robinson (1988) shows that with DR γ̂ols is √n-consistent,
even if ĝ1 (X ), . . . , ĝp (X ) are consistent at slower rates
Flachaire, Hacheme, Hué and Laurent (2022) show that using
the DR approach is crucial to select correctly the
interactions66
66
So don’t use the gamlasso function in the R package plsmselect!
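A hedged sketch of the three DR steps with mgcv and glmnet (the covariate names X1, X2, X3, the interaction matrix Z and the data frame df are placeholders; this illustrates the procedure, not the authors' code, and uses a cross-validated λ for simplicity):

library(mgcv); library(glmnet)
# Step 1: GAM of y on X1,...,Xp -> residuals eta_y
g_y   <- gam(y ~ s(X1) + s(X2) + s(X3), data = df)
eta_y <- residuals(g_y)
# Step 2: GAM of each interaction Zj on X1,...,Xp -> residuals eta_zj
eta_Z <- apply(Z, 2, function(z)
  residuals(gam(z ~ s(X1) + s(X2) + s(X3), data = df)))
# Step 3: Lasso of eta_y on the eta_zj -> selected interactions and gamma_hat
cvfit     <- cv.glmnet(eta_Z, eta_y, alpha = 1)
gamma_hat <- coef(cvfit, s = "lambda.min")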
Emmanuel Flachaire Misspecification detection
Application 1: Boston housing prices

Boston housing dataset:


                  ols    lasso (x², x³, int)    r.forest    gamla
MSE (10-fold CV)  23.93         14.88             10.16      9.73

GamLa shows impressive improvement over OLS, in terms of


predictive performance
GamLa performs as well as Random Forest and Boosting67
It suggests that parametric models are outperformed by ML
models when they lack important nonlinear and/or interaction
effects only

67
Model Confidence Set (MCS) test can be used to test if the MSE are
significantly different (Hansen, Lunde and Nason 2011) pdf Pairwise AUC
can be used in classification (Candelon, Dumitrescu and Hurlin 2012) pdf
Emmanuel Flachaire Misspecification detection
Conclusion

Many results report that ML outperform parametric models in


terms of predictive performance

ML models outperform standard parametric model ... which


are not well-specified!
ML methods can help to detect and correct misspecification in
parametric regression
Parametric models can perform as well as ML models!

Emmanuel Flachaire Misspecification detection


3. Using ML methods in Econometrics
Misspecification detection
Causal inference

Emmanuel Flachaire Causal Machine learning


Prediction is not causation

Kleinberg et al. (2015) Prediction policy problems


Many policy applications where causal inference is not central
Hip or knee replacement: costly, painful, recovery takes time
Policy decision: predicting the riskiest patients68

Athey (2017) Beyond prediction: Using big data for policy problems
Pure prediction methods are not helpful for causal problems
Which patients should be given priority to receive surgery?
Estimating heterogeneity in the effect of surgery is required

68
ML is used to predict the probability that a candidate would die within a year from other causes, identifying high-risk patients who should not receive surgery
Emmanuel Flachaire Causal Machine learning
High-dimensional parametric framework

Emmanuel Flachaire Causal Machine learning


Inference on target regression coefficients

Our main concern is the estimation and inference on α in a


high-dimensional framework:

y = dα + X β + ε

d is a target regressor, such as a treatment, policy or other variable of interest


X may contain many variables, only a few of which are important
Under sparsity, a variable selection method is used in a first step
Since Lasso shrinks coefficients towards zero, the coefficients are biased
This bias is corrected using an additional (unrestricted) estimation

→ Post-selection estimation and inference

Emmanuel Flachaire Causal Machine learning


The problem of post-selection inference

Single Selection: OLS of y on d and the selected variables X ∗

y = dα + X ∗ β + ε

Unbiased α̂ ... but only if the true model is selected!


Problem: mistakes from the variable selection can introduce
omitted variable bias
a covariate Xj strongly correlated with d but without a strong effect on y may be omitted in the variable selection process

Ignoring variable selection uncertainty may be misleading

→ Naive post-selection estimation may be biased

Emmanuel Flachaire Causal Machine learning


Post-selection inference: Double selection
Our main concern is estimation and inference on α in
y = dα + X β + ε
Double Selection:69
1 Lasso of y on X : select variables important to predict y
2 Lasso of d on X : select variables important to predict d
3 OLS of y on d and the union of the selected variables

y = dα + X ∗∗ β + ε

Idea: give a second chance to variables omitted in the first Lasso


α̂ is immunized against variable selection mistakes
→ valid post-selection inference in high-dimensions

69
Belloni, Chernozhukov and Hansen (2014) pdf Uniformly valid confidence
set for α despite imperfect model selection, and full efficiency for estimating α
Emmanuel Flachaire Causal Machine learning
Post-selection inference: Partialling out

Our main concern is estimation and inference on α in


y = dα + X β + ε
Partialling out:
1 Lasso of y on X : compute the residuals η̂y
2 Lasso of d on X : compute the residuals η̂d
3 OLS of η̂y on η̂d (double residuals approach)

η̂y = η̂d α + ε

Idea: an application of the Frisch-Waugh-Lovell theorem70

→ Partialling out and double selection are quite similar71


70
But α̂ols is different in the two models due to lasso variable selections
71
From the FWL theorem, the double selection estimator of α is equal to the OLS estimator from regressing the residuals of y on X∗∗ on the residuals of d on X∗∗.
Emmanuel Flachaire Causal Machine learning
Threshold selection: Rigorous Lasso

The choice of the penalization parameter λ is crucial


The optimal λ for prediction and for estimation are different
CV targets prediction and lacks theoretical foundations
A theoretically grounded and feasible selection for estimation (illustrated in the sketch below):72

λ = 2c√n σ̂ Φ⁻¹(1 − γ/(2p))

in the case of homoskedasticity


Another selection rule is proposed for the heteroskedastic case

72
See Belloni, Chernozhukov and Hansen (2014) pdf c = 1.1 for
post-Lasso and c = 0.5 for Lasso, γ = .1 by default
Emmanuel Flachaire Causal Machine learning
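As a small numerical illustration, the plug-in rule above can be coded directly; the function name is ours and the inputs are arbitrary, with c = 1.1 and γ = 0.1 taken from the footnote.

# Hypothetical helper computing the plug-in ("rigorous") penalty level under
# homoskedasticity; defaults c = 1.1 and gamma = 0.1 follow the footnote above.
rigorous_lambda <- function(n, p, sigma_hat, c = 1.1, gamma = 0.1) {
  2 * c * sqrt(n) * sigma_hat * qnorm(1 - gamma / (2 * p))
}
# e.g. with n = 90 observations, p = 63 candidate controls and sigma_hat = 0.05:
rigorous_lambda(n = 90, p = 63, sigma_hat = 0.05)

In practice, the rlasso functions of the hdm package used below implement this penalty choice, as well as its heteroskedasticity-robust version, internally.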
Bias of naive post-selection estimation

Source: Belloni, Chernozhukov, Hansen (2014)

Emmanuel Flachaire Causal Machine learning


Application 1: Do poor countries catch up with rich countries?

We are interested in the convergence hypothesis α < 0 in

y = dα + X β + ε

where y is the growth rate of GDP, d is the initial level of GDP


and X contains many country characteristics

The parameter of interest is α


We test the null hypothesis H0 : α = 0
If H0 is rejected and α < 0: evidence of catch-up effect
Covariate selection is crucial, since p = 63 and n = 90
We use double selection and partialling out with rigorous Lasso
Implementation is done with the R package hdm73
73
see the vignette of the hdm package in R pdf

Emmanuel Flachaire Causal Machine learning


Application 1: Do poor countries catch up with rich countries?

1  library(hdm)
2  data("GrowthData")   # the 2nd column is a vector of ones
3  y = as.matrix(GrowthData)[, 1, drop=F]
4  d = as.matrix(GrowthData)[, 3, drop=F]
5  X = as.matrix(GrowthData)[, -c(1, 2, 3), drop=F]
6  # fit models
7  LS.fit = lm(y ~ d + X)
8  PO.fit = rlassoEffect(X, y, d, method="partialling out")
9  DS.fit = rlassoEffect(X, y, d, method="double selection")
10 # inference on coef of interest
11 LS = summary(LS.fit)$coefficients[2, ]
12 PO = summary(PO.fit)$coefficients[1, ]
13 DS = summary(DS.fit)$coefficients[1, ]
14 rbind(ols=LS, double.selection=DS, partialling.out=PO)

15                   Estimate      Std. Error  t value   Pr(>|t|)
16 ols               -0.009377989  0.02988773  -0.31377  0.75601
17 double.selection  -0.050005855  0.01579138  -3.16665  0.00154
18 partialling.out   -0.049811465  0.01393636  -3.57420  0.00035

Emmanuel Flachaire Causal Machine learning


Application 1: Do poor countries catch up with rich countries?

Inference on the parameter of interest α:


                   Estimate   Std.Error   t value    Pr(>|t|)
OLS                -0.00938   0.02989     -0.31377   0.75601
Double selection   -0.05001   0.01579     -3.16665   0.00154
Partialling out    -0.04981   0.01394     -3.57420   0.00035

H0 : α = 0 not rejected with OLS (large standard error)74


H0 : α = 0 rejected with double selection and partialling out
- more precise estimate (smaller standard error)
- greater magnitude of the coefficient
Poor countries tend to catch up with rich countries!
Note that Single Selection (naive post-selection) sets α̂ = 0:
15 rlasso(y ~ d + X, post=TRUE)$coefficients[2]

74
It is not surprising given that p = 63 is comparable to n = 90.
Emmanuel Flachaire Causal Machine learning
Heterogeneous treatment effects: high-dimensions
If d is a treatment, we can consider heterogeneous effects as

y = dα(X ) + g (X ) + ε

where α(X ) and g (X ) are approximated by linear combinations of


X or transformations of X , α(X ) = Z1 β and g (X ) = Z2 γ.75
The regression can be rewritten: y = dZ1 β + Z2 γ + ε
Several parameters of interest in β
Double Selection:
1 Lasso of y on Z2 : select variables important to predict y
2 Lasso of each interaction dZ1 on Z2 : select important variables
3 OLS of y on d and the union of the selected variables

→ assess heterogeneity with many determinants


75
Z1 and Z2 may include powers, b-splines, or interactions of X
Emmanuel Flachaire Causal Machine learning
Application 2: The effect of gender on wage

Several parameters of interest:

y = dα + dX β + Z γ + ε
y is the log of the wage, d is a dummy for female
dX are the interactions between d and each covariate in X
Z includes 2-way interactions of the covariates, Z = [X, X:X]
The target variable is female d, in combination with other
variables dX
Our main interest is to make inference on α and β
If β = 0: homogeneous wage gender gap given by α
If β ≠ 0: heterogeneous wage gender gap explained by X

Data: US Census in 2012, with p = 116 and n = 29,217 observations76

76
for a recent application see Bach, Chernozhukov and Spindler (2021) pdf

Emmanuel Flachaire Causal Machine learning


Application 2: The effect of gender on wage

1 library(hdm)
2 data(cps2012)
3 y <- cps2012$lnw
4 X <- model.matrix(~ -1 + female + female:(widowed + divorced +
      separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so + we +
      exp1 + exp2 + exp3) + (widowed + divorced + separated + nevermarried
      + hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3)^2,
      data = cps2012)
5 X <- X[, which(apply(X, 2, var) != 0)]   # exclude constant variables
6 index.gender <- grep("female", colnames(X))
7 effects.female <- rlassoEffects(x=X, y=y, index=index.gender)
8 summary(effects.female)

Generic approach to generate all covariates:


9  Xcol = colnames(cps2012)[4:18]
10 dcol = colnames(cps2012)[3]
11 Xvar = paste(Xcol, collapse = "+")
12 Xint = paste("(", paste(Xcol, collapse="+"), ")^2", sep="")
13 fmla = paste("~-1+", dcol, "+", dcol, ":(", Xvar, ")+", Xint, sep="")
14 X <- model.matrix(as.formula(fmla), data=cps2012)

Emmanuel Flachaire Causal Machine learning


Application 2: The effect of gender on wage
15 > summary(effects.female)
16 [1] "Estimates and significance testing of the effect of target variables"
17                       Estimate.  Std. Error  t value  Pr(>|t|)
18 female               -0.154923    0.050162   -3.088   0.002012 **
19 female:widowed        0.136095    0.090663    1.501   0.133325
20 female:divorced       0.136939    0.022182    6.174   6.68e-10 ***
21 female:separated      0.023303    0.053212    0.438   0.661441
22 female:nevermarried   0.186853    0.019942    9.370   < 2e-16 ***
23 female:hsd08          0.027810    0.120914    0.230   0.818092
24 female:hsd911        -0.119335    0.051880   -2.300   0.021435 *
25 female:hsg           -0.012890    0.019223   -0.671   0.502518
26 female:cg             0.010139    0.018327    0.553   0.580114
27 female:ad            -0.030464    0.021806   -1.397   0.162405
28 female:mw            -0.001063    0.019192   -0.055   0.955811
29 female:so            -0.008183    0.019357   -0.423   0.672468
30 female:we            -0.004226    0.021168   -0.200   0.841760
31 female:exp1           0.004935    0.007804    0.632   0.527139
32 female:exp2          -0.159519    0.045300   -3.521   0.000429 ***
33 female:exp3           0.038451    0.007861    4.891   1.00e-06 ***

→ smaller gender wage gap for nevermarried or divorced women

Emmanuel Flachaire Causal Machine learning


Non-parametric framework

Emmanuel Flachaire Causal Machine learning


Homogeneous treatment effects: Partially linear model

Partially Linear Regression (PLR) model


y = dα + g (X ) + ε
d = h(X ) + η
α is the target parameter, g and h are nuisance functions77
Naive ML approach:
1 ML of y − d α̂ on X → obtain ĝ (X )
2 OLS of y − ĝ (X ) on d → obtain α̂
Initialize with α̂ = 0 and iterate until convergence
However, α̂ is biased, because ĝ is not a good estimate of g 78

77
h may seem redundant; it is the propensity score in the TE literature
78
Since E(y|X) ≠ g(X), an ML fit of y on X is not a good estimate of g
Emmanuel Flachaire Causal Machine learning
Homogeneous treatment effects: Partially linear model
Partially Linear Regression (PLR) model
y = dα + g (X ) + ε
d = h(X ) + η
α is the target parameter, g and h are nuisance functions
Double Residuals (DR):
1 ML of y on X : compute residuals η̂y = y − ĝ (X )
2 ML of d on X : compute residuals η̂d = d − ĥ(X )
3 OLS of η̂y on η̂d → α̂
An application of FWL, or partialling out, with ML methods

Robinson (1988) shows that with DR α̂ is √n-consistent, even if ĝ(X) and ĥ(X) are consistent at slower rates79
The role of DR is to immunize α̂ against errors in the ML estimates: α̂ is based on the residuals η̂y and η̂d, which are ⊥ to ĝ(X) and ĥ(X) (a hand-rolled sketch follows below)
79
Robinson considers kernel regression. Chernozhukov et al. (2018) pdf
establish that any ML method can be used, so long as it is n^(1/4)-consistent
Emmanuel Flachaire Causal Machine learning
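A hand-rolled sketch of the DR idea with a random forest learner is given below; the simulated data, the choice of learner (ranger) and the use of out-of-bag predictions are illustrative assumptions, not the procedure of any particular paper.

# Minimal Double Residuals (DR) sketch with random forests (ranger) on
# simulated data; out-of-bag predictions are used as a crude guard against
# overfitting (proper Double ML adds cross-fitting, see next slide).
library(ranger)
set.seed(42)
n <- 2000
X <- matrix(runif(n * 5), n, 5)
d <- sin(2 * pi * X[, 1]) + rnorm(n, sd = 0.5)       # d = h(X) + eta
y <- 0.5 * d + X[, 2]^2 + exp(X[, 3]) + rnorm(n)     # true alpha = 0.5
dX <- data.frame(X)                                  # columns X1,...,X5

fit_y <- ranger(y ~ ., data = data.frame(y = y, dX)) # Step 1: ML of y on X
eta_y <- y - fit_y$predictions                       # residuals (OOB)
fit_d <- ranger(d ~ ., data = data.frame(d = d, dX)) # Step 2: ML of d on X
eta_d <- d - fit_d$predictions                       # residuals (OOB)
summary(lm(eta_y ~ eta_d))                           # Step 3: OLS -> alpha_hat

The coefficient on η̂d should be close to the true α = 0.5 here; a regression of y on d and X by OLS, or the naive ML approach above, would not enjoy the same protection.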
The role of double residuals (orthogonalization)
Distribution of α̂ − α0

Source: Chernozhukov et al. (2018)

Non-orthogonal ≡ naive ML Orthogonal ≡ Double Residuals

Emmanuel Flachaire Causal Machine learning


Homogeneous treatment effects: Partially linear model
Partially Linear Regression (PLR) model
y = dα + g (X ) + ε
d = h(X ) + η
α is the parameter of interest, g and h are nuisance functions
Cross-fitting: split the sample into an auxiliary and a main sample
1 ML estimation of g (X ), h(X ) on auxiliary sample
2 Double Residuals estimation of α by OLS on main sample
3 Flip the roles of both samples and average the results: α̂ = (α̂1 + α̂2)/2
Estimate nuisance functions and target parameter on different samples
Chernozhukov et al. (2018) show that cross-fitting is crucial to avoid overfitting

→ PLR: Double ML = Double Residuals + Cross-fitting (see the two-fold sketch below)

Emmanuel Flachaire Causal Machine learning
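A two-fold version of the previous sketch, adding the sample-splitting step described above; again, the data-generating process and learner choices are purely illustrative.

# Two-fold cross-fitting on top of Double Residuals (illustrative sketch).
library(ranger)
set.seed(7)
n <- 2000
X <- matrix(runif(n * 5), n, 5)
d <- sin(2 * pi * X[, 1]) + rnorm(n, sd = 0.5)
y <- 0.5 * d + X[, 2]^2 + rnorm(n)                    # true alpha = 0.5
dX <- data.frame(X)

fold  <- sample(rep(1:2, length.out = n))             # split the sample in two
alpha <- numeric(2)
for (k in 1:2) {
  aux  <- fold != k                                   # estimate nuisances here
  main <- fold == k                                   # estimate alpha here
  g_hat <- ranger(y ~ ., data = data.frame(y = y, dX)[aux, ])
  h_hat <- ranger(d ~ ., data = data.frame(d = d, dX)[aux, ])
  eta_y <- y[main] - predict(g_hat, dX[main, ])$predictions
  eta_d <- d[main] - predict(h_hat, dX[main, ])$predictions
  alpha[k] <- coef(lm(eta_y ~ eta_d))["eta_d"]
}
mean(alpha)                                           # cross-fitted estimate of alpha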


Heterogeneous treatment effects: Fully nonparametric
Interactive Regression Model (IRM)
y = m(d, X ) + ε
d = h(X ) + η
d not additively separable → very general heterogeneity in TE
Parameter of interest: ATE = E[y1 − y0 ] 80

The estimator needs to satisfy a Neyman-orthogonality condition with respect to the nuisance functions (≡ DR in the PLR)
So the estimator and inference are robust to small mistakes in the nuisance functions
The AIPW estimator turns out to satisfy this ⊥ condition:
 
ATE = E[ m(1, X) − m(0, X) + D(Y − m(1, X))/h(X) − (1 − D)(Y − m(0, X))/(1 − h(X)) ]

This estimator is doubly robust to small mistakes in m̂ and ĥ (a short R sketch of this score follows below)


80
The observed outcome is with or without treatment: y = y1 d + y0 (1 − d)
Emmanuel Flachaire Causal Machine learning
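The AIPW score above can be written as a short R function; m1, m0 and h are assumed to be (cross-fitted) predictions of m(1, X), m(0, X) and of the propensity score obtained beforehand, and the standard error uses the usual influence-function approximation.

# Sketch of the AIPW (doubly robust) score; m1, m0 and h are assumed to be
# cross-fitted predictions of m(1,X), m(0,X) and of the propensity score,
# D and Y are the treatment and outcome vectors; trimming is omitted.
aipw_ate <- function(Y, D, m1, m0, h) {
  psi <- m1 - m0 + D * (Y - m1) / h - (1 - D) * (Y - m0) / (1 - h)
  c(ATE = mean(psi), se = sd(psi) / sqrt(length(psi)))
}

This is essentially the orthogonal score that the IRM model of the DoubleML package, used in Application 3 below, averages over the cross-fitting folds.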
Heterogeneous treatment effects: Fully nonparametric

Interactive Regression Model (IRM)


y = m(d, X ) + ε
d = h(X ) + η
d not additively separable → very general heterogeneity in TE
Double Machine Learning:81
1 Neyman orthogonal condition → AIPW estimator
2 Cross-fitting → ATE and m, h estimated from different samples

ATE estimation and inference with good properties


However, no detection and analysis of heterogeneity
→ IRM: Double ML = AIPW + Cross-fitting

81
Chernozhukov et al. (2018) pdf and Chernozhukov et al. (2017) pdf

Emmanuel Flachaire Causal Machine learning


Application 3: Insurance bonus on employment duration

RCT to investigate the incentive effect of unemployment


insurance (UI) bonus on unemployment duration:
Individuals in the treatment groups were offered a cash bonus if they
found a job within some pre-specified period of time (qualification
period), provided that the job was retained for a specified duration

y is the log of duration of unemployment for the UI claimants


ATE estimation and inference in PLR and IRM models
Pennsylvania Reemployment Bonus data set
Implementation is done with the R package DoubleML83

83
See vignette and Bach, Chernozhukov, Kurz, Spindler (2021) pdf
Emmanuel Flachaire Causal Machine learning
Application 3: ATE in a PLR model
1  library(DoubleML)
2  library(mlr3)
3  # Initialization of the Data-Backend
4  data = fetch_bonus(return_type="data.table")
5  y = "inuidur1"
6  d = "tg"
7  x = c("female", "black", "othrace", "dep1", "dep2", "q2", "q3", "q4",
        "q5", "q6", "agelt35", "agegt54", "durable", "lusd", "husd")
8  dml_data = DoubleMLData$new(data, y_col=y, d_cols=d, x_cols=x)
9  # Initialization of the PLR Model
10 set.seed(31415)   # required to replicate sample split
11 learner_g = lrn("regr.ranger", num.trees=500, min.node.size=2,
        max.depth=5)   # Random Forest from the ranger package
12 learner_m = lrn("regr.ranger", num.trees=500, min.node.size=2,
        max.depth=5)
13 dml_plr = DoubleMLPLR$new(dml_data,
14                           ml_m = learner_m,
15                           ml_g = learner_g,
16                           score = "partialling out",
17                           n_folds = 5, n_rep = 1)
18 # Perform the ATE estimation and print the results
19 dml_plr$fit()
20 dml_plr$summary()

Emmanuel Flachaire Causal Machine learning


Application 3: ATE in a PLR model

20 > dml_plr$summary()
21 Estimates and significance testing of the effect of target variables
22    Estimate.  Std. Error  t value  Pr(>|t|)
23 tg -0.07396   0.03540     -2.089   0.0367 *
24 ---
25 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hence, we can reject H0 : α = 0 at the 5% significance level


It is consistent with the findings of previous studies that have
analysed the Pennsylvania Bonus Experiment
The ATE on unemployment duration is negative and
significant

Emmanuel Flachaire Causal Machine learning


Application 3: ATE in an IRM model
26 ## Initialization of the IRM Model
27 # Classifier for propensity score
28 learner_classif_m = lrn("classif.ranger", num.trees = 500,
        min.node.size = 2, max.depth = 5)
29 dml_irm = DoubleMLIRM$new(dml_data,
30                           ml_m = learner_classif_m,
31                           ml_g = learner_g,
32                           score = "ATE",   # or "ATTE"
33                           n_folds = 10, n_rep = 1)
34 # Perform the estimation and print the results
35 dml_irm$fit()
36 dml_irm$summary()

37 Estimates and significance testing of the effect of target variables
38    Estimate.  Std. Error  t value  Pr(>|t|)
39 tg -0.07345   0.03549     -2.069   0.0385 *
40 ---
41 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated coefficient is very similar to the estimate of the ATE


in a PLR model and the conclusions remain unchanged.
Emmanuel Flachaire Causal Machine learning
Estimation of heterogeneity: Causal Forest

Causal Random Forest:84


Random Forest is modified to estimate the CATE directly
Grow a tree and evaluate its performance based on TE
heterogeneity rather than predictive accuracy
The idea is to find leaves where the treatment effect is
constant but different from other leaves
Split criterion: maximize heterogeneity in TE between leaves
Honest tree: build the tree and estimate the CATE from different samples

→ valid estimation and confidence intervals for CATE85 (see the grf sketch below)

84
Wager and Athey (2018) pdf , Athey, Tibshirani and Wager (2019) pdf
85
RF predictions are asymptotically unbiased and Gaussian, but convergence rates are below √n and they do not account for the uncertainty due to sample splitting
Emmanuel Flachaire Causal Machine learning
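A minimal causal forest sketch with the grf package on simulated data; the data-generating process and sample sizes are illustrative.

# Illustrative causal forest with grf on simulated data.
library(grf)
set.seed(1)
n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)                           # randomized treatment
Y <- pmax(X[, 1], 0) * W + X[, 2] + rnorm(n)     # heterogeneous effect in X1
cf <- causal_forest(X, Y, W)                     # honest trees by default
average_treatment_effect(cf)                     # ATE with standard error
head(predict(cf, estimate.variance = TRUE))      # CATE and variance estimates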
Detection and analysis of heterogeneity: Generic ML

Generic Machine Learning:86


Do not attempt to get valid estimation and inference on the
CATE itself, but on features of the CATE
Obtain ML proxy predictor of CATE (auxiliary set) and target
features of CATE based on this proxy predictor (main set)

Main interests:
Test if there is evidence of heterogeneity (BLP)

ATE for the 20% most (least) affected individuals? (GATES)

Which covariates are associated with TE heterogeneity? (CLAN)

→ valid estimation and inference on features of CATE

86
Chernozhukov, Demirer, Duflo and Fernández-Val (2020) pdf

Emmanuel Flachaire Causal Machine learning


Generic ML: Proxies of CATE
The main idea is to compute imperfect predictions of CATE and to
use them as proxies to make inferences on features of CATE:
Split the sample into a main set and auxiliary set (50/50 split)
Fit y ≈ m(1, X ) with treated group from the auxiliary sample
Fit y ≈ m(0, X ) with control group from the auxiliary sample
Compute Ŝ(Xi) = m̂(1, Xi) − m̂(0, Xi) from the main sample
Ŝ(X) is used to learn about treatment effect heterogeneity (a sketch of this construction follows below)

To control the uncertainty due to data splitting, this process is


done many times → cross-fitting87
The Ŝ(Xi) are imperfect predictions of CATEi → proxies88

87
We randomly split the sample M times. The parameter estimates,
confidence bounds, and p-values reported are the medians across M splits.
88
CATEi = E[y1 − y0 |Xi ] = m(1, Xi ) − m(0, Xi )
Emmanuel Flachaire Causal Machine learning
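A sketch of this proxy construction for a single split, using a random forest as the ML learner; the data and names are illustrative (the GenericML package mentioned in the conclusion automates the whole procedure, including the repeated splits).

# Proxy-CATE construction for one split (illustrative).
library(ranger)
set.seed(123)
n <- 4000
X <- matrix(runif(n * 5), n, 5)
D <- rbinom(n, 1, 0.5)
Y <- X[, 1] * D + X[, 2] + rnorm(n)
dat  <- data.frame(Y = Y, X)
aux  <- sample(n, n / 2)                         # auxiliary set
main <- setdiff(seq_len(n), aux)                 # main set

fit1 <- ranger(Y ~ ., data = dat[intersect(aux, which(D == 1)), ])  # m(1, X)
fit0 <- ranger(Y ~ ., data = dat[intersect(aux, which(D == 0)), ])  # m(0, X)
S <- predict(fit1, dat[main, ])$predictions -
     predict(fit0, dat[main, ])$predictions      # proxy CATE S_hat(X_i)
summary(S)   # used on the main set for the BLP, GATES and CLAN analyses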
Causal Machine Learning: A brief roadmap

Source: Gaillac and L’Hour (2021)

Emmanuel Flachaire Causal Machine learning


Underlying assumptions
Standard hypotheses: SUTVA, CIA and CSC
Common support condition (CSC): 0 < P(di = 1|Xi = x) < 1
ML estimation often provides better predictions
Adding covariates makes matching more difficult

Strittmatter and Wunsch (2021) The gender pay gap revisited with
big data: Do methodological choices matter?
Trimming in experiments vs. decomposition methods
→ Beware of CSC when moving away from RCT framework
Emmanuel Flachaire Causal Machine learning
Conclusion

The impact of ML for public policy evaluation:

Dealing with many covariates (p ≫ n)


Relying less on a priori specification
Taking care of heterogeneity
However, do not forget underlying assumptions! (CSC)

A technical literature, but implementation is becoming easier:


- Double Lasso: R package hdm
- Double Machine Learning: R package DoubleML
- Generic Machine Learning: R package GenericML
- Generalized Random Forest: R package grf

An effervescent empirical and theoretical literature

Emmanuel Flachaire Causal Machine learning


Selected references in Causal ML
Athey (2017) Beyond prediction: Using big data for policy problems, Science
Athey (2018) The impact of machine learning on economics
Athey, Tibshirani and Wager (2019) Generalized random forest, Ann. Statis.
Bach, Chernozhukov and Spindler (2021) Closing the U.S. gender wage gap
requires understanding its heterogeneity, arXiv:1812.04345
Belloni, Chernozhukov and Hansen (2014) Inference on treatment effects after
selection amongst high-dimensional controls, REStud
Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey and Robins (2018)
Double/debiased ML for treatment and structural parameters. Econometrics J.
Chernozhukov, Demirer, Duflo and Fernández-Val (2020) Generic ML inference
on heterogeneous treatment effects in randomized experiments, arXiv:1712.04802
Gaillac and L’Hour (2020) Machine Learning for Econometrics, Lecture notes
Kleinberg, Ludwig, Mullainathan and Obermeyer (2015) Prediction Policy
Problems, AER P&P
L’Hour (2020), L’économétrie en grande dimension. INSEE M2020-01
Strittmatter (2020) What is the value added by using causal machine learning
methods in a welfare experiment evaluation.
Strittmatter and Wunsch (2021) The gender pay gap revisited with big data:
Do methodological choices matter? arXiv:2102.09207
Wager and Athey (2018) Estimation and inference of heterogeneous treatment
effects using random forests. JASA

Emmanuel Flachaire Causal Machine learning


References
Berk (2016) Statistical Learning from a Regression Perspective.
Springer Texts in Statistics, ch.3-7
Charpentier (2018), Classification from scratch website

Charpentier, Flachaire and Ly (2018), Econometrics and Machine


Learning, Economics and Statistics, 505 english french

Efron and Hastie (2016) Computer Age Statistical Inference,


Cambridge University Press, ch.17-19 pdf
Hastie, Tibshirani and Friedman (2009) The Elements of Statistical
Learning. Springer, ch.7, 9-12, 15-16 website pdf

Hastie, Tibshirani and Wainwright (2015) Statistical Learning with


Sparsity: The Lasso and Generalizations. CRC Press, website pdf

James, Witten, Hastie and Tibshirani (2021) An Introduction to


Statistical Learning. Springer, ch.2,5,6,8,9 website pdf

Watt, Borhani and Katsaggelos (2016) Machine Learning Refined.


Cambridge University Press, ch.1-6
Emmanuel Flachaire Causal Machine learning
