Machine Learning and Econometrics
Emmanuel Flachaire
Aix-Marseille University, AMSE
https://egallic.fr/ECB
1 Léo Breiman, Statistical Science, 2001, Vol. 16, No. 3, 199-231
Statistical Modeling: The two Cultures
y = Xβ + ε
y ≈ m(X )
Newton's Method
Use quadratic approximations at each step, from a Taylor expansion3
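As a toy illustration, here is a minimal R sketch (not the slides' code) of Newton's method for a one-dimensional objective: each quadratic Taylor approximation yields the update x ← x − f′(x)/f″(x).

# Minimal sketch of Newton's method in one dimension (hypothetical example):
# minimize f(x) = x^4 - 3x^3 + 2, with f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x
newton <- function(grad, hess, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (it in 1:maxit) {
    step <- grad(x) / hess(x)        # Newton step from the quadratic approximation
    x <- x - step
    if (abs(step) < tol) break       # stop when the update is negligible
  }
  x
}
newton(grad = function(x) 4*x^3 - 9*x^2, hess = function(x) 12*x^2 - 18*x, x0 = 3)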
Let us consider:
L = ℓ2 (Euclidean distance): L(yi, m(Xi)) = (yi − m(Xi))²
m is a linear function of the parameters: yi ≈ Xi β with β ∈ R^p
no penalization: λ = 0
Thus, we have:
β̂ = argmin Σ_{i=1}^n (yi − Xi β)²,
y = β0 + β1 x + ε    (3)
5 In the sense that it minimizes prediction errors.
6 Convergence, unbiased/biased estimators, BLUE, statistical tests, etc.
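As a minimal illustration (hypothetical simulated data, not the slides' code), the least-squares fit of model (3) can be obtained with lm(), which solves the argmin above:

# Minimal sketch: OLS fit of model (3) on hypothetical simulated data
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)        # hypothetical true values: beta0 = 1, beta1 = 2
ols <- lm(y ~ x)
coef(ols)                        # estimates of beta0 and beta1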
Classification: a simple Machine Learning method
Xi β > 0 if yi = +1
Xi β < 0 if yi = −1
where max(0, −yi (Xi β)) is the perceptron or max loss function.
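A minimal sketch of the classic perceptron updates on hypothetical simulated data (not the slides' code): whenever an observation is misclassified, i.e. yi (Xi β) ≤ 0, the coefficient vector is updated by β ← β + yi Xi.

# Minimal perceptron sketch (hypothetical, roughly separable data, labels in {-1,+1})
perceptron <- function(X, y, n_pass = 25) {
  beta <- rep(0, ncol(X))
  for (pass in 1:n_pass)
    for (i in 1:nrow(X))
      if (y[i] * sum(X[i, ] * beta) <= 0) beta <- beta + y[i] * X[i, ]  # update on mistakes
  beta
}
set.seed(1)
X <- cbind(1, matrix(rnorm(200), 100, 2))      # intercept + 2 covariates
y <- ifelse(X %*% c(0, 1, -1) > 0, 1, -1)      # hypothetical separable labels
beta_hat <- perceptron(X, y)
mean(sign(X %*% beta_hat) == y)                # share of correct classifications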
Classification: smooth version of the perceptron
with yi′ ∈ {0, 1} and Λ(x) = e^x / (1 + e^x) = 1 / (1 + e^{−x}) is the logistic function11
Let us consider:12
the softmax loss function: L = softmax(0, −yi (Xi β))
no penalization: λ = 0.
Thus, we have:13
β̂ = argmin Σ_{i=1}^n log(1 + e^{−yi (Xi β)}),
14 Since y′ ∈ {0, 1}, then E(y′|X) = 0 × P(y′ = 0) + 1 × P(y′ = 1)
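A minimal check on hypothetical simulated data (not the slides' code): the logit fit from glm() coincides with the direct minimization of Σ log(1 + e^{−yi (Xi β)}) once y′ ∈ {0, 1} is recoded as y ∈ {−1, +1}.

# Minimal sketch: glm logit vs. direct minimization of the softmax (logistic) loss
set.seed(1)
n  <- 200
x  <- rnorm(n)
y0 <- rbinom(n, 1, plogis(1 - 2 * x))       # y' in {0,1}, hypothetical DGP
fit_glm <- glm(y0 ~ x, family = binomial)

y   <- 2 * y0 - 1                           # recode to {-1,+1}
X   <- cbind(1, x)
nll <- function(beta) sum(log(1 + exp(-y * (X %*% beta))))
fit_opt <- optim(c(0, 0), nll, method = "BFGS")
cbind(glm = coef(fit_glm), optim = fit_opt$par)   # same estimates, up to optimizer tolerance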
Linear/Logit models from a Machine learning perspective
Machine Learning:
High non-linearity and strong interaction effects are taken into
account with automatic feature design.
In general, a non-convex function is minimized.
Nonparametric Econometrics:
A nonparametric regression takes such effects into account.
It may work well in small dimensions, but not in high dimensions.17
17 Because of the curse of dimensionality. Note that GAM models may automatically capture non-linearities, but not interaction effects.
1. Introduction and General Principle
The two Cultures
Loss function and penalization
In-sample, out-sample and cross validation
A model with high flexibility may fit the observations used for estimation perfectly, but new observations very poorly
[Figure: estimated curve with λ = 0 and with λ = ∞, each compared with the true model]
The best model has the lowest prediction error. With squared error loss, the mean squared prediction error is equal to
MSE = (1/n) Σ_{i=1}^n (yi − m̂λ(xi))² = SSR/n
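A minimal sketch (hypothetical data, not the slides' code) of the difference between in-sample and out-of-sample MSE, using a training/test split:

# Minimal sketch: a flexible model fits the training sample well but predicts new data poorly
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n) / 2
train <- sample(n, n / 2)
fit <- lm(y ~ poly(x, 10), subset = train)               # very flexible fit
mse_in  <- mean((y - predict(fit, data.frame(x = x)))[train]^2)    # in-sample MSE
mse_out <- mean((y - predict(fit, data.frame(x = x)))[-train]^2)   # out-of-sample MSE
c(mse_in, mse_out)                                       # out-of-sample MSE is typically larger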
Linear regression
y = Xβ + ε n observations, p covariates
Least Squares
Collinearity or many irrelevant covariates → high variance
More covariates than observations, p > n → undefined
Minimize over α, β:   Σ_{i=1}^n (yi − α − Xi β)² + λ Σ_{j=1}^p |βj|^q
It is equivalent to minimizing the SSR subject to Σ_{j=1}^p |βj|^q ≤ c
Ridge regression (q = 2):   minimize Σ_{i=1}^n (yi − α − Xi β)² + λ Σ_{j=1}^p βj²
Lasso regression (q = 1):   minimize Σ_{i=1}^n (yi − α − Xi β)² + λ Σ_{j=1}^p |βj|
Lasso constraint:   Σ_{j=1}^p |βj| ≤ c
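A minimal sketch with the glmnet package (hypothetical simulated data, not the slides' code): alpha = 0 gives the ridge penalty (q = 2), alpha = 1 the lasso penalty (q = 1), and cv.glmnet() selects λ by 10-fold cross-validation.

# Minimal ridge/lasso sketch on hypothetical data with only 2 relevant covariates
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)
cv.ridge <- cv.glmnet(x, y, alpha = 0)        # ridge
cv.lasso <- cv.glmnet(x, y, alpha = 1)        # lasso
cbind(ridge = coef(cv.ridge, s = "lambda.min")[, 1],
      lasso = coef(cv.lasso, s = "lambda.min")[, 1])   # lasso sets many coefficients to zero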
19 Breiman, Friedman, Stone, Olshen (1984), Classification and Regression Trees
Simulation results with uncorrelated covariates
21 Zou and Hastie (2005), JRSS Series B, 67, 301-320. It corresponds to the penalization λ1 Σ_{j=1}^p βj² + λ2 Σ_{j=1}^p |βj|, where λ1 and λ2 are selected by CV.
Adaptive Elastic-net
coef1=coef(ols)
coef2=coef(cv.ridge, s="lambda.min")
coef3=coef(cv.lasso, s="lambda.min")
coef4=coef(cv.lasso, s="lambda.1se")
coef5=coef(cv.adalasso, s="lambda.min")
coef6=coef(cv.adaelast, s="lambda.min")
options(scipen = 999)   # disable scientific notation
coeff=cbind(coef1, coef2, coef3, coef4, coef5, coef6)
colnames(coeff) <- c("ols", "ridge", "lasso", "lasso.1se", "adaLasso", "adaElastic")
coeff
24 See https://freakonometrics.hypotheses.org/52776
Classification Tree: graphical representation
Minimizing misclassification, we find x2 < k, where k ∈ (0.56, 0.8)
With a large sample, a tree may have many nodes, that is,
many points where a predictor is split into two leaves
An unpruned tree:
classifies correctly every observation from a training sample
may classify poorly observations from a test sample
(it is the standard problem of overfitting)
may be difficult to interpret
A pruned tree:
is a smaller tree with fewer splits
might perform better on a test sample
might have an easier interpretation
A fully grown tree fits the training data perfectly, but the test data poorly
∆ = G(N) − G(NL, NR) >
library(ISLR)
# remove observations that are missing Salary values
df=Hitters[complete.cases(Hitters$Salary),]
# load CART library
library(rpart)
library(rpart.plot)
# estimate the tree
tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df, cp=0)
# plot the tree
prp(tree, extra=1, faclen=5)
based on the same principle: find terminal nodes that minimize the SSR
Regression: Tree pruning
A fully grown tree fits the training data perfectly, but the test data poorly
27 λ is called the complexity parameter
Tree pruning: Application
tree <- rpart(log(Salary) ~ Years+Hits+AtBat, data=df)   # CV
prp(tree, extra=1, faclen=5)
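A minimal sketch of cost-complexity pruning, assuming the rpart object tree estimated above (rpart stores its internal cross-validation results in cptable):

# Minimal pruning sketch: pick the cp value with the lowest cross-validated error (xerror)
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best.cp)
prp(pruned, extra = 1, faclen = 5)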
Regression tree: m(X) = Σ_{j=1}^J cj 1(x ∈ Rj)
[Figure: Variable 1 vs. Variable 2 — blue area ≡ fitted values from linear (left) and tree (right) models]
Advantages:
Disadvantage:
If one diagnosis occurs more often than any other, you may choose it as the final diagnosis
Algorithm 1: Bagging
Select number of trees B, and tree depth D;
for b = 1 to B do
generate a bootstrap sample from the original data;
estimate a tree model of depth D on this sample;
end
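A minimal sketch of bagging with the randomForest package (hypothetical data frame train_df with numeric response y, not the slides' code): bagging is a random forest that uses all p covariates at each split (mtry = p).

# Minimal bagging sketch (train_df is hypothetical)
library(randomForest)
p   <- ncol(train_df) - 1                       # hypothetical: all columns except y are covariates
bag <- randomForest(y ~ ., data = train_df, ntree = 500, mtry = p)
predict(bag)                                    # out-of-bag predictions: average over the B trees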
[Figure: individual classification trees grown on bootstrap samples, with splits on sex, age, pclass, sibsp and parch]
Prediction: average the B tree predictions (regression) or take a majority vote (classification)
Impact of bootstrapping: averaging over bootstrap samples reduces the variance of the predictions28
28 The variance of the mean of the observations X̄ is given by σ²/n
Random Forest: algorithm
library(randomForest)   # needed for randomForest(); x and y are assumed defined from the simulated sample
maxnode=c(10, 50, 100, 500, 1000, 2000)
for (i in 1:NROW(maxnode)) {   # OOB-MSE, depth control
  aa=randomForest(y ~ x, maxnodes=maxnode[i])$mse[500]
  print(c(maxnode[i], aa)) }
[1]   10.0000000 0.3747725
[1]   50.0000000 0.2553131
[1]  100.0000000 0.2508479
[1]  500.0000000 0.2570217
[1] 1000.000000  0.268357
[1] 2000.0000000 0.2921307
Advantages:
They are robust to the original sample and more efficient than
single trees
Disadvantage:
30 Each subsequent model pays more attention to the errors from previous models . . . it is a process that learns from past errors.
31 Weak classifier: its error rate is only slightly better than random guessing.
Boosting for Regression
y ∈ R is a quantitative variable
Number of trees, B:
Depth of trees, D:
32 By averaging over a large number of trees, bagging and random forests reduce variability. Boosting does not average over the trees.
Boosting: Shrinkage
33 Typical values are λ = 0.01 or λ = 0.001.
34 By default, D = 1 and λ = 0.1 in the gbm function in R.
Boosting: a simple regression example
Let us consider a realistic (simulated) sample
set.seed(1)
n=200
x=runif(n)
y=sin(12*(x+.2))/(x+.2) + rnorm(n)/2
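A minimal sketch (not the slides' code) of a boosting fit on this simulated sample with the gbm package, using stumps (interaction.depth = 1) and shrinkage λ = 0.01:

# Minimal boosting sketch on the simulated sample above
library(gbm)
df.sim <- data.frame(x = x, y = y)
boost  <- gbm(y ~ x, data = df.sim, distribution = "gaussian",
              n.trees = 5000, interaction.depth = 1, shrinkage = 0.01)
plot(x, y, pch = 16, col = "grey")
lines(sort(x), predict(boost, df.sim, n.trees = 5000)[order(x)], lwd = 2)   # boosted fit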
(Source: Bishop (2006), Pattern Recognition and Machine Learning, Figure 14.1)
(Source: Bishop (2006), Pattern Recognition and Machine Learning, Figure 14.2)
→ Gradient boosting
L = |yi − m(xi )|
L = 1(sign[m(x)] ≠ y)
L = e −ym(x)
L = log(1 + e −2ym(x) )
→ More weight for obs. more clearly misclassified (large negative ym(x))
→ When robustness is a concern, the exponential loss is not the best criterion
Tuning parameters
37 Very small λ can require very large B in order to achieve good performance.
Boosting: Interpretation
The data for this example consists of information from 4601 email
messages, in a study to try to predict whether the email was spam.
The response variable is binary, with values email or spam, and
there are 57 predictors as described below:
48 quantitative predictors: the percentage of words in the email
that match a given word40
6 quantitative predictors: the percentage of characters in the email
that match a given character (; ! # ( [ $)
Uninterrupted sequences of capital letters: average length (CAPAVE),
length of the longest (CAPMAX), sum of the length (CAPTOT)
40 Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users (Hastie et al., 2009).
Application: spam email
→ effect of Xj on m(X ) after accounting for the average effects of the other variables
It is an optimization problem:
41 We get rid of ‖β‖ = 1 by replacing the restriction with yi (Xi β) ≥ M‖β‖ and setting ‖β‖ = 1/M, see Hastie et al. (2009, section 4.5.2).
Sensitivity to individual observations
Adding one blue observation leads to a quite different hyperplane,
with a significant decrease of the distance between the two margins
Underlying principles:42
SVC: maximal margin classifier, tolerating margin violations
Logit: minimize misclassification error
42 Figures: logit ≈ SVC (left); logit = solid line & SVC = dashed line (right).
How to tolerate margin violations?
It is a slightly modified optimization problem:
44 Max margin: pushing the observations as far as possible away from the hyperplane. Min misclassification: smallest aggregated distance from the hyperplane for the wrongly classified observations.
Support Vector Classifier
The non-separable case
Using the projection ϕ : x ↦ ( (x−150)/10 , ((x−150)/10)² ), we obtain:
In the resolution, ϕ only appears in the form ϕ(Xi)ᵀ ϕ(Xj). Thus,
K(X, X′) = (1 + ⟨X, X′⟩) = (1 + X1 X1′ + X2 X2′)
46 With two n-vectors, the inner product is ⟨x1, x2⟩ = x1ᵀ x2 = Σ_{i=1}^n x1i x2i.
Polynomial kernel
47 Bias-variance tradeoff: a large value of γ leads to high variance (overfitting), a small value leads to low variance and smoother boundaries (underfitting).
Illustration: Simulated data
set.seed(1)
x=matrix(rnorm(200*2), ncol=2)
x[1:100,]=x[1:100,]+2
x[101:150,]=x[101:150,]-2
y=c(rep(1,150), rep(2,50))
plot(x[,2], x[,1], pch=16, col=y*2)
48 We then have 2 tuning parameters, the cost of constraint violations and γ.
Illustration: Polynomial vs. radial kernels
We compute scores of the form f̂(Xi) = ϕ(Xi)β̂ for each observation.
Scores = predicted values.
We can fit an SVM with a radial kernel and plot a ROC curve:
set.seed(1)
train=sample(200, 100)
train=sort(train, decreasing=TRUE)   # to avoid reverse ROC
svmfit=svm(y ~ ., data=dat[train,], kernel="radial", cost=1, gamma=0.5)
fit=attributes(predict(svmfit, dat[-train,], decision.values=TRUE))$decision.values
rocplot(fit, dat[-train, "y"], main="Test Data", col="red")
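The rocplot() helper is not defined in the listing; a minimal version, assuming the ROCR package (as in the James et al. 2021 labs), is:

# Minimal rocplot() helper based on ROCR (an assumption, not the slides' definition)
library(ROCR)
rocplot <- function(pred, truth, ...) {
  predob <- prediction(pred, truth)
  perf   <- performance(predob, "tpr", "fpr")
  plot(perf, ...)
}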
y ≈ α + βx
[Figures: transformations u, u², u³ of x and the fitted curve for y; single-neuron transformations f(α1 + δ1x), f(α2 + δ2x), f(α3 + δ3x) and the fitted combination α + β1 f(α1 + δ1x) + β2 f(α2 + δ2x) + β3 f(α3 + δ3x)]
53 With M = 1 and the logistic activation function, it is a logit model.
Interaction effects
54 θ is the set of coefficients α, β, δ.
Backpropagation algorithm
55 Rumelhart et al.: Learning Internal Representations by Error Propagation.
Application 1: Mincer equation
library(AER); data("CPS1985")
CPS1985$gender=as.numeric(CPS1985$gender)
library(neuralnet)
nn=neuralnet(log(wage) ~ education+experience+gender, data=CPS1985, hidden=3, threshold=.05)
plot(nn)
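For comparison (not in the slides' listing), the corresponding linear Mincer regression on the same covariates is:

# Linear Mincer equation on the same data, for comparison with the network fit
ols.mincer <- lm(log(wage) ~ education + experience + gender, data = CPS1985)
summary(ols.mincer)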
[Figure: plot(nn) — fitted network with inputs education, experience and gender, one hidden layer of 3 neurons, output log(wage), with estimated weights on the arrows]
From a single layer with many neurons to multiple layers with fewer neurons
[Figure: estimated weights of single-layer (many neurons) and multilayer (fewer neurons) networks for log(wage) on education, experience and gender]
56 L1: 785×256 = 200,960 parameters, L2: 257×128 = 32,896, and the 10-unit output layer: 129×10 = 1,290.
Dropout regularization
Solutions:
The network fails to recognize '8' when the digit is not centered
→ translation, scale and (small) rotation invariances are needed
The solution is convolution
61 A small distortion in the input will not change the output of pooling, since we take the maximum/average value in a local neighborhood.
62 This is very powerful since we can detect objects in an image no matter where they are located.
CNN: The classification step 3
Step 3: Make a final prediction with a fully-connected network
Source: Internet
yt ≈ α + At β
At = [f (α1 + Xt δ1 ), . . . , f (αM + Xt δM )]
A linear combination of nonlinear functions of linear combinations of Xt
Recurrent neural network:
Each time series provides many short mini-series of input
sequences X = {X1 , . . . , XL } of L periods, and a target Y
yt ≈ α + At β
See section 10.9.6 in James et al. (2021) for details of the implementation in R
Emmanuel Flachaire Neural Networks and Deep Learning
RNN and AR models
yt ≈ α + At β
Géron (2017)
Minimize over β:   Σ_{i=1}^n (yi − Xi β)² + λ Σ_{j=2}^p |βj|^q
It is equivalent to minimizing the SSR subject to Σ_{j=2}^p |βj|^q ≤ c
→ High-dimensional problems (p ≫ n)
Minimize over m:   Σ_{i=1}^n (yi − m(Xi))² + λ ∫ m″(x)² dx
It is equivalent to minimizing the SSR subject to ∫ m″(x)² dx ≤ c
Pros:
High-dimensional problems
Complex functional forms
However,
Black-box models
Prediction is not causation
y = β0 + β1 x + β3 x³ + ε    (5)
H0: y = β0 + β1 x + ε   vs.   H1: y = β0 + β1 x + β2 x² + ε
H0: y = β0 + β1 x + ε   vs.   H1: y = m(x) + ε
Parametric model:
y = Xβ + ε
Fully-nonparametric model:
y = m(X ) + ε
medv = X β + ε
medv = X β1 + X² β2 + X³ β3 + (X:X) β4 + ε
medv = m(X ) + ε
63 X = [chas, nox, age, tax, indus, rad, dis, lstat, crim, black, rm, zn, ptratio]
Application 1: Boston housing prices
library(MASS); library(randomForest); library(gbm); library(glmnet)
data(Boston); nobs=nrow(Boston)
set.seed(12345); nfold=10
Kfold=cut(seq(1,nobs), breaks=nfold, labels=FALSE)
mse.test=matrix(0, nfold, 4)
# generate X^2, X^3 and pairwise interactions for the Lasso
Xcol=colnames(Boston)[-14]
Xsqr=paste0("I(", Xcol, "^2)", collapse="+")   # squared covariates
Xcub=paste0("I(", Xcol, "^3)", collapse="+")   # cubic covariates
fmla=paste0("medv~(.)^2+", Xsqr, "+", Xcub)
X=model.matrix(as.formula(fmla), data=Boston)[,-1]
y=Boston[,14]
mysample=sample(1:nobs)   # random sampling (permutation)
for (i in 1:nfold) {      # K-fold CV
  cat("K-fold loop:", i, "\r")
  test=mysample[which(Kfold==i)]
  train=mysample[which(Kfold!=i)]
  # OLS, Lasso, Random Forest, Boosting
  fit.lm <- lm(medv~., data=Boston, subset=train)
  fit.la <- cv.glmnet(X[train,], y[train], alpha=1)
  fit.rf <- randomForest(medv~., data=Boston, subset=train, mtry=6)
  fit.bo <- gbm(medv~., data=Boston[train,], distribution="gaussian", interaction.depth=6)
  # out-sample MSE
  mse.test[i,1]=mean((Boston$medv-predict(fit.lm, Boston))[-train]^2)
  mse.test[i,2]=mean((y-predict(fit.la, X, s="lambda.min"))[-train]^2)
  mse.test[i,3]=mean((Boston$medv-predict(fit.rf, Boston))[-train]^2)
  mse.test[i,4]=mean((Boston$medv-predict(fit.bo, Boston))[-train]^2)
}
mse=colMeans(mse.test)   # test error
round(mse, digits=2)
[1] 23.93 14.88 10.16 10.34
y = g1(X1) + . . . + gp(Xp) + Z γ + ε
67 The Model Confidence Set (MCS) test can be used to test if the MSEs are significantly different (Hansen, Lunde and Nason 2011) pdf. Pairwise AUC can be used in classification (Candelon, Dumitrescu and Hurlin 2012) pdf.
Conclusion
Athey (2017) Beyond prediction: Using big data for policy problems
Pure prediction methods are not helpful for causal problems
Which patients should be given priority to receive surgery?
Estimating heterogeneity in the effect of surgery is required
68 ML is used to predict the probability that a candidate would die within a year from other causes, and to identify high-risk patients who shouldn't receive surgery.
High-dimensional parametric framework
y = dα + X β + ε
y = dα + X ∗ β + ε
y = dα + X ∗∗ β + ε
69 Belloni, Chernozhukov and Hansen (2014) pdf: uniformly valid confidence set for α despite imperfect model selection, and full efficiency for estimating α.
Post-selection inference: Partialling out
η̂y = η̂d α + ε
72 See Belloni, Chernozhukov and Hansen (2014) pdf: c = 1.1 for post-Lasso and c = 0.5 for Lasso, γ = .1 by default.
Bias of naive post-selection estimation
y = dα + X β + ε
library(hdm)
data("GrowthData")   # the 2nd column is a vector of ones
y=as.matrix(GrowthData)[, 1, drop=F]
d=as.matrix(GrowthData)[, 3, drop=F]
X=as.matrix(GrowthData)[, -c(1,2,3), drop=F]
# fit models
LS.fit=lm(y ~ d+X)
PO.fit=rlassoEffect(X, y, d, method="partialling out")
DS.fit=rlassoEffect(X, y, d, method="double selection")
# inference on coef of interest
LS=summary(LS.fit)$coefficients[2,]
PO=summary(PO.fit)$coefficients[1,]
DS=summary(DS.fit)$coefficients[1,]
rbind(ols=LS, double.selection=DS, partialling.out=PO)
74 It is not surprising given that p = 63 is comparable to n = 90.
Heterogeneous treatment effects: high-dimensions
If d is a treatment, we can consider heterogeneous effects as
y = dα(X ) + g (X ) + ε
y = dα + dX β + Z γ + ε
y is the log of the wage, d is a dummy for female
dX are the interactions between d and each covariate in X
Z includes 2-ways interactions of the covariates Z = [X , X :X ]
The target variable is female d, in combination with other
variables dX
Our main interest is to make inference on α and β
If β = 0: homogeneous wage gender gap given by α
If β ≠ 0: heterogeneous wage gender gap explained by X
76 For a recent application, see Bach, Chernozhukov and Spindler (2021) pdf.
library(hdm)
data(cps2012)
y <- cps2012$lnw
X <- model.matrix(~ -1 + female + female:(widowed + divorced + separated + nevermarried +
       hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) +
       (widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad +
       mw + so + we + exp1 + exp2 + exp3)^2, data = cps2012)
X <- X[, which(apply(X, 2, var) != 0)]   # exclude constant variables
index.gender <- grep("female", colnames(X))
effects.female <- rlassoEffects(x = X, y = y, index = index.gender)
summary(effects.female)
77 h may be redundant; it is the propensity score in the treatment-effects literature.
78 Since E(y|X) ≠ g(X), a ML fit of y on X is not a good estimate of g.
Homogeneous treatment effects: Partially linear model
Partially Linear Regression model PLR model
y = dα + g (X ) + ε
d = h(X ) + η
α is the target parameter, g and h are nuisance functions
Double Residuals (DR):
1 ML of y on X : compute residuals η̂y = y − ĝ (X )
2 ML of d on X : compute residuals η̂d = d − ĥ(X )
3 OLS of η̂y on η̂d → α̂
An application of FWL, or partialling out, with ML methods
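A minimal sketch of the three DR steps with random forests (hypothetical outcome y, treatment d and covariate matrix X; no cross-fitting, unlike the DML implementation below):

# Minimal double-residuals sketch with random forests (hypothetical y, d, X)
library(randomForest)
g.hat <- predict(randomForest(X, y))        # step 1: ML fit of y on X (OOB predictions)
h.hat <- predict(randomForest(X, d))        # step 2: ML fit of d on X
eta.y <- y - g.hat                          # residuals of y
eta.d <- d - h.hat                          # residuals of d
summary(lm(eta.y ~ eta.d))                  # step 3: OLS of residuals on residuals gives alpha.hat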
Robinson (1988) shows that with DR α̂ is √n-consistent, even if ĝ(X) and ĥ(X) are consistent at slower rates79
The role of DR is to immunize α̂ against ML estimates: α̂ is
based on residuals η̂y and η̂d , which are ⊥ to ĝ (X ) and ĥ(X )
79 Robinson considers kernel regression. Chernozhukov et al. (2018) pdf establish that any ML method can be used, so long as it is n^{1/4}-consistent.
The role of double residuals (orthogonalization)
Distribution of α̂ − α0
81 Chernozhukov et al. (2018) pdf and Chernozhukov et al. (2017) pdf.
82 Individuals in the treatment groups were offered a cash bonus if they found a job within some pre-specified period of time (qualification period), provided that the job was retained for a specified duration.
83 See the vignette and Bach, Chernozhukov, Kurz, Spindler (2021) pdf.
Application 3: ATE in a PLR model
library(DoubleML)
library(mlr3)
# Initialization of the Data-Backend
data=fetch_bonus(return_type="data.table")
y="inuidur1"
d="tg"
x=c("female","black","othrace","dep1","dep2","q2","q3","q4","q5","q6",
    "agelt35","agegt54","durable","lusd","husd")
dml_data=DoubleMLData$new(data, y_col=y, d_cols=d, x_cols=x)
# Initialization of the PLR Model
set.seed(31415)   # required to replicate sample split
learner_g=lrn("regr.ranger", num.trees=500, min.node.size=2,
              max.depth=5)   # Random Forest from the ranger package
learner_m=lrn("regr.ranger", num.trees=500, min.node.size=2,
              max.depth=5)
dml_plr=DoubleMLPLR$new(dml_data,
                        ml_m = learner_m,
                        ml_g = learner_g,
                        score = "partialling out",
                        n_folds = 5, n_rep = 1)
# Perform the ATE estimation and print the results
dml_plr$fit()
dml_plr$summary()
Estimates and significance testing of the effect of target variables
    Estimate. Std. Error t value Pr(>|t|)
tg   -0.07345    0.03549  -2.069   0.0385 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
84 Wager and Athey (2018) pdf, Athey, Tibshirani and Wager (2019) pdf.
85 RF predictions are asymptotically unbiased and Gaussian, but convergence rates are below √n and they do not account for the uncertainty due to sample splitting.
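A minimal causal forest sketch with the grf package (an assumption, not the slides' code), with hypothetical covariates X, outcome y and binary treatment d:

# Minimal causal forest sketch (grf package assumed; X, y, d hypothetical)
library(grf)
cf <- causal_forest(X, y, d)
head(predict(cf)$predictions)        # estimated CATE for each observation
average_treatment_effect(cf)         # ATE estimate with standard error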
Detection and analysis of heterogeneity: Generic ML
Main interests:
Test if there is evidence of heterogeneity (BLP)
86 Chernozhukov, Demirer, Duflo and Fernández-Val (2020) pdf.
87 We randomly split the sample M times. The parameter estimates, confidence bounds, and p-values reported are the medians across the M splits.
88 CATEi = E[y1 − y0 | Xi] = m(1, Xi) − m(0, Xi)
Causal Machine Learning: A brief roadmap
Strittmatter and Wunsch (2021) The gender pay gap revisited with
big data: Do methodological choices matter?
Trimming in experiments vs. decomposition methods
→ Beware of CSC when moving away from RCT framework
Conclusion