
Summary of Big Data: New Tricks for Econometrics (Hal R. Varian)


Prepared by Harshit Goyal (2021MT10143)
Some terms:
x is also called a predictor, feature, or explanatory variable in this paper.

Overfitting / High Variance: the model picks up noise in the data, so it performs well on the training data but generalises poorly.

Overfitting and Underfitting

General Considerations for Prediction


Our goal is to get good out-of-sample predictions, i.e. a model that generalises well; simply put, one that makes good predictions on unseen examples.

One example: n linearly independent regressors will fit n observations perfectly but will usually have poor out-of-sample performance, since n linearly independent equations in n variables (the regression coefficients here) always have an exact solution, noise included.

Solving Overfitting
1. Regularization (see the sketch after this list): penalize models for excessive complexity, as simpler models tend to work better for out-of-sample forecasts.

2. An explicit numeric measure of model complexity: hyperparameters. For example, the degree of the polynomial you want to fit your data with.

3. Splitting the dataset into Training, Validation, and Testing sets. Use the training data to estimate a model, the validation data to choose your model, and the testing data to evaluate how well your chosen model performs. (Often the validation and testing sets are combined.)
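
A minimal sketch of regularization in R, the language of the author's code. The glmnet package and the simulated data here are assumptions for illustration, not something the paper prescribes: ridge regression penalizes large coefficients, and the penalty weight lambda is itself chosen by cross-validation.

```r
# Ridge regression as an example of regularization: the penalty lambda
# shrinks the coefficients, trading a little bias for lower variance.
library(glmnet)

set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)   # 50 candidate regressors, mostly noise
y <- x[, 1] + rnorm(n)            # only the first regressor matters

# alpha = 0 selects the ridge penalty; cv.glmnet chooses lambda by
# 10-fold cross-validation (the tuning procedure described below).
fit <- cv.glmnet(x, y, alpha = 0)
coef(fit, s = "lambda.min")       # shrunken coefficient estimates
```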
Tuning the model: k-fold cross-validation

Algorithm:
1. Divide the data into k roughly equal subsets (folds) and label them by s = 1, ..., k. Start with subset s = 1.
2. Pick a value for the tuning parameter.
3. Fit your model using the k − 1 subsets other than subset s.
4. Predict for subset s and measure the associated loss.
5. Stop if s = k; otherwise increment s by 1 and go to step 2.

Notice: we test on a fold we didn't use for training, so it gives us an idea of out-of-sample performance. Even if there is no tuning parameter, it is prudent to use cross-validation to report goodness-of-fit.

Common choices for k are 10, 5, and the sample size minus 1 ("leave one out").

Another use case: when the dataset is small and you don't want to waste it by splitting it into Training, Validation, and Testing sets.
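
A hand-rolled R sketch of the algorithm above; the simulated data and the choice of polynomial degree as the tuning parameter are assumptions for illustration.

```r
# k-fold cross-validation for choosing a polynomial degree.
set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

k <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))  # step 1: label the folds

cv_loss <- function(degree) {
  losses <- numeric(k)
  for (s in 1:k) {                                 # steps 3-5
    fit  <- lm(y ~ poly(x, degree), data = df[folds != s, ])
    pred <- predict(fit, newdata = df[folds == s, ])
    losses[s] <- mean((df$y[folds == s] - pred)^2) # squared-error loss
  }
  mean(losses)
}

sapply(1:6, cv_loss)  # step 2 for each candidate degree; pick the minimum
```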

For many years, economists reported in-sample goodness-of-fit measures using the excuse that we had small datasets. But now that larger datasets have become available, it is good practice to split the dataset.


Classification and Regression Trees
Economists would typically use a generalized linear model like a logit or probit for a classification problem, but these can draw only a linear decision boundary! We want to build a non-linear classifier.

Trees tend to work well for problems where there are important nonlinearities and
interactions.

Decision Tree Partition Plot


The tree above is constructed using just 2 features: age and class of travel.
The class (lived or died) mentioned on the leaves is the majority class.
Let's see how to interpret it and make a prediction:

age: 34, class: 1 → predicted outcome: lived

Training accuracy: 723/1046 = 69.12%

The paper mentions: "The rule fits the data reasonably well, misclassifying about 30 percent of the observations in the testing set."
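
A sketch of how such a tree is grown in R with the rpart package. The data frame `titanic` and its column names (`survived`, `age`, `class`) are assumptions; the paper's own data and code are linked at the end.

```r
# Grow and inspect a classification tree on the two features used above.
library(rpart)

fit <- rpart(survived ~ age + class, data = titanic, method = "class")
plot(fit); text(fit)   # produces a partition plot like the one above

# Predict for a 34-year-old first-class passenger (level name assumed):
predict(fit, newdata = data.frame(age = 34, class = "1st"), type = "class")

# In-sample (training) accuracy, 723/1046 ≈ 69% in the paper's fit:
mean(predict(fit, type = "class") == titanic$survived)
```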


Decision Tree vs Logistic Regression
By constructing a tree relating survival to age alone, the rule generated is "survive if age < 8.5 years", while Logistic Regression produces:

Logistic Regression of Survival versus Age

The logistic model says age is barely important.
age

This is resolved as follows: in the diagram below, survival rates for the youngest passengers were relatively high, and survival rates for older passengers were relatively low. So what mattered for survival is not so much age, but whether the passenger was a child or elderly. It would be difficult to discover this pattern from a logistic regression alone.
Titanic Survival Rates by Age Group
Trees also handle missing data well.
Perlich, Provost, and Simonoff (2003) examined several standard datasets and found that "logistic regression is better for smaller data sets and tree induction for larger data sets."
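
A sketch of the comparison in R, using the same assumed `titanic` data frame: the logistic regression must fit one monotone curve in age, while the tree may split the age range wherever that reduces the loss.

```r
# Logistic regression: a single coefficient summarizes the age effect,
# which comes out small because the true effect is not monotone.
logit_fit <- glm(survived ~ age, family = binomial, data = titanic)
summary(logit_fit)

# Tree on the same single feature: free to find "age < 8.5" style splits.
library(rpart)
tree_fit <- rpart(survived ~ age, data = titanic, method = "class")
print(tree_fit)
```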

Pruning
We can keep branching the tree further and get better training accuracy, but this will simply overfit.

The solution to overfitting is to add a cost to complexity.

One measure of complexity in a tree is the number of leaves; another is the depth of the tree.

The complexity penalty is typically chosen using 10-fold cross-validation.

"Some researchers recommend being a bit more aggressive and advocate choosing the
complexity parameter that is one standard deviation lower than the loss-minimizing
value."
Statistical Method: Conditional Inference Tree (ctree)

ctree chooses the structure of the tree using a sequence of hypothesis tests.


The resulting trees tend to need very little pruning (Hothorn, Hornik, and Zeileis 2006).

Conditional Inference Tree


sibsp: number of siblings plus spouses aboard.
One might summarize this tree by the following principle: "women and children first . . . particularly if they were traveling first class."
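
A sketch with the party package, which the mortgage example below also uses; the column names (`sex`, `sibsp`, etc.) are assumed to match the paper's dataset.

```r
# ctree picks each split with a permutation-based hypothesis test and
# stops splitting when no test is significant, so little pruning is needed.
library(party)

ct <- ctree(survived ~ age + sex + class + sibsp, data = titanic)
plot(ct)  # a conditional inference tree like the figure above
```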

An Economic Example Using Home Mortgage Disclosure Act Data

Question:
"Did race play a significant role in determining who was approved for a mortgage?"

Logistic Regression:
The result of the logistic regression: the coefficient of race showed a statistically significant negative impact on the probability of getting a mortgage for black applicants, which later prompted considerable debate and discussion.
Ctree:
2,380 observations of 12 predictors; used the R package party.

Model                 Misclassified examples   Error rate
Logistic Regression   228                      9.6%
ctree                 225                      9.5%

Conditional Inference Tree for House Mortgage data


dmi (denied mortgage insurance): this variable alone explains much of the variation in the data; see the figure.
The race variable ("black") shows up far down the tree.
So how do we infer whether race is decisive?
When race is not used as a feature to construct the ctree, the accuracy doesn't change at all.
But it's possible that there is
racial discrimination elsewhere in the mortgage process,
or that some of the variables included are highly correlated with race.
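
A sketch of that accuracy check, assuming a data frame `hmda` whose columns include the outcome `deny` and a race indicator `black` (the real dataset's variable names may differ).

```r
library(party)

# Fit the tree with and without the race variable...
with_race    <- ctree(deny ~ ., data = hmda)
without_race <- ctree(deny ~ ., data = hmda[, names(hmda) != "black"])

# ...and compare in-sample error rates (about 9.5% in the paper).
mean(predict(with_race)    != hmda$deny)
mean(predict(without_race) != hmda$deny)
```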
Boosting, Bagging, Bootstrap
Adding randomness turns out to be a helpful way of dealing with the overfitting
problem.
Bootstrap: choosing (with replacement) a sample of size n from a dataset to estimate the sampling distribution of some statistic. A variation is the m-out-of-n bootstrap (n > m).
Bagging : Averaging across models estimated with several different bootstrap samples in
order to improve the performance of an estimator.
Boosting : Repeated estimation where misclassified observations are given increasing
weight in each repetition. The final estimate is then a vote or an average across the
repeated estimates.
Econometricians are well-acquainted with the bootstrap but rarely use the other two methods.
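
A base-R sketch of the first two ideas on the assumed `titanic` data; boosting, which reweights misclassified observations each round, is omitted here (Friedman's gradient boosting, mentioned below, is one implementation).

```r
library(rpart)
set.seed(1)
n <- nrow(titanic)

# Bootstrap: one resample of size n, drawn with replacement.
boot_idx <- sample(n, n, replace = TRUE)

# Bagging: majority vote over trees fit to B bootstrap samples.
B <- 100
votes <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  fit <- rpart(survived ~ age + class, data = titanic[idx, ],
               method = "class")
  as.character(predict(fit, newdata = titanic, type = "class"))
})
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == titanic$survived)  # bagged in-sample accuracy
```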

Random Forests
Algorithm
1. Choose a bootstrap sample of the observations and start to grow a tree.
2. At each node of the tree, choose a random sample of the predictors to make the next
decision. Do not prune the trees.
3. Repeat this process many times to grow a forest of trees.
4. In order to determine the classification of a new observation, have each tree make a classification and use a majority vote for the final prediction.

This method produces surprisingly good out-of-sample fits, particularly with highly
nonlinear data.
Howard and Bowles (2012) claim "ensembles of decision trees (often known as 'Random Forests') have been the most successful general-purpose algorithm in modern times."
There are a number of variations and extensions of the basic "ensemble of trees" model, such as Friedman's "Stochastic Gradient Boosting" (Friedman 2002).
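
A sketch using the randomForest package, one standard R implementation of this algorithm; the package choice and column names are assumptions.

```r
library(randomForest)

# Each tree sees a bootstrap sample of the data and a random subset of
# predictors at each split (mtry); trees are grown without pruning.
rf <- randomForest(survived ~ age + sex + class + sibsp,
                   data = titanic, ntree = 500, na.action = na.omit)
print(rf)        # confusion matrix and out-of-bag error estimate
importance(rf)   # which predictors the forest relies on
```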
To replicate the experiments: the author's datasets and R code, used to build the decision trees and random forests, can be found here.


References
Images:
Overfitting and Underfitting: https://in.mathworks.com/discovery/overfitting.html
All other images are taken from Big Data: New Tricks for Econometrics.
