
Summary of Big Data: New Tricks for Econometrics (Hal R. Varian)


Prepared by Harshit Goyal (2021MT10143)
Some terms:
x is also called a predictor, feature, or explanatory variable in this paper.

Overfitting / High Variance: the model picks up noise in the data, so it performs well on the training data but generalises poorly.

Overfitting and Underfitting

General Considerations for Prediction


Our goal is to get good out-of-sample predictions, i.e. a model that generalises well; simply put, one that makes good predictions on unseen examples.

One example: n linearly independent regressors will fit n observations perfectly but will usually have poor out-of-sample performance, since n linearly independent equations in n variables (the regression coefficients here) always have an exact solution, noise included.

Solving Overfitting
1. Regularization (see the sketch after this list): penalize models for excessive complexity, as simpler models tend to work better for out-of-sample forecasts.

2. An explicit numeric measure of model complexity: hyperparameters. For example, the degree of the polynomial you want to fit your data with.

3. Splitting the dataset into Training, Validation, and Testing sets. Use the training data to estimate a model, the validation data to choose your model, and the testing data to evaluate how well your chosen model performs. (Often the validation and testing sets are combined.)
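
A minimal sketch of regularization in R, the language of the author's code. The glmnet package and the simulated data here are assumptions for illustration, not something the paper prescribes: ridge regression penalizes large coefficients, and the penalty weight lambda is itself chosen by cross-validation.

```r
# Ridge regression as an example of regularization: the penalty lambda
# shrinks the coefficients, trading a little bias for lower variance.
library(glmnet)

set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)   # 50 candidate regressors, mostly noise
y <- x[, 1] + rnorm(n)            # only the first regressor matters

# alpha = 0 selects the ridge penalty; cv.glmnet chooses lambda by
# 10-fold cross-validation (the tuning procedure described below).
fit <- cv.glmnet(x, y, alpha = 0)
coef(fit, s = "lambda.min")       # shrunken coefficient estimates
```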
Tuning the model: k-fold cross-validation

Algorithm:
1. Divide the data into k roughly equal subsets (folds) and label them by s = 1, ..., k. Start with subset s = 1.
2. Pick a value for the tuning parameter.
3. Fit your model using the k − 1 subsets other than subset s.
4. Predict for subset s and measure the associated loss.
5. Stop if s = k; otherwise increment s by 1 and go to step 2.

Notice: we test on a fold we didn't use for training, so it gives us an idea of out-of-sample performance. Even if there is no tuning parameter, it is prudent to use cross-validation to report goodness-of-fit.

Common choices for k are 10, 5, and the sample size minus 1 ("leave one out").

Another use case: when the dataset is small and you don't want to waste it by splitting it into Training, Validation, and Testing sets.
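
A hand-rolled R sketch of the algorithm above; the simulated data and the choice of polynomial degree as the tuning parameter are assumptions for illustration.

```r
# k-fold cross-validation for choosing a polynomial degree.
set.seed(1)
df <- data.frame(x = runif(200))
df$y <- sin(2 * pi * df$x) + rnorm(200, sd = 0.3)

k <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))  # step 1: label the folds

cv_loss <- function(degree) {
  losses <- numeric(k)
  for (s in 1:k) {                                 # steps 3-5
    fit  <- lm(y ~ poly(x, degree), data = df[folds != s, ])
    pred <- predict(fit, newdata = df[folds == s, ])
    losses[s] <- mean((df$y[folds == s] - pred)^2) # squared-error loss
  }
  mean(losses)
}

sapply(1:6, cv_loss)  # step 2 for each candidate degree; pick the minimum
```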

For many years, economists reported in-sample goodness-of-fit measures using the excuse that we had small datasets. But now that larger datasets have become available, it is good practice to split the dataset.


Classification and Regression Trees
Economists would typically use a generalized linear model like a logit or probit for a classification problem, but these can draw only a linear decision boundary! We want to build a non-linear classifier.

Trees tend to work well for problems where there are important nonlinearities and
interactions.

Decision Tree Partition Plot


The tree above is constructed using just 2 features: age and class of travel.
The class (lived or died) mentioned on the leaves is the majority class.
Let's see how to interpret it and make a prediction:

age: 34, class: 1 → predicted outcome: lived

Training accuracy: 723/1046 = 69.12%

The paper mentions: "The rule fits the data reasonably well, misclassifying about 30 percent of the observations in the testing set."
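
A sketch of how such a tree is grown in R with the rpart package. The data frame `titanic` and its column names (`survived`, `age`, `class`) are assumptions; the paper's own data and code are linked at the end.

```r
# Grow and inspect a classification tree on the two features used above.
library(rpart)

fit <- rpart(survived ~ age + class, data = titanic, method = "class")
plot(fit); text(fit)   # produces a partition plot like the one above

# Predict for a 34-year-old first-class passenger (level name assumed):
predict(fit, newdata = data.frame(age = 34, class = "1st"), type = "class")

# In-sample (training) accuracy, 723/1046 ≈ 69% in the paper's fit:
mean(predict(fit, type = "class") == titanic$survived)
```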


Decision Tree vs Logistic Regression
By constructing a tree relating survival to age alone, the rule generated is "survive if age < 8.5 years", while Logistic Regression produces:

Logistic Regression of Survival versus Age

The logistic model says age is barely important.
age

This is resolved as follows: in the diagram below, survival rates for the youngest passengers were relatively high, and survival rates for older passengers were relatively low. So what mattered for survival is not so much age, but whether the passenger was a child or elderly. It would be difficult to discover this pattern from a logistic regression alone.
Titanic Survival Rates by Age Group
Trees also handle missing data well.
Perlich, Provost, and Simonoff (2003) examined several standard datasets and found that "logistic regression is better for smaller data sets and tree induction for larger data sets."
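
A sketch of the comparison in R, using the same assumed `titanic` data frame: the logistic regression must fit one monotone curve in age, while the tree may split the age range wherever that reduces the loss.

```r
# Logistic regression: a single coefficient summarizes the age effect,
# which comes out small because the true effect is not monotone.
logit_fit <- glm(survived ~ age, family = binomial, data = titanic)
summary(logit_fit)

# Tree on the same single feature: free to find "age < 8.5" style splits.
library(rpart)
tree_fit <- rpart(survived ~ age, data = titanic, method = "class")
print(tree_fit)
```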

Pruning
We can keep branching the tree further and get better training accuracy, but this will simply overfit.

The solution to overfitting is to add a cost to complexity.

One measure of complexity in a tree is the number of leaves; another is the depth of the tree.

The complexity penalty is typically chosen using 10-fold cross-validation.

"Some researchers recommend being a bit more aggressive and advocate choosing the
complexity parameter that is one standard deviation lower than the loss-minimizing
value."
Statistical Method: Conditional Inference Tree (ctree)

ctree chooses the structure of the tree using a sequence of hypothesis tests.


The resulting trees tend to need very little pruning (Hothorn, Hornik, and Zeileis 2006).

Conditional Inference Tree


sibsp: number of siblings plus spouses aboard.
One might summarize this tree by the following principle: "women and children first . . . particularly if they were traveling first class."
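
A sketch with the party package, which the mortgage example below also uses; the column names (`sex`, `sibsp`, etc.) are assumed to match the paper's dataset.

```r
# ctree picks each split with a permutation-based hypothesis test and
# stops splitting when no test is significant, so little pruning is needed.
library(party)

ct <- ctree(survived ~ age + sex + class + sibsp, data = titanic)
plot(ct)  # a conditional inference tree like the figure above
```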

An Economic Example Using Home Mortgage Disclosure Act Data

Question:
"Did race play a significant role in determining who was approved for a mortgage?"

Logistic Regression:
The result of the logistic regression: the coefficient of race showed a statistically significant negative impact on the probability of getting a mortgage for black applicants, which later prompted considerable debate and discussion.
Ctree:
2,380 observations of 12 predictors; used the R package party.

Model                 Misclassified examples   Error rate
Logistic Regression   228                      9.6%
ctree                 225                      9.5%

Conditional Inference Tree for House Mortgage data


dmi (denied mortgage insurance): this variable alone explains much of the variation in the data; see the figure.
The race variable ("black") shows up far down the tree.
So how do we infer whether race is decisive?
When race is not used as a feature to construct the ctree, the accuracy doesn't change at all.
But it's possible that there is
racial discrimination elsewhere in the mortgage process,
or that some of the variables included are highly correlated with race.
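
A sketch of that accuracy check, assuming a data frame `hmda` whose columns include the outcome `deny` and a race indicator `black` (the real dataset's variable names may differ).

```r
library(party)

# Fit the tree with and without the race variable...
with_race    <- ctree(deny ~ ., data = hmda)
without_race <- ctree(deny ~ ., data = hmda[, names(hmda) != "black"])

# ...and compare in-sample error rates (about 9.5% in the paper).
mean(predict(with_race)    != hmda$deny)
mean(predict(without_race) != hmda$deny)
```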
Boosting, Bagging, Bootstrap
Adding randomness turns out to be a helpful way of dealing with the overfitting
problem.
Bootstrap: choosing (with replacement) a sample of size n from a dataset to estimate the sampling distribution of some statistic. A variation is the m-out-of-n bootstrap (n > m).
Bagging : Averaging across models estimated with several different bootstrap samples in
order to improve the performance of an estimator.
Boosting : Repeated estimation where misclassified observations are given increasing
weight in each repetition. The final estimate is then a vote or an average across the
repeated estimates.
Econometricians are well-acquainted with the bootstrap but rarely use the other two methods.
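
A base-R sketch of the first two ideas on the assumed `titanic` data; boosting, which reweights misclassified observations each round, is omitted here (Friedman's gradient boosting, mentioned below, is one implementation).

```r
library(rpart)
set.seed(1)
n <- nrow(titanic)

# Bootstrap: one resample of size n, drawn with replacement.
boot_idx <- sample(n, n, replace = TRUE)

# Bagging: majority vote over trees fit to B bootstrap samples.
B <- 100
votes <- replicate(B, {
  idx <- sample(n, n, replace = TRUE)
  fit <- rpart(survived ~ age + class, data = titanic[idx, ],
               method = "class")
  as.character(predict(fit, newdata = titanic, type = "class"))
})
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == titanic$survived)  # bagged in-sample accuracy
```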

Random Forests
Algorithm
1. Choose a bootstrap sample of the observations and start to grow a tree.
2. At each node of the tree, choose a random sample of the predictors to make the next
decision. Do not prune the trees.
3. Repeat this process many times to grow a forest of trees.
4. In order to determine the classification of a new observation, have each tree make a classification and use a majority vote for the final prediction.

This method produces surprisingly good out-of-sample fits, particularly with highly
nonlinear data.
Howard and Bowles (2012) claim "ensembles of decision trees (often known as 'Random Forests') have been the most successful general-purpose algorithm in modern times."
There are a number of variations and extensions of the basic "ensemble of trees" model, such as Friedman's "Stochastic Gradient Boosting" (Friedman 2002).
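
A sketch using the randomForest package, one standard R implementation of this algorithm; the package choice and column names are assumptions.

```r
library(randomForest)

# Each tree sees a bootstrap sample of the data and a random subset of
# predictors at each split (mtry); trees are grown without pruning.
rf <- randomForest(survived ~ age + sex + class + sibsp,
                   data = titanic, ntree = 500, na.action = na.omit)
print(rf)        # confusion matrix and out-of-bag error estimate
importance(rf)   # which predictors the forest relies on
```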
To replicate the experiments: the author's datasets and R code, used to build the decision trees and random forests, can be found here.


References
Images:
Overfitting and Underfitting: https://in.mathworks.com/discovery/overfitting.html
All other images are taken from Big Data: New Tricks for Econometrics.
