Harshit
Overfitting / High Variance: The model picks up noise in the data, performs well on the training data but generalises poorly.
Solving Overfitting
1. Regularization: Penalize models for excessive complexity, as simpler models tend to generalise better out of sample.
2. Out-of-sample evaluation: Use held-out data to evaluate how well your chosen model performs. (Often the validation and testing sets are combined.)
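To make the regularization idea concrete, here is a minimal sketch using the glmnet package, which penalizes large coefficients; this is my own illustration rather than something the paper prescribes, and the data frame df with a 0/1 outcome y is hypothetical.

```r
# Illustrative sketch only: penalized (regularized) logistic regression with glmnet.
# `df` and its outcome column `y` are hypothetical names.
library(glmnet)

x <- model.matrix(y ~ ., data = df)[, -1]   # predictor matrix (drop intercept column)
y <- df$y                                   # 0/1 outcome

# cv.glmnet chooses the penalty strength lambda by cross-validation;
# a larger lambda penalizes complexity more heavily and yields a simpler model.
fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # alpha = 1 -> lasso penalty
coef(fit, s = "lambda.min")                              # coefficients at the chosen penalty
```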
Tuning the model: k-fold cross-validation
Algorithm:
1. Divide the data into k roughly equal subsets (folds) and label them by s = 1, ..., k.
Notice: We test on the fold we didn't use for training, so it gives us an idea of out-of-sample performance. Even if there is no tuning parameter, it is prudent to use cross-validation to report goodness-of-fit.
Another use case: when the dataset is small and you don't want to split it into training and testing sets. Economists used to have the excuse that we had small datasets, but now that larger datasets have become available, that excuse no longer holds.
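As a concrete illustration of the algorithm above, here is a minimal hand-rolled k-fold cross-validation loop in R; the logit model, the data frame df, and the 0/1 outcome y are placeholder choices, not taken from the paper.

```r
# Illustrative sketch only: k-fold cross-validation for a logit classifier.
# `df` is a hypothetical data frame whose outcome column `y` is coded 0/1.
set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(df)))   # label each row with a fold 1..k

cv_error <- sapply(1:k, function(s) {
  train <- df[folds != s, ]                        # fit on the k-1 other folds
  test  <- df[folds == s, ]                        # hold out fold s
  fit   <- glm(y ~ ., data = train, family = binomial)
  pred  <- predict(fit, newdata = test, type = "response") > 0.5
  mean(pred != test$y)                             # misclassification rate on fold s
})

mean(cv_error)   # average out-of-sample loss across the k folds
```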
Linear models such as the logit can handle a classification problem, but they can only draw a linear decision boundary! We want to build a non-linear classifier.
Trees tend to work well for problems where there are important nonlinearities and
interactions.
The paper mentions: "The rule fits the data reasonably well, misclassifying about 30% of the observations."
Pruning
We can keep branching the tree further and get better training accuracy, but this will simply overfit.
The solution to overfitting is to add a cost to complexity.
One measure of complexity in a tree is the number of leaves; another is the depth of the tree.
The complexity cost is typically chosen using 10-fold cross-validation.
"Some researchers recommend being a bit more aggressive and advocate choosing the
complexity parameter that is one standard deviation lower than the loss-minimizing
value."
Statistical method: Conditional Inference Tree (ctree)
Logistic Regression:
Result of the logistic regression: the coefficient on race showed a statistically significant negative impact on the probability of getting a mortgage for black applicants, which later prompted considerable debate and discussion.
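For reference, a logit of this kind might be fit as below; the data frame hmda and its column names (deny as the outcome, a race indicator among the predictors) are assumed for illustration, not taken from the paper.

```r
# Illustrative sketch only: logit model of mortgage denial.
# `hmda` and its columns (e.g. `deny`) are hypothetical names.
logit_fit <- glm(deny ~ ., data = hmda, family = binomial)
summary(logit_fit)   # inspect the sign and significance of the race coefficient
```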
Ctree:
2,380 observations of 12 predictors, using the R package party.
When race is not used as a feature to construct the ctree, accuracy doesn't change at all.
But it's possible that there is
racial discrimination elsewhere in the mortgage process,
or that some of the variables included are highly correlated with race.
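A minimal sketch of that comparison with party::ctree, assuming a data frame hmda with a factor outcome deny and a race indicator column black (hypothetical names): fit the tree with and without the race column and compare accuracy.

```r
# Illustrative sketch only: conditional inference tree with and without race.
# `hmda`, `deny`, and `black` are hypothetical names.
library(party)

no_race <- setdiff(names(hmda), "black")

tree_all    <- ctree(deny ~ ., data = hmda)
tree_norace <- ctree(deny ~ ., data = hmda[, no_race])

# In-sample accuracy with and without race as a predictor
mean(predict(tree_all) == hmda$deny)
mean(predict(tree_norace) == hmda$deny)
```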
Boosting, Bagging, Bootstrap
Adding randomness turns out to be a helpful way of dealing with the overfitting
problem.
Bootstrap: choosing (with replacement) a sample of size n from a dataset to estimate the sampling distribution of some statistic.
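A minimal sketch of the resampling step in R, using a hypothetical data frame df and taking the mean of a column x as the statistic of interest.

```r
# Illustrative sketch only: bootstrap estimate of the sampling distribution of a mean.
# `df` and its column `x` are hypothetical names.
boot_means <- replicate(1000, {
  idx <- sample(nrow(df), replace = TRUE)   # sample of size n, with replacement
  mean(df$x[idx])
})
sd(boot_means)   # bootstrap standard error of the mean
```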
Random Forests
Algorithm
1. Choose a bootstrap sample of the observations and start to grow a tree.
2. At each node of the tree, choose a random sample of the predictors to make the next
decision. Do not prune the trees.
3. Repeat this process many times to grow a forest of trees.
4. In order to determine the classification of a new observation, have each tree make a classification and use a majority vote for the final prediction.
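A minimal sketch with the randomForest package, one common implementation of this algorithm; the data frame df, factor outcome y, and new_obs are assumed names.

```r
# Illustrative sketch only: random forest classifier.
# `df` (with factor outcome `y`) and `new_obs` are hypothetical names.
library(randomForest)

rf <- randomForest(y ~ ., data = df,
                   ntree = 500,   # number of bootstrap samples / trees in the forest
                   mtry  = 3)     # predictors sampled at random at each node

rf$err.rate[500, "OOB"]          # out-of-bag error: an estimate of out-of-sample performance
predict(rf, newdata = new_obs)   # majority vote of the 500 trees
```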
This method produces surprisingly good out-of-sample fits, particularly with highly
nonlinear data.
Howard and Bowles (2012) claim “ensembles of decision trees (often known as
‘Random Forests’) have been the most successful general-purpose algorithm in modern
times.”
There are a number of variations and extensions of the basic “ensemble of trees” model
such as Friedman’s “Stochastic Gradient Boosting” (Friedman 2002).
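For completeness, a hedged sketch with the gbm package, which implements Friedman's stochastic gradient boosting; the data frame df with a 0/1 outcome y, and all the tuning values, are illustrative assumptions.

```r
# Illustrative sketch only: stochastic gradient boosting with gbm.
# `df` is a hypothetical data frame with a 0/1 outcome `y`.
library(gbm)

boost <- gbm(y ~ ., data = df,
             distribution = "bernoulli",
             n.trees = 1000,
             interaction.depth = 3,
             shrinkage = 0.01,
             bag.fraction = 0.5,   # subsampling of the data: the "stochastic" part
             cv.folds = 5)

best_iter <- gbm.perf(boost, method = "cv")   # pick the number of trees by cross-validation
head(predict(boost, newdata = df, n.trees = best_iter, type = "response"))
```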
To replicate the experiments:
The author’s datasets and R code can be found here.