Chapter 09 CART-3


Chapter 9 – Classification and Regression Trees

Machine Learning for Business Analytics in RapidMiner
Shmueli, Bruce, Deokar & Patel
The contributions of Sambit Tripathi are gratefully acknowledged
© Galit Shmueli, Peter Bruce and Amit Deokar 2023
Trees and Rules
Goal: Classify or predict an outcome based
on a set of predictors
The output is a set of rules
Example:
●Goal: classify a record as “will accept
credit card offer” or “will not accept”
●Rule might be “IF (Income >= 106) AND
(Education < 1.5) AND (Family <= 2.5)
THEN Class = 0 (nonacceptor)”
●Also called CART, Decision Trees, or just
Trees
●Rules are represented by tree diagrams
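A tree's rules translate directly to code. A minimal sketch in Python (the field names Income, Education, and Family and the thresholds are the ones quoted in the rule above; the full tree would supply rules for the other branches):

```python
def classify_offer(income, education, family):
    """Apply the single rule quoted above:
    IF (Income >= 106) AND (Education < 1.5) AND (Family <= 2.5)
    THEN Class = 0 (nonacceptor)."""
    if income >= 106 and education < 1.5 and family <= 2.5:
        return 0  # nonacceptor
    return None   # other branches of the tree would supply the remaining rules

# Example: a household with Income = 120, Education = 1, Family = 2
print(classify_offer(120, 1, 2))  # -> 0 (nonacceptor)
```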
Example Tree: Classify Bank
Customers as Loan Acceptors (Y/N)
• Terminal (Leaf) Nodes –
acceptor/nonacceptor
• Splitting Node – Predictor
name with split condition
• Decision Rules:
• IF (CCAvg <= 8.9) AND
(Mortgage > 29.5) AND
(Mortgage <= 626) AND
(Family > 1.5) THEN
Class = acceptor
• Classifying a new record:
• The record is dropped down the tree
• When it reaches a leaf node, assign the class by majority vote of the training records in that leaf node
How Is the Tree Produced?
Recursive partitioning: Repeatedly split
the records into two parts so as to
achieve maximum homogeneity of
outcome within each new part

Stopping Tree Growth: A fully grown tree is too complex and will overfit
Recursive Partitioning
Recursive Partitioning Steps
●Pick one of the predictor variables, xi
●Pick a value of xi, say si, that divides the training
data into two (not necessarily equal) portions
●Measure how “pure” or homogeneous each of
the resulting portions is
“Pure” = containing records of mostly one class (or, for
prediction, records with similar outcome values)
●The algorithm tries different values of xi and si to maximize purity in the initial split
●After you get a “maximum purity” split, repeat
the process for a second split (on any variable),
and so on
Example: Riding Mowers

●Goal: Classify 24 households as owning or not owning riding mowers
●Predictors = Income, Lot Size
How to split
●Order records according to one variable, say
income
●Take a predictor value, say 59.7 (the first record)
and divide records into those with income >=
59.7 and those < 59.7
●Measure resulting purity (homogeneity) of class
in each resulting portion
●Try all other split values
●Repeat for other variable(s)
●Select the one variable & split that yields the
most purity
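A rough sketch of this search in Python. The income values below are made up for illustration, and purity is measured simply as the proportion of the majority class in each portion; the real algorithm repeats this scan for every predictor and every candidate split value:

```python
# Hypothetical training data: (income, owner?) pairs; values are illustrative only
records = [(59.7, 0), (63.0, 0), (75.0, 1), (84.75, 0), (90.0, 1), (110.0, 1)]

def purity(portion):
    """Proportion of the majority class in a portion (1.0 = perfectly pure)."""
    if not portion:
        return 1.0
    ones = sum(label for _, label in portion)
    return max(ones, len(portion) - ones) / len(portion)

best = None
for split_value, _ in records:                       # try each observed value as a split point
    left = [r for r in records if r[0] < split_value]
    right = [r for r in records if r[0] >= split_value]
    # weight each portion's purity by its share of the records
    score = (len(left) * purity(left) + len(right) * purity(right)) / len(records)
    if best is None or score > best[0]:
        best = (score, split_value)

print("best split on income:", best[1], "weighted purity:", round(best[0], 3))
```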
Note: Categorical Variables
●Examine all possible ways in which the categories can be split.
●E.g., categories A, B, C can be split 3 ways:

{A} and {B, C}
{B} and {A, C}
{C} and {A, B}

●With many categories, the number of possible splits becomes huge (2^(c-1) - 1 binary splits for c categories)
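A quick way to see the growth is to enumerate the binary splits of a category set in Python (a sketch; the category letters are just placeholders):

```python
from itertools import combinations

def binary_splits(categories):
    """All ways to split a set of categories into two non-empty groups."""
    cats = list(categories)
    splits = []
    for size in range(1, len(cats) // 2 + 1):
        for group in combinations(cats, size):
            rest = tuple(c for c in cats if c not in group)
            # avoid counting {A}|{B,C} and {B,C}|{A} twice when the groups are equal in size
            if len(group) == len(rest) and group > rest:
                continue
            splits.append((group, rest))
    return splits

print(binary_splits("ABC"))            # 3 splits, matching the example above
print(len(binary_splits("ABCDEFGH")))  # 127 splits for 8 categories (2**7 - 1)
```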
First Split: Income = 84.75 (mostly non-owners to the left, mostly owners to the right)
Second Split: Lot size = 21.4
After All Splits
Measuring Impurity
Gini Index
Gini index for rectangle A:

I(A) = 1 - Σ p_k²  (summing over the m classes, k = 1, ..., m)

p_k = proportion of cases in rectangle A that belong to class k (out of m classes)

●I(A) = 0 when all cases belong to the same class
●Maximum value when all classes are equally represented (= 0.50 in the binary case)
Entropy
Entropy for rectangle A:

entropy(A) = -Σ p_k log2(p_k)  (summing over the m classes, k = 1, ..., m)

p_k = proportion of cases in rectangle A that belong to class k (out of m classes)

●Entropy ranges between 0 (most pure) and log2(m) (equal representation of classes)
Calculating Impurity - Example
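As a concrete illustration (the class counts here are made up, not taken from the book's example), both impurity measures for a rectangle with 12 owners and 4 non-owners:

```python
from math import log2

def gini(counts):
    """Gini index: I(A) = 1 - sum(p_k**2)."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum(p_k * log2(p_k)), taking 0*log2(0) = 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(gini([12, 4]))     # 1 - (0.75**2 + 0.25**2) = 0.375
print(entropy([12, 4]))  # -(0.75*log2(0.75) + 0.25*log2(0.25)) ≈ 0.811
print(gini([8, 8]), entropy([8, 8]))  # maximum impurity for 2 classes: 0.5 and 1.0
```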
Impurity and Recursive Partitioning

●Obtain an overall impurity measure (weighted avg. of individual rectangles)
●At each successive stage, compare this measure across all possible splits in all variables
●Choose the split that reduces impurity the most
●Chosen split points become nodes on the tree

Splitting in RapidMiner

After first split:


Tree After 3 Splits

The first split is on Income, then the next split is on Lot Size, then the third split
is on Income again
Tree After All Splits
(Fully grown tree)

What is the classification for Income <= 84.75 and Lot_Size > 21.2?

Answer: “Owner”
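RapidMiner builds this tree through its GUI operators; for readers who prefer code, a rough Python/scikit-learn sketch that grows a full tree on riding-mower-style data and prints its rules. The tiny dataset below is illustrative, not the book's 24 records:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative riding-mower-style data (Income in $000s, Lot_Size in 000s sq ft)
df = pd.DataFrame({
    "Income":   [60.0, 75.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0],
    "Lot_Size": [18.4, 19.6, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0],
    "Ownership": ["Nonowner", "Nonowner", "Nonowner", "Owner", "Nonowner",
                  "Owner", "Owner", "Nonowner", "Owner", "Nonowner"],
})

# criterion="gini" and no depth limit -> a fully grown tree, as in the slides
tree = DecisionTreeClassifier(criterion="gini", random_state=1)
tree.fit(df[["Income", "Lot_Size"]], df["Ownership"])

# The fitted tree printed as IF/THEN-style rules
print(export_text(tree, feature_names=["Income", "Lot_Size"]))

# Classify a new record by dropping it down the tree
print(tree.predict(pd.DataFrame({"Income": [80.0], "Lot_Size": [21.5]})))
```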
Overfitting

●Full trees are complex and overfit the data
●The natural end of the process is 100% purity in each leaf
●This overfits the data: the tree ends up fitting noise in the data
●Consider Example 2, Loan Acceptance, which has more records and more variables than the Riding Mower data – the full tree is very complex
Full trees are too complex – they end up
fitting noise, overfitting the data
• Universal Bank loan example - full tree shown below
• Goal: Model customer behavior to predict the acceptance of
a personal loan
• Data: 5000 records
• Input: Customer demographic information (age, income, etc.) and the customer's relationship with the bank
• Output: Response to personal loan campaign (yes/no)
RapidMiner Process
Results
Full tree is 100% accurate on training data.
Overfitting & instability produce poor predictive
performance – past a certain point in tree
complexity, the error rate on new data starts to
increase
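One way to see this effect in code (a sketch with synthetic data, not the Universal Bank file): grow trees of increasing depth and compare training vs. validation accuracy. Past some depth, training accuracy keeps rising while validation accuracy stalls or drops.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data: 5000 records, noisy binary outcome
X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.4,
                                                      random_state=1)

for depth in [2, 4, 8, 16, None]:          # None = fully grown tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train acc={tree.score(X_train, y_train):.3f}, "
          f"valid acc={tree.score(X_valid, y_valid):.3f}")
```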
Trees can also be unstable

● If 2 or more variables are of roughly equal importance, which one CART chooses for the first split can depend on the initial partition into training and validation
● A different partition into training/validation could
lead to a different initial split
● This can cascade down and produce a very
different tree from the first training/validation
partition
● Solution is to try many different
training/validation splits – “cross validation”
Limiting tree complexity - grid search

• Parameter Tuning
• Perform grid search over combinations of
parameter values
• Find the combination that leads to lowest error
(or highest accuracy)
• Use cross validation to find the optimal
parameter values
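RapidMiner does this with the Optimize Parameters operator; a rough scikit-learn analogue is GridSearchCV. The grid values below are illustrative, and the synthetic data stands in for the loan file:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, as in the earlier overfitting sketch
X, y = make_classification(n_samples=5000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=1)

# Grid of tree-complexity parameters to try (values are illustrative)
param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [1, 5, 20, 50],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross validation over every combination; keep the most accurate one
search = GridSearchCV(DecisionTreeClassifier(random_state=1),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```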
Use “Optimize Parameters”
Using “Optimize Parameters,” cont.
Another way to limit tree complexity -
pruning

● Let the tree grow on training data


● Cut off (prune) branches that do not
sufficiently improve the classification error
● RapidMiner uses a “pessimistic error” for a decision node:

pessimistic error = e + L(T)/(2n)

where e is the fraction misclassified in the node, n is the # of records, and L(T) is the # of leaves in subtree T (the pruning candidate)
● The subtree is pruned unless it improves the error by one std. dev. over the adjusted error of the decision node
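A small sketch of the adjustment described above (this only evaluates the formula with hypothetical numbers; RapidMiner's own pruning logic involves more bookkeeping):

```python
def pessimistic_error(e, n, n_leaves):
    """Adjusted error for a pruning candidate: e + L(T) / (2n),
    where e = fraction misclassified, n = # records, L(T) = # leaves in the subtree."""
    return e + n_leaves / (2 * n)

# Hypothetical node: 5% misclassified, 200 records, subtree with 6 leaves
print(pessimistic_error(0.05, 200, 6))   # 0.05 + 6/400 = 0.065
```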
Regression Trees
Regression Tree is Similar, Except...
●The prediction is computed as the average of the numerical target variable in the rectangle (in a classification tree it is the majority vote)
●Impurity is measured by the sum of squared deviations from the leaf mean
●Performance is measured by RMSE (root mean squared error)
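A minimal regression-tree sketch in Python with synthetic data: the structure mirrors the classification case, but with leaf averages as predictions and RMSE as the performance measure.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic numeric target driven by two predictors plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 2))
y = 3 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

reg = DecisionTreeRegressor(max_depth=4, random_state=1)  # leaf prediction = mean of y in leaf
reg.fit(X_train, y_train)

rmse = mean_squared_error(y_test, reg.predict(X_test)) ** 0.5
print("RMSE:", round(rmse, 3))
```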
Ensemble Trees
• Predictions from many trees are combined, harnessing the
“wisdom of the crowd”
• Very good predictive performance, better than single trees
(often the top choice for predictive modeling)
• Cost: Loss of rules you can explain and implement (since you are dealing with many trees, not a single tree)
• However, Random Forest does produce “variable importance scores” (using information about how much each predictor reduces Gini impurity over all the trees in the forest)
Ensemble Variants
Random Forest
• Draw multiple samples, with replacement, from data
(sampling is called bootstrap)
• Using a random subset of predictors at each stage,
fit a decision tree to each sample
• Combine the predictions from individual trees to
obtain improved predictions.
• Use voting for classification and averaging for
regression

Use Random Forest operator in RapidMiner
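A rough scikit-learn equivalent of the Random Forest operator, including the variable importance scores mentioned above (a built-in scikit-learn dataset is used here as a stand-in for the loan data):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Built-in dataset as a stand-in for the loan data
data = load_breast_cancer()
X, y = data.data, data.target

# 500 bootstrap samples, a random subset of predictors considered at each split
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=1)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))

# Variable importance: how much each predictor reduces impurity across the forest
rf.fit(X, y)
for name, score in sorted(zip(data.feature_names, rf.feature_importances_),
                          key=lambda t: -t[1])[:5]:
    print(f"{name}: {score:.3f}")
```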


Ensemble Variants
Boosted Trees
Uses an iterative approach in which each successive tree focuses its attention on the records misclassified by the prior tree.
1. Fit a single tree
2. Draw a bootstrap sample of records with higher selection
probability for misclassified records
3. Fit a new tree to a new bootstrap sample
4. Repeat steps 2 & 3 multiple times
5. Use weighted voting (classification) or averaging (prediction) with
heavier weights for later trees

The idea: focus the learning process on the hard-to-classify records

Use Gradient Boosted Trees operator in RapidMiner
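A corresponding code sketch using scikit-learn's GradientBoostingClassifier (gradient boosting rather than the reweighting steps listed above, but the same idea of focusing successive trees on the errors of earlier ones; the dataset is again a built-in stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in data

# Each shallow tree is fit to the errors left by the ensemble so far
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=1)
print("CV accuracy:", cross_val_score(gbt, X, y, cv=5).mean().round(3))
```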


Random Forest - RapidMiner Process
(after data loading and preprocessing)
Boosted Trees

● Especially useful for the “rare case” scenario (suppose 1’s are the rare class)
● With simple classifiers, it can be hard for a
“1” to “break out” from the dominant
classification, & many get misclassified
● Up-weighting them focuses the tree fitting on
the 1’s, and reduces the dominating effect of
the 0’s
Boosted Trees - RapidMiner Process
(after data loading and preprocessing)
Advantages of trees
●Easy to use, understand
●Single trees produce rules that are easy to
interpret & implement
●Variable selection & reduction is automatic
●Do not require the assumptions of statistical
models
●Can work without extensive handling of
missing data

Disadvantage of single trees: instability and poor predictive performance
Summary
●Classification and Regression Trees are an
easily understandable and transparent
method for predicting or classifying new
records
●A single tree is a graphical representation
of a set of rules
●Tree growth must be stopped to avoid
overfitting of the training data
●Ensembles (random forests, boosting)
improve predictive performance, but you
lose interpretability and the rules embodied
in a single tree
