XGBoost Tuning
. . .
To overcome this, Tianqi Chen and Carlos Guestrin built A Scalable Tree Boosting
System — XGBoost can be thought of as Gradient Boosting on steroids. It features
parallelized tree building, cache-aware access, sparsity awareness, regularization, and
weighted quantile sketch as some of its systems optimization and algorithmic
enhancements.
To walk you through XGBoost and its hyperparameters, we’ll build a simple
classification model using the Fashion MNIST dataset.
• 0 T-shirt/top
• 1 Trouser
• 2 Pullover
• 3 Dress
• 4 Coat
• 5 Sandal
• 6 Shirt
• 7 Sneaker
• 8 Bag
• 9 Ankle boot
Each instance (image) in the dataset has 784 features (one for each pixel) and the
value in each feature ranges from 0 to 255, hence we’ll use Scikit-Learn’s StandardScaler
to rescale the values to a smaller range with mean zero and unit variance.
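The loading and scaling step might look something like the sketch below; pulling Fashion-MNIST from OpenML and the 80/20 split are assumptions, since only the scaled arrays (X_train, X_test, target_train, target_test) are used later on.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: Fashion MNIST fetched from OpenML (70,000 images, 784 pixel features each)
X, y = fetch_openml("Fashion-MNIST", version=1, as_frame=False, return_X_y=True)
y = y.astype(int)

# Assumption: hold out 20% of the data as a test set, stratified by class label
X_train, X_test, target_train, target_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Rescale the 0-255 pixel values to zero mean and unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)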
Baseline Model
We are going to use an XGBClassifier to predict the label of a given image. Building
an XGBClassifier is pretty straightforward. We’ll start with a simple baseline model
and move on from there.
For this first iteration, we’ll only specify one hyperparameter, objective , and set it to
“multi:softmax” . The objective is part of the Learning Task hyperparameters; it
specifies the learning task (regression, classification, ranking, etc.) and the corresponding
objective function to be used.
Let’s backtrack for a second. What are hyperparameters? They are the parameters
that must be set before training a model because they cannot be learned by the training
algorithm itself. They control the behavior of the training algorithm and have a high impact on
the performance of a model. The typical metaphor goes like this: hyperparameters are
the knobs one turns to tweak a machine learning model. They are essential to
optimization and to improve evaluation metrics.
And now, let’s take a look at our model and the results.
import xgboost as xgb
from sklearn.metrics import accuracy_score, f1_score

# Create XGB Classifier object
xgb_clf = xgb.XGBClassifier(objective="multi:softmax")

# Fit model
xgb_model = xgb_clf.fit(X_train, target_train)

# Predictions
y_train_preds = xgb_model.predict(X_train)
y_test_preds = xgb_model.predict(X_test)

# Print F1 scores and Accuracy
print("Training F1 Micro Average: ", f1_score(target_train, y_train_preds, average="micro"))
print("Test F1 Micro Average: ", f1_score(target_test, y_test_preds, average="micro"))
print("Test Accuracy: ", accuracy_score(target_test, y_test_preds))
Not too shabby! But we can try to beat these first scores with some tweaking and knob-
turning.
The million dollar question, in this case, is what to tune and how? There are no
benchmarks as to what the ideal hyperparameters are since these will depend on your
specific problem, your data, and what you’re optimizing for. But once you understand
the concepts of even the most obscure hyperparameters you’ll be well on your way to
tuning like a boss!
We start by creating an XGBClassifier object just as we did for the baseline model, but this
time we pass the hyperparameters tree_method , predictor , verbosity and eval_metric , in addition
to the objective , directly to the constructor rather than through the parameter grid we’ll hand to
RandomizedSearchCV. The first two give us access to GPU capabilities, verbosity gives us
visibility into what the model is doing in real time, and eval_metric is the evaluation metric to
be used on validation data; as you can see below, you can pass multiple evaluation metrics in
the form of a Python list .
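A minimal sketch of that constructor call looks like the following; the verbosity level and the specific metrics in the list are assumptions rather than the exact values used.

import xgboost as xgb

# Sketch of the constructor described above; the metric list and verbosity
# level are illustrative assumptions.
xgb_clf = xgb.XGBClassifier(
    objective="multi:softmax",
    tree_method="gpu_hist",              # GPU-accelerated histogram tree construction
    predictor="gpu_predictor",           # run prediction on the GPU as well
    verbosity=2,                         # print progress information while training
    eval_metric=["merror", "mlogloss"],  # multiple metrics passed as a Python list
)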
Unfortunately, using tree_method and predictor we kept getting the same error over and
over. You can track the status of this error here.
Kernel error:
In: /workspace/include/xgboost/./../../src/common/span.h, line: 489
T &xgboost::common::Span<T, Extent>::operator[](long) const [with T =
xgboost::detail::GradientPairInternal<float>, Extent = -1L]
Expecting: _idx >= 0 && _idx < size()
terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: unspecified launch failure
Given that we couldn’t fix the root issue (pun intended), the model had to be run on
CPU. So the final code for our RandomizedSearchCV looked like this:
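The parameter distributions below are a representative sketch rather than the exact grid: they are assumptions built around the hyperparameters discussed next (learning_rate, gamma and max_depth), with the search running on the CPU hist tree method.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

# Assumed search space; the distributions are illustrative, not the original grid
param_dist = {
    "learning_rate": np.linspace(0.01, 0.3, 10),
    "gamma": [0, 0.25, 0.5, 1.0],
    "max_depth": [3, 5, 7, 9],
}

# CPU-based classifier after the GPU error above
xgb_clf = xgb.XGBClassifier(
    objective="multi:softmax",
    tree_method="hist",
    verbosity=1,
    eval_metric=["merror", "mlogloss"],
)

random_search = RandomizedSearchCV(
    estimator=xgb_clf,
    param_distributions=param_dist,
    n_iter=20,             # number of sampled hyperparameter combinations (assumed)
    scoring="f1_micro",
    cv=3,
    verbose=2,
    random_state=42,
)

random_search.fit(X_train, target_train)
print("Best parameters: ", random_search.best_params_)
print("Best CV score: ", random_search.best_score_)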
Our accuracy score went up by 2%! Not bad for a model that was running for over 60
hours!
• learning_rate : to start, let’s clarify that this learning rate is not the same as the one in
gradient descent. In gradient boosting, the learning rate lessens the effect each
additional tree has on the model. In their paper, A Scalable Tree Boosting System,
Tianqi Chen and Carlos Guestrin refer to this regularization technique as shrinkage,
and it is an additional method to prevent overfitting. The lower the learning rate, the
more robust the model will be against overfitting.
• gamma : mathematically, this is known as the Lagrangian multiplier, and its purpose
is complexity control. It is a pseudo-regularization term for the loss function; it sets
the minimum loss reduction a candidate split must achieve in order for that split to
happen.
• max_depth : refers to the depth of a tree. It sets the maximum number of nodes that
can exist between the root and the farthest leaf. Remember that deeper trees are
prone to overfitting.
But wait! There’s more. There are other hyperparameters you can modify, depending
on the problem you are trying to solve or what you’re trying to optimize for:
• booster : allows you to choose which booster to use: gbtree , gblinear or dart .
We’ve been using gbtree , but dart and gblinear also have their own additional
hyperparameters to explore.
• scale_pos_weight : balances negative and positive weights, and should definitely be
used when the data presents a high class imbalance (a common way to set it is
sketched after this list).
• base_score : global bias. This parameter is useful when dealing with high class
imbalance.
• max_delta_step : sets the maximum absolute value possible for the weights. Also
useful when dealing with unbalanced classes.
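As an aside, a common heuristic for scale_pos_weight in a binary problem is the ratio of negative to positive instances. The snippet below is an illustrative sketch with a made-up binary target, not something from our (balanced) Fashion MNIST setup.

import numpy as np
import xgboost as xgb

# Made-up binary target for illustration: 0 = negative class, 1 = positive class
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Heuristic: number of negative examples divided by number of positive examples
ratio = float(np.sum(y == 0)) / np.sum(y == 1)

imbalanced_clf = xgb.XGBClassifier(
    objective="binary:logistic",
    scale_pos_weight=ratio,  # up-weights errors on the minority (positive) class
)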
. . .