
Develop a Bagging Ensemble with Different Data

Transformations
by Jason Brownlee on April 27, 2021 in Ensemble Learning

Bootstrap aggregation, or bagging, is an ensemble where each model is trained on a different sample
of the training dataset.

The idea of bagging can be generalized to other techniques for changing the training dataset and
fitting the same model on each changed version of the data. One approach is to use data transforms
that change the scale and probability distribution of input variables as the basis for the training of
contributing members to a bagging-like ensemble. We can refer to this as data transform bagging or a
data transform ensemble.

In this tutorial, you will discover how to develop a data transform ensemble.

After completing this tutorial, you will know:

Data transforms can be used as the basis for a bagging-type ensemble where the same
model is trained on different views of a training dataset.
How to develop a data transform ensemble for classification and confirm the ensemble
performs better than any contributing member.
How to develop and evaluate a data transform ensemble for regression predictive modeling.

Kick-start your project with my new book Ensemble Learning Algorithms With Python, including
step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Develop a Bagging Ensemble with Different Data Transformations
Photo by Maciej Kraus, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

1. Data Transform Bagging
2. Data Transform Ensemble for Classification
3. Data Transform Ensemble for Regression

Data Transform Bagging


Bootstrap aggregation, or bagging for short, is an ensemble learning technique based on the idea of
fitting the same model type on multiple different samples of the same training dataset.

The hope is that small differences in the training dataset used to fit each model will result in small
differences in the capabilities of models. For ensemble learning, this is referred to as diversity of
ensemble members and is intended to de-correlate the predictions (or prediction errors) made by
each contributing member.

Although it was designed to be used with decision trees, with each data sample made using the bootstrap method (selection with replacement), the approach has spawned a whole subfield of study with hundreds of variations on the approach.

We can construct our own bagging ensembles by changing the dataset used to train each
contributing member in new and unique ways.

One approach would be to apply a different data preparation transform to the dataset for each
contributing ensemble member.

This is based on the premise that we cannot know the representational form of the training dataset that best exposes its unknown underlying structure to the learning algorithm. This motivates
the need to evaluate models with a suite of different data transforms, such as changing the scale and
probability distribution, in order to discover what works.

This approach can be used where a suite of different transforms of the same training dataset is
created, a model trained on each, and the predictions combined using simple statistics such as
averaging.

For lack of a better name, we will refer to this as “Data Transform Bagging” or a “Data Transform
Ensemble.”
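To make the combination step concrete, here is a minimal sketch (using made-up predictions, not data from this tutorial) of combining member outputs with simple statistics: averaging for regression and a majority vote for classification.

# minimal sketch: combining member predictions with simple statistics
from numpy import array, mean, bincount

# hypothetical predictions from three members for four regression examples
member_preds = array([[10.2, 3.1, 7.7, 0.9],
                      [ 9.8, 2.9, 8.1, 1.2],
                      [10.5, 3.4, 7.5, 1.0]])
# regression: average the member predictions for each example
print(mean(member_preds, axis=0))

# hypothetical class labels from three members for four classification examples
member_labels = array([[0, 1, 1, 0],
                       [0, 1, 0, 0],
                       [1, 1, 1, 0]])
# classification: majority vote, i.e. the most frequent label per example
print([bincount(member_labels[:, i]).argmax() for i in range(member_labels.shape[1])])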

There are many transforms that we can use, but perhaps a good starting point would be a selection that changes the scale and probability distribution, such as the following (the sketch after this list maps each one to its scikit-learn class):

Normalization (fixed range)
Standardization (zero mean)
Robust Standardization (robust to outliers)
Power Transform (remove skew)
Quantile Transform (change distribution)
Discretization (k-bins)
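All of these transforms are available in scikit-learn. As a reference, the sketch below lists the classes used later in this tutorial, with the same configurations as the examples that follow.

# scikit-learn transforms matching the list above (configurations taken from the examples below)
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.preprocessing import PowerTransformer, QuantileTransformer, KBinsDiscretizer

transforms = [
    ('norm', MinMaxScaler()),       # normalization (fixed range)
    ('std', StandardScaler()),      # standardization (zero mean)
    ('robust', RobustScaler()),     # robust standardization (robust to outliers)
    ('power', PowerTransformer()),  # power transform (remove skew)
    ('quant', QuantileTransformer(n_quantiles=100, output_distribution='normal')),  # quantile transform
    ('kbins', KBinsDiscretizer(n_bins=20, encode='ordinal')),                        # discretization (k-bins)
]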

The approach is likely to be more effective when used with a base model that produces different or very different models depending on the effects of each data transform.

Changing the scale of the input variables may only make a difference for models that are sensitive to scale, such as those that calculate a weighted sum of the inputs, like logistic regression and neural networks, and those that use distance measures, like k-nearest neighbors and support vector machines.

Changes to the probability distribution for input variables would likely impact most machine learning
models.
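As a rough illustration of this difference (a minimal sketch on synthetic data, not part of the original tutorial), a scaler only changes the range of a skewed variable, whereas a power transform also reshapes its distribution.

# minimal sketch: a scale change vs. a distribution change on a skewed synthetic feature
from numpy import mean, std
from numpy.random import default_rng
from scipy.stats import skew
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# right-skewed feature
x = default_rng(1).exponential(scale=2.0, size=(1000, 1))
for name, transform in [('minmax', MinMaxScaler()), ('power', PowerTransformer())]:
    z = transform.fit_transform(x)
    # MinMaxScaler changes the range but keeps the skew; PowerTransformer removes most of the skew
    print('%s: mean=%.2f std=%.2f skew=%.2f' % (name, mean(z), std(z), skew(z.ravel())))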

Now that we are familiar with the approach, let’s explore how we can develop a data transform
ensemble for classification problems.

Want to Get Started With Ensemble Learning?
Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Data Transform Ensemble for Classification


We can develop a data transform approach to bagging for classification using the scikit-learn library.

The library provides a suite of standard transforms that we can use directly. Each ensemble member can be defined as a Pipeline, with the transform followed by the predictive model, in order to avoid any data leakage and, in turn, avoid optimistic results. Finally, a voting ensemble can be used to combine the predictions from each pipeline.

First, we can define a synthetic binary classification dataset as the basis for exploring this type of
ensemble.

The example below creates a dataset with 1,000 examples each comprising 20 input features, where
15 of them contain information for predicting the target.

# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the data arrays, confirming our expectations.

(1000, 20) (1000,)

Next, we establish a baseline on the problem using the predictive model we intend to use in our
ensemble. It is standard practice to use a decision tree in bagging ensembles, so in this case, we will
use the DecisionTreeClassifier with default hyperparameters.

We will evaluate the model using standard practices, in this case, repeated stratified k-fold cross-
validation with three repeats and 10 folds. The performance will be reported using the mean of the
classification accuracy across all folds and repeats.

The complete example of evaluating a decision tree on the synthetic classification dataset is listed
below.

# evaluate decision tree on synthetic classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = DecisionTreeClassifier()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean classification accuracy of the decision tree on the synthetic
classification dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.

In this case, we can see that the model achieved a classification accuracy of about 82.3 percent.

This score provides a baseline in performance from which we expect a data transform ensemble to
improve upon.

Mean Accuracy: 0.823 (0.039)

Next, we can develop an ensemble of decision trees, each fit on a different transform of the input
data.

First, we can define each ensemble member as a modeling pipeline. The first step will be the data
transform and the second will be a decision tree classifier.

For example, the pipeline for a normalization transform with the MinMaxScaler class would look as
follows:

...
# normalization
norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeClassifier())])

We can repeat this for each transform or transform configuration that we want to use and add all of
the model pipelines to a list.

The VotingClassifier class can be used to combine the predictions from all of the models. This class
takes an “estimators” argument that is a list of tuples where each tuple has a name and the model or
modeling pipeline. For example:

...
# normalization
norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeClassifier())])
models.append(('norm', norm))
...
# define the voting ensemble
ensemble = VotingClassifier(estimators=models, voting='hard')

To make the code easier to read, we can define a function get_ensemble() to create the members and
data transform ensemble itself.

# get a voting ensemble of models
def get_ensemble():
    # define the base models
    models = list()
    # normalization
    norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeClassifier())])
    models.append(('norm', norm))
    # standardization
    std = Pipeline([('s', StandardScaler()), ('m', DecisionTreeClassifier())])
    models.append(('std', std))
    # robust
    robust = Pipeline([('s', RobustScaler()), ('m', DecisionTreeClassifier())])
    models.append(('robust', robust))
    # power
    power = Pipeline([('s', PowerTransformer()), ('m', DecisionTreeClassifier())])
    models.append(('power', power))
    # quantile
    quant = Pipeline([('s', QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m', DecisionTreeClassifier())])
    models.append(('quant', quant))
    # kbins
    kbins = Pipeline([('s', KBinsDiscretizer(n_bins=20, encode='ordinal')), ('m', DecisionTreeClassifier())])
    models.append(('kbins', kbins))
    # define the voting ensemble
    ensemble = VotingClassifier(estimators=models, voting='hard')
    return ensemble

We can then call this function and evaluate the voting ensemble as per normal, just like we did for the
decision tree above.

Tying this together, the complete example is listed below.

# evaluate data transform bagging ensemble on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline

# get a voting ensemble of models
def get_ensemble():
    # define the base models
    models = list()
    # normalization
    norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeClassifier())])
    models.append(('norm', norm))
    # standardization
    std = Pipeline([('s', StandardScaler()), ('m', DecisionTreeClassifier())])
    models.append(('std', std))
    # robust
    robust = Pipeline([('s', RobustScaler()), ('m', DecisionTreeClassifier())])
    models.append(('robust', robust))
    # power
    power = Pipeline([('s', PowerTransformer()), ('m', DecisionTreeClassifier())])
    models.append(('power', power))
    # quantile
    quant = Pipeline([('s', QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m', DecisionTreeClassifier())])
    models.append(('quant', quant))
    # kbins
    kbins = Pipeline([('s', KBinsDiscretizer(n_bins=20, encode='ordinal')), ('m', DecisionTreeClassifier())])
    models.append(('kbins', kbins))
    # define the voting ensemble
    ensemble = VotingClassifier(estimators=models, voting='hard')
    return ensemble

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# get models
ensemble = get_ensemble()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
n_scores = cross_val_score(ensemble, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean classification accuracy of the data transform ensemble on the
synthetic classification dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.

In this case, we can see that the data transform ensemble achieved a classification accuracy of about
83.8 percent, which is a lift over using a decision tree alone that achieved an accuracy of about 82.3
percent.

Mean Accuracy: 0.838 (0.042)

Although the ensemble performed well compared to a single decision tree, a limitation of this test is
that we do not know if the ensemble performed better than any contributing member.

This is important, as if a contributing member to the ensemble performs better, then it would be
simpler and easier to use the member itself as the model instead of the ensemble.

We can check this by evaluating the performance of each individual model and comparing the results
to the ensemble.

First, we can update the get_ensemble() function to return a list of models to evaluate composed of
the individual ensemble members as well as the ensemble itself.

# get a voting ensemble of models
def get_ensemble():
    # define the base models
    models = list()
    # normalization
    norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeClassifier())])
    models.append(('norm', norm))
    # standardization
    std = Pipeline([('s', StandardScaler()), ('m', DecisionTreeClassifier())])
    models.append(('std', std))
    # robust
    robust = Pipeline([('s', RobustScaler()), ('m', DecisionTreeClassifier())])
    models.append(('robust', robust))
    # power
    power = Pipeline([('s', PowerTransformer()), ('m', DecisionTreeClassifier())])
    models.append(('power', power))
    # quantile
    quant = Pipeline([('s', QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m', DecisionTreeClassifier())])
    models.append(('quant', quant))
    # kbins
    kbins = Pipeline([('s', KBinsDiscretizer(n_bins=20, encode='ordinal')), ('m', DecisionTreeClassifier())])
    models.append(('kbins', kbins))
    # define the voting ensemble
    ensemble = VotingClassifier(estimators=models, voting='hard')
    # return a list of tuples each with a name and model
    return models + [('ensemble', ensemble)]

We can call this function and enumerate each model, evaluating it, reporting the performance, and
storing the results.

...
# get models
models = get_ensemble()
# evaluate each model
results = list()
for name, model in models:
    # define the evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model on the dataset
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    print('>%s: %.3f (%.3f)' % (name, mean(n_scores), std(n_scores)))
    results.append(n_scores)

Finally, we can plot the distribution of accuracy scores as box and whisker plots side by side and
compare the distribution of scores directly.

Visually, we would hope that the spread of scores for the ensemble skews higher than any individual
member and that the central tendency of the distribution (mean and median) are also higher than any
member.

...
# plot the results for comparison
pyplot.boxplot(results, labels=[n for n,_ in models], showmeans=True)
pyplot.show()

Tying this together, the complete example of comparing the performance of contributing members to
the performance of the data transform ensemble is listed below.

# comparison of data transform ensemble to each contributing member for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot

# get a voting ensemble of models
def get_ensemble():
    # define the base models
    models = list()
    # normalization
    norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeClassifier())])
    models.append(('norm', norm))
    # standardization
    std = Pipeline([('s', StandardScaler()), ('m', DecisionTreeClassifier())])
    models.append(('std', std))
    # robust
    robust = Pipeline([('s', RobustScaler()), ('m', DecisionTreeClassifier())])
    models.append(('robust', robust))
    # power
    power = Pipeline([('s', PowerTransformer()), ('m', DecisionTreeClassifier())])
    models.append(('power', power))
    # quantile
    quant = Pipeline([('s', QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m', DecisionTreeClassifier())])
    models.append(('quant', quant))
    # kbins
    kbins = Pipeline([('s', KBinsDiscretizer(n_bins=20, encode='ordinal')), ('m', DecisionTreeClassifier())])
    models.append(('kbins', kbins))
    # define the voting ensemble
    ensemble = VotingClassifier(estimators=models, voting='hard')
    # return a list of tuples each with a name and model
    return models + [('ensemble', ensemble)]

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# get models
models = get_ensemble()
# evaluate each model
results = list()
for name, model in models:
    # define the evaluation method
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model on the dataset
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    print('>%s: %.3f (%.3f)' % (name, mean(n_scores), std(n_scores)))
    results.append(n_scores)
# plot the results for comparison
pyplot.boxplot(results, labels=[n for n,_ in models], showmeans=True)
pyplot.show()

Running the example first reports the mean and standard deviation of the classification accuracy of each individual model, ending with the performance of the ensemble that combines the models.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.

In this case, we can see that a number of the individual members perform well, such as “kbins” that
achieves an accuracy of about 83.3 percent, and “std” that achieves an accuracy of about 83.1
percent. We can also see that the ensemble achieves better overall performance compared to any
contributing member, with an accuracy of about 83.4 percent.

>norm: 0.821 (0.041)
>std: 0.831 (0.045)
>robust: 0.826 (0.044)
>power: 0.825 (0.045)
>quant: 0.817 (0.042)
>kbins: 0.833 (0.035)
>ensemble: 0.834 (0.040)

A figure is also created showing box and whisker plots of classification accuracy for each individual
model as well as the data transform ensemble.

We can see that the distribution for the ensemble is skewed up, which is what we might hope, and
that the mean (green triangle) is slightly higher than those of the individual ensemble members.

Box and Whisker Plot of Accuracy Distribution for Individual Models and Data Transform Ensemble

Now that we are familiar with how to develop a data transform ensemble for classification, let’s look
at doing the same for regression.

Data Transform Ensemble for Regression


In this section, we will explore developing a data transform ensemble for a regression predictive
modeling problem.

First, we can define a synthetic regression dataset as the basis for exploring this type of ensemble.

The example below creates a dataset with 1,000 examples, each with 100 input features, where 10 of them are informative for predicting the target.

# synthetic regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the data has the expected shape.

(1000, 100) (1000,)

Next, we can establish a baseline in performance on the synthetic dataset by fitting and evaluating
the base model that we intend to use in the ensemble, in this case, a DecisionTreeRegressor.

The model will be evaluated using repeated k-fold cross-validation with three repeats and 10 folds. Model performance on the dataset will be reported using the mean absolute error, or MAE. The scikit-learn library inverts the score (makes it negative) so that the framework can maximize it. As such, we can ignore the sign on the score.
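As a small illustrative sketch (with made-up score values, not from the tutorial code), the sign can simply be flipped when reporting the error.

# minimal sketch: cross_val_score returns negated MAE values, so flip the sign to report MAE
from numpy import array, mean
n_scores = array([-140.2, -138.9, -141.5])  # hypothetical negated MAE scores
print('MAE: %.3f' % -mean(n_scores))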

The example below evaluates the decision tree on the synthetic regression dataset.

# evaluate decision tree on synthetic regression dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.tree import DecisionTreeRegressor
# define dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=7)
# define the model
model = DecisionTreeRegressor()
# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# report performance
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the MAE of the decision tree on the synthetic regression dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.

In this case, we can see that the model achieved an MAE of about 139.817. This provides a baseline in performance that we expect the ensemble model to improve upon.

MAE: -139.817 (12.449)

Next, we can develop and evaluate the ensemble.

We will use the same data transforms from the previous section. The VotingRegressor will be used to
combine the predictions, which is appropriate for regression problems.

The get_ensemble() function defined below creates the individual models and the ensemble model
and combines all of the models as a list of tuples for evaluation.

# get a voting ensemble of models
def get_ensemble():
    # define the base models
    models = list()
    # normalization
    norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeRegressor())])
    models.append(('norm', norm))
    # standardization
    std = Pipeline([('s', StandardScaler()), ('m', DecisionTreeRegressor())])
    models.append(('std', std))
    # robust
    robust = Pipeline([('s', RobustScaler()), ('m', DecisionTreeRegressor())])
    models.append(('robust', robust))
    # power
    power = Pipeline([('s', PowerTransformer()), ('m', DecisionTreeRegressor())])
    models.append(('power', power))
    # quantile
    quant = Pipeline([('s', QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m', DecisionTreeRegressor())])
    models.append(('quant', quant))
    # kbins
    kbins = Pipeline([('s', KBinsDiscretizer(n_bins=20, encode='ordinal')), ('m', DecisionTreeRegressor())])
    models.append(('kbins', kbins))
    # define the voting ensemble
    ensemble = VotingRegressor(estimators=models)
    # return a list of tuples each with a name and model
    return models + [('ensemble', ensemble)]

We can then call this function and evaluate each contributing modeling pipeline independently and
compare the results to the ensemble of the pipelines.

Our expectation, as before, is that the ensemble results in a lift in performance over any individual
model. If it does not, then the top-performing individual model should be chosen instead.

Tying this together, the complete example for evaluating a data transform ensemble for a regression
dataset is listed below.

# comparison of data transform ensemble to each contributing member for regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import Pipeline
from matplotlib import pyplot

# get a voting ensemble of models
def get_ensemble():
    # define the base models
    models = list()
    # normalization
    norm = Pipeline([('s', MinMaxScaler()), ('m', DecisionTreeRegressor())])
    models.append(('norm', norm))
    # standardization
    std = Pipeline([('s', StandardScaler()), ('m', DecisionTreeRegressor())])
    models.append(('std', std))
    # robust
    robust = Pipeline([('s', RobustScaler()), ('m', DecisionTreeRegressor())])
    models.append(('robust', robust))
    # power
    power = Pipeline([('s', PowerTransformer()), ('m', DecisionTreeRegressor())])
    models.append(('power', power))
    # quantile
    quant = Pipeline([('s', QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m', DecisionTreeRegressor())])
    models.append(('quant', quant))
    # kbins
    kbins = Pipeline([('s', KBinsDiscretizer(n_bins=20, encode='ordinal')), ('m', DecisionTreeRegressor())])
    models.append(('kbins', kbins))
    # define the voting ensemble
    ensemble = VotingRegressor(estimators=models)
    # return a list of tuples each with a name and model
    return models + [('ensemble', ensemble)]

# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=7)
# get models
models = get_ensemble()
# evaluate each model
results = list()
for name, model in models:
    # define the evaluation method
    cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate the model on the dataset
    n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
    # report performance
    print('>%s: %.3f (%.3f)' % (name, mean(n_scores), std(n_scores)))
    results.append(n_scores)
# plot the results for comparison
pyplot.boxplot(results, labels=[n for n,_ in models], showmeans=True)
pyplot.show()

Running the example first reports the MAE of each individual model, ending with the performance of
the ensemble that combines the models.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or
differences in numerical precision. Consider running the example a few times and compare the
average outcome.

We can see that each model performs about the same, with MAE scores around 140, all slightly worse than the decision tree used in isolation. Interestingly, the ensemble performs the best, out-performing all of the individual members and the tree with no transforms, achieving an MAE of about 126.487.

This result suggests that although each pipeline performs worse than a single tree without transforms, each pipeline makes different errors, and averaging the models is able to leverage these differences toward a lower error.

>norm: -140.559 (11.783)
>std: -140.582 (11.996)
>robust: -140.813 (11.827)
>power: -141.089 (12.668)
>quant: -141.109 (11.097)
>kbins: -145.134 (11.638)
>ensemble: -126.487 (9.999)
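This averaging effect can be demonstrated with a rough simulation (a sketch with simulated predictions, not part of the tutorial): averaging several predictors whose errors are independent gives a lower MAE than any single predictor.

# minimal sketch: averaging predictors with independent errors lowers the MAE
from numpy import mean, abs
from numpy.random import default_rng

rng = default_rng(1)
truth = rng.normal(size=1000)
# five hypothetical members, each unbiased but making different (independent) errors
members = [truth + rng.normal(scale=1.0, size=1000) for _ in range(5)]
for i, pred in enumerate(members):
    print('member %d MAE: %.3f' % (i, mean(abs(pred - truth))))
# the simple average of the members has a lower error than any individual member
combined = mean(members, axis=0)
print('average MAE: %.3f' % mean(abs(combined - truth)))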

A figure is created comparing the distribution of MAE scores for each pipeline and the ensemble.

As we hoped, the distribution of scores for the ensemble skews higher than all of the other models and has a higher central tendency, which corresponds to a smaller error given the scores are negated MAE values (the mean and median are indicated by the green triangle and orange line respectively).

Box and Whisker Plot of MAE Distributions for Individual Models and Data Transform Ensemble

Further Reading
This section provides more resources on the topic if you are looking to go deeper.

Tutorials
How to Develop Voting Ensembles With Python

Books
Pattern Classification Using Ensemble Methods, 2010.
Ensemble Methods, 2012.
Ensemble Machine Learning, 2012.

APIs

sklearn.ensemble.VotingClassifier API.
sklearn.ensemble.VotingRegressor API.

Summary
In this tutorial, you discovered how to develop a data transform ensemble.

Specifically, you learned:

Data transforms can be used as the basis for a bagging-type ensemble where the same
model is trained on different views of a training dataset.
How to develop a data transform ensemble for classification and confirm the ensemble
performs better than any contributing member.
How to develop and evaluate a data transform ensemble for regression predictive modeling.

Do you have any questions?


Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Ensemble Learning!

Improve Your Predictions in Minutes
...with just a few lines of python code

Discover how in my new Ebook:
Ensemble Learning Algorithms With Python

It provides self-study tutorials with full working code on:
Stacking, Voting, Boosting, Bagging, Blending, Super Learner, and much more...

Bring Modern Ensemble Learning Techniques to Your Machine Learning Projects

SEE WHAT'S INSIDE

More On This Topic

How to Develop a Bagging Ensemble with Python
Bagging and Random Forest Ensemble Algorithms for…
How to Create a Bagging Ensemble of Deep Learning…
How to Develop an Ensemble of Deep Learning Models in Keras
How to Develop a Horizontal Voting Deep Learning…
How to Develop a Weighted Average Ensemble for Deep…


8 Responses to Develop a Bagging Ensemble with Different Data Transformations

Rekha November 21, 2020 at 1:31 am #

Can we use this for sequence prediction (regression) problems where we have a small number of features?

Jason Brownlee November 21, 2020 at 6:43 am #

I don’t see why not.

Igors Papka November 21, 2020 at 7:40 am #

Dear Dr. Jason,

Thank you for the tutorial.
I tried to experiment with estimators on the housing dataset and found out that the best result was achieved with an xgboost regressor on unseen data instead of decision trees. PolynomialFeatures() as an additional transformer also improves the score.

Jason Brownlee November 21, 2020 at 7:42 am #

You’re welcome.

Nice work!

Celeste Short December 28, 2020 at 10:21 am #

Thank you very much. This was awesome.

Jason Brownlee December 28, 2020 at 1:14 pm #

You’re welcome!

Mehdi December 1, 2022 at 3:43 am #

Dear Dr. Jason,

Can I use this approach with a data transformer that I write myself?
I have stock data and want to apply this approach to different technical indicators created from the data.

James Carmichael December 1, 2022 at 8:08 am #

Hi Mehdi… In theory there would be no issue. Please clarify the goals of your model so that we may better assist you.
