8 Exercise - Optimize and Save Models - Training - Microsoft Learn
# Split data 70%-30% into training set and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
Now we're ready to train a model by fitting a boosting ensemble algorithm, as in our last
notebook. Recall that a Gradient Boosting estimator is like a Random Forest algorithm, but
instead of building all the trees independently and taking the average result, each tree is
built on the outputs of the previous one in an attempt to incrementally reduce the loss
(error) in the model.
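To make that idea concrete, here is a minimal sketch of boosting for a squared-error loss (an illustration only, not Scikit-Learn's actual implementation): each new tree is fit to the residuals of the ensemble built so far, and its predictions are added to the running estimate, scaled by a learning rate.

# Illustrative sketch of gradient boosting for squared-error loss -- not the
# Scikit-Learn implementation, just the core idea described above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_estimators=100, learning_rate=0.1):
    base = y.mean()                               # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction                # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def boost_predict(base, trees, X, learning_rate=0.1):
    # Sum the scaled contributions of every tree on top of the base prediction
    return base + learning_rate * sum(tree.predict(X) for tree in trees)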
# Train the model
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import matplotlib.pyplot as plt

# Fit a Gradient Boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print (model, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
GradientBoostingRegressor()
MSE: 104391.33046697672
RMSE: 323.09647238398736
R2: 0.7953473485996554
Optimize Hyperparameters
Take a look at the GradientBoostingRegressor estimator definition in the output above, and
note that it, like the other estimators we tried previously, includes a large number of
parameters that control the way the model is trained. In machine learning, the term
parameters refers to values that can be determined from data; values that you specify to
affect the behavior of a training algorithm are more correctly referred to as hyperparameters.
The specific hyperparameters for an estimator vary based on the algorithm that the
estimator encapsulates. In the case of the GradientBoostingRegressor estimator, the
algorithm is an ensemble that combines multiple decision trees to create an overall
predictive model. You can learn about the hyperparameters for this estimator in the Scikit-
Learn documentation.
We won't go into the details of each hyperparameter here, but they work together to affect
the way the algorithm trains a model. In many cases, the default values provided by Scikit-
Learn will work well; but there may be some advantage in modifying hyperparameters to get
better predictive performance or reduce training time.
So how do you know what hyperparameter values you should use? Well, in the absence of a
deep understanding of how the underlying algorithm works, you'll need to experiment.
Fortunately, Scikit-Learn provides a way to tune hyperparameters by trying multiple
combinations and finding the best result for a given performance metric.
Let's use a grid search approach to try combinations from a grid of possible values for
the learning_rate and n_estimators hyperparameters of the GradientBoostingRegressor
estimator.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score
# Use a Gradient Boosting algorithm
alg = GradientBoostingRegressor()
# Try these hyperparameter values
params = {
'learning_rate': [0.1, 0.5, 1.0],
'n_estimators' : [50, 100, 150]
}
# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")
# Get the best model
model=gridsearch.best_estimator_
print(model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
GradientBoostingRegressor()
MSE: 103663.66480123281
RMSE: 321.96842205600353
R2: 0.7967738914663985
Note: The use of random values in the Gradient Boosting algorithm results in slightly
different metrics each time. In this case, the best model produced by hyperparameter
tuning is unlikely to be significantly better than one trained with the default
hyperparameter values; but it's still useful to know about the hyperparameter tuning
technique!
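If you want repeatable metrics from run to run, you can fix the estimator's random_state when you create it. Here is a minimal sketch that mirrors the grid search cell above with a fixed seed (the parameter values are simply the ones we tried earlier):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Fixing random_state makes the boosting results repeatable between runs
alg = GradientBoostingRegressor(random_state=0)
params = {
    'learning_rate': [0.1, 0.5, 1.0],
    'n_estimators' : [50, 100, 150]
}
gridsearch = GridSearchCV(alg, params, scoring=make_scorer(r2_score), cv=3)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_)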
In practice, it's common to perform some preprocessing of the data to make it easier for the
algorithm to fit a model to it. There's a huge range of preprocessing transformations you can
perform to get your data ready for modeling, but we'll limit ourselves to a few common
techniques:
Scaling numeric features: normalizing numeric features so they're on the same scale prevents
features with large values from producing coefficients that disproportionately affect the
predictions. For example, suppose your data includes the following numeric features:

A      B      C
3      480    65

Normalizing these features to the same scale may result in the following values (assuming A
contains values from 0 to 10, B contains values from 0 to 1000, and C contains values from 0
to 100):

A      B      C
0.3    0.48   0.65
There are multiple ways you can scale numeric data, such as calculating the minimum and
maximum values for each column and assigning a proportional value between 0 and 1, or by
using the mean and standard deviation of a normally distributed variable to maintain the
same spread of values on a different scale.
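As a concrete illustration of those two approaches, here is a small sketch using Scikit-Learn's MinMaxScaler and StandardScaler on a toy column (the values are illustrative, not taken from the bike data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single toy numeric column (illustrative values only)
values = np.array([[3.0], [480.0], [65.0], [250.0]])

# Min-max scaling assigns each value a proportional position between 0 and 1
print(MinMaxScaler().fit_transform(values).ravel())

# Standardization uses the mean and standard deviation, keeping the same
# relative spread of values on a different scale
print(StandardScaler().fit_transform(values).ravel())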
Encoding categorical variables: machine learning models work best with numeric features, so
categorical values generally need to be converted into numeric representations. For example,
suppose your data includes the following categorical feature:

Size
S
M
L
You can apply ordinal encoding to substitute a unique integer value for each category, like
this:
Size
0
1
2
Another common technique is to use one hot encoding to create individual binary (0 or 1)
features for each possible category value. For example, you could use one-hot encoding to
translate the possible categories into binary columns like this:
Size_S    Size_M    Size_L
1         0         0
0         1         0
0         0         1
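As a rough sketch of how you might produce these encodings with Scikit-Learn (assuming the Size values are S, M, and L), OrdinalEncoder and OneHotEncoder do the work for you:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Toy categorical column (assumed values for the Size example)
sizes = np.array([['S'], ['M'], ['L']])

# Ordinal encoding: one integer per category (categories are sorted
# alphabetically by default, so here L=0, M=1, S=2)
print(OrdinalEncoder().fit_transform(sizes).ravel())

# One-hot encoding: one binary column per category value
print(OneHotEncoder().fit_transform(sizes).toarray())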
To apply these preprocessing transformations to the bike rental data, we'll make use of a Scikit-
Learn feature named pipelines. These enable us to define a set of preprocessing steps that
end with an algorithm. You can then fit the entire pipeline to the data, so that the model
encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful,
because when we want to use the model to predict values from new data, we need to apply
the same transformations (based on the same statistical distributions and category
encodings used with the training data).
Note: The term pipeline is used extensively in machine learning, often to mean very
different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn,
but you may see it used elsewhere to mean something else.
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
import numpy as np
# Define preprocessing for numeric columns (scale them)
numeric_features = [6,7,8,9]
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())])
# Define preprocessing for categorical features (encode them)
categorical_features = [0,1,2,3,4,5]
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', GradientBoostingRegressor())])
# Fit the pipeline to train a Gradient Boosting model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('scaler',
StandardScaler())]),
[6, 7, 8, 9]),
('cat',
Pipeline(steps=[('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
[0, 1, 2, 3, 4, 5])])),
('regressor', GradientBoostingRegressor())])
OK, the model is trained, including the preprocessing steps. Let's see how it performs with
the validation data.
# Get predictions
predictions = model.predict(X_test)
# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
MSE: 105423.61001286482
RMSE: 324.6900214248427
R2: 0.7933236293395663
The pipeline is composed of the transformations and the algorithm used to train the model.
To try an alternative algorithm, you can just change that step to a different kind of estimator.
# Use a different estimator in the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor())])
# Fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model, "\n")
# Get predictions
predictions = model.predict(X_test)
# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions - Preprocessed')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('scaler',
StandardScaler())]),
[6, 7, 8, 9]),
('cat',
Pipeline(steps=[('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
[0, 1, 2, 3, 4, 5])])),
('regressor', RandomForestRegressor())])
MSE: 106169.34044681821
RMSE: 325.8363706629728
R2: 0.7918616716285592
We've now seen a number of common techniques used to train predictive models for
regression. In a real project, you'd likely try a few more algorithms, hyperparameters, and
preprocessing transformations; but by now you should have got the general idea. Let's
explore how you can use the trained model with new data.
import joblib
# Save the model as a pickle file
filename = './bike-share.pkl'
joblib.dump(model, filename)
['./bike-share.pkl']
Now, we can load it whenever we need it, and use it to predict labels for new data. This is
often called scoring or inferencing.
# Load the model from the file
loaded_model = joblib.load(filename)
# Create a numpy array containing a new observation (for example tomorrow's seasonal and weather forecast information)
X_new = np.array([[1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]).astype('float64')
print ('New sample: {}'.format(list(X_new[0])))
# Use the model to predict tomorrow's rentals
result = loaded_model.predict(X_new)
print('Prediction: {:.0f} rentals'.format(np.round(result[0])))
New sample: [1.0, 1.0, 0.0, 3.0, 1.0, 1.0, 0.226957, 0.22927, 0.436957, 0.1869]
The model's predict method accepts an array of observations, so you can use it to generate
multiple predictions as a batch. For example, suppose you have a weather forecast for the
next five days; you could use the model to predict bike rentals for each day based on the
expected weather conditions.
# An array of features based on five-day weather forecast
X_new = np.array([[0,1,1,0,0,1,0.344167,0.363625,0.805833,0.160446],
[0,1,0,1,0,1,0.363478,0.353739,0.696087,0.248539],
[0,1,0,2,0,1,0.196364,0.189405,0.437273,0.248309],
[0,1,0,3,0,1,0.2,0.212122,0.590435,0.160296],
[0,1,0,4,0,1,0.226957,0.22927,0.436957,0.1869]])
# Use the model to predict rentals
results = loaded_model.predict(X_new)
print('5-day rental predictions:')
for prediction in results:
print(np.round(prediction))
446.0
673.0
233.0
204.0
261.0
Summary
That concludes the notebooks for this module on regression. In this notebook we ran a
complex regression, tuned it, saved the model, and used it to predict outcomes for the
future.
Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.