Supervised Machine Learning
Revision history

Date        Comments
2019-01-18  Initial version. Chapter 6 missing.
2019-01-23  Typos corrected, mainly in Chapters 2 and 5. Section 2.6.3 added.
2019-01-28  Typos corrected, mainly in Chapter 3.
2019-02-07  Chapter 6 added.
2019-03-04  Typos corrected.
2019-03-11  Typos (incl. eq. (3.25)) corrected.
2019-03-12  Typos corrected.
Contents
1 Introduction 7
1.1 What is machine learning all about? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Regression and classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Overview of these lecture notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
6 Ensemble methods 67
6.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.1 Variance reduction by averaging . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.2 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.1 The conceptual idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.2 Binary classification, margins, and exponential loss . . . . . . . . . . . . . . . . 75
6.3.3 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3.4 Boosting vs. bagging: base models and ensemble size . . . . . . . . . . . . . . 79
6.3.5 Robust loss functions and gradient boosting . . . . . . . . . . . . . . . . . . . . 80
6.A Classification loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Bibliography 111
1 Introduction
The science of machine learning is about learning models that generalize well.
These lecture notes are exclusively about supervised learning, which refers to the problem where the data is of the form {(xi, yi)}, i = 1, . . . , n, where xi denotes inputs¹ and yi denotes outputs². In other words, in supervised learning we have labeled data, in the sense that each data point has an input xi and an output yi which explicitly explains "what we see in the data". For example, to check for signs of heart disease, medical doctors make use of a so-called electrocardiogram (ECG), a test that measures the electrical activity of the heart via electrodes placed on the skin of the patient's chest, arms and legs. Based on these readings, a skilled medical doctor can then make a diagnosis. In this example the ECG measurements constitute the input x and the diagnosis provided by the medical doctor constitutes the output y. If we have access to a large enough pool of labeled data of this kind (where we have both the ECG reading x and the diagnosis y), we can use supervised machine learning to learn a model for the relationship between x and y. Once the model is learned, it can be used to diagnose new ECG readings, for which we do not (yet) know the diagnosis y. This is called a prediction, and we use ŷ to denote it. If the model makes good predictions (close to the true y) also for ECGs not in the training data, we have a model which generalizes well.
One of the most challenging problems with supervised learning is that it requires labeled data, i.e. both the inputs and the corresponding outputs {(xi, yi)}, i = 1, . . . , n. This is challenging because the process of labeling data is often expensive and sometimes also difficult or even impossible, since it requires humans to interpret the input and provide the correct output. The situation is made even worse by the fact that most state-of-the-art methods require a lot of data to perform well. This situation has motivated the development of unsupervised learning methods, which only require the input data {xi}, i = 1, . . . , n, i.e. so-called unlabeled data. An important subproblem is that of clustering, where data is automatically organized into different groups based on some notion of similarity. There is also an increasingly important middle ground referred to as semi-supervised learning, where we make use of both labeled and unlabeled data. The reason is that we often have access to a lot of unlabeled data but only a small amount of labeled data. However, this small amount of labeled data might still prove highly valuable when used together with the much larger set of unlabeled data.
1. Some common synonyms used for the input variable include feature, predictor, regressor, covariate, explanatory variable, controlled variable and independent variable.
2. Some common synonyms used for the output variable include response, regressand, label, explained variable, predicted variable and dependent variable.

In the area of reinforcement learning, another branch of machine learning, we do not only want to make use of measured data in order to predict something or understand a given situation; instead, we want to develop a system that can learn how to take actions in the real world. The most common
approach is to learn these actions by trying to maximize some kind of reward encouraging the desired
state of the environment. The area of reinforcement learning has very strong ties to control theory. Finally, we mention the emerging area of causal learning, where the aim is to tackle the much harder problem of learning cause-and-effect relationships. This is very different from the other facets of machine learning briefly introduced above, where it was sufficient to learn associations/correlations in the data. In causal learning the aim is to move beyond learning correlations and instead to learn causal relations.
Depending on whether the output of a problem is quantitative or qualitative, we refer to the problem as either regression or classification.
Regression means the output is quantitative, and classification means the output is qualitative.
This means that whether a problem is about regression or classification depends only on its output. The
input can be either quantitative or qualitative in both cases.
The distinction between quantitative and qualitative, and thereby between regression and classification, is however somewhat arbitrary, and there is not always a clear answer: one could, for instance, argue that having no children is something qualitatively different from having children, and use the qualitative output "children: yes/no" instead of "0, 1 or 2 children", thereby turning a regression problem into a classification problem.
1.3 Overview of these lecture notes
[Overview of the chapters: for each chapter (starting with Chapter 1: Introduction), it is indicated which other chapters are needed and which are recommended background.]
2 The regression problem and linear regression
The first problem we will study is the regression problem. Regression is one of the two main problems that we cover in these notes (the other one is classification). The first method we will encounter is linear regression, which is one (of many) solutions to the regression problem. Despite the relative simplicity of linear regression, it is surprisingly useful, and it also constitutes an important building block for more advanced methods (such as deep learning, Chapter 7).
Example 2.1: Car stopping distances

Ezekiel and Fox (1959) present a dataset with 62 observations of the distance needed for various cars at different initial speeds to brake to a full stop.ᵃ The dataset has the following two variables:
- Speed: The speed of the car when the brake signal is given.
- Distance: The distance traveled after the signal is given until the car has reached a full stop.
We decide to interpret Speed as the input variable x, and Distance as the output variable y.
[Scatter plot of the data, with Speed (mph, 0–40) on the horizontal axis and Distance (feet, 0–150) on the vertical axis.]
Our goal is to use linear regression to estimate (that is, to predict) how long the stopping distance would be if the initial speed were 33 mph or 45 mph (two speeds at which no data has been recorded).
a. The dataset is somewhat dated, so the conclusions are perhaps not applicable to modern cars. We believe, however, that the reader is capable of pretending that the data comes from her/his own favorite example instead.
1. We will start with quantitative input variables, and discuss qualitative input variables later in Section 2.5.
In the linear regression model, the output y is described as a linear combination of the p input variables x1, x2, . . . , xp plus a noise term,

y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε,    (2.2)

where we denote the linear part of the right hand side, β0 + β1 x1 + · · · + βp xp, by f(x; β).
We refer to the coefficients β0 , β1 , . . . βp as the parameters in the model, and we sometimes refer to β0
specifically as the intercept term. The noise term ε accounts for non-systematic, i.e., random, errors
between the data and the model. The noise is assumed to have mean zero and to be independent of x.
Machine learning is about training, or learning, models from data. Hence, the main part of this chapter will
be devoted to how to learn the parameters β0 , β1 , . . . , βp from some training dataset T = {(xi , yi )}ni=1 .
Before we dig into the details in Section 2.3, let us briefly discuss the purpose of using linear regression. The linear regression model can be used for at least two different purposes: to describe relationships in the data by interpreting the parameters β = [β0 β1 . . . βp]ᵀ, and to predict future outputs for inputs that we have not yet seen.
Remark 2.1 It is possible to formulate the model also for multiple outputs y1 , y2 , . . . , see the exercises.
This is commonly referred to as multivariate linear regression.
We use the hat symbol on ŷ⋆ to indicate that it is a prediction, our best guess. If we were able to somehow observe the actual output at x⋆, we would denote it by y⋆ (without a hat).
2.3 Learning the model from training data
[Figure 2.1: The linear regression model (line) fitted to three data points (x1, y1), (x2, y2), (x3, y3); each data point deviates from the line by a noise term ε, and the prediction ŷ⋆ is the value of the line at the test input x⋆.]
If we stack the training inputs and outputs as

X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix},   y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix},

then X is an n × (p + 1) matrix, and y an n-dimensional vector. The first column of X, with only ones, corresponds to the intercept term β0 in the linear regression model (2.2). If we also stack the unknown parameters β0, β1, . . . , βp into a (p + 1)-dimensional vector

β = [β0 β1 · · · βp]ᵀ,    (2.5)

the linear regression model (2.2) for all n training data points can be written compactly as

y = Xβ + ε,    (2.6)

where ε is the vector of the n noise terms, stacked in the same way as y.
Example 2.2: Car stopping distances

We will continue Example 2.1 and form the matrices X and y. Since we only have one input and one output, both xi and yi are scalar. We get

X = \begin{bmatrix} 1 & 4 \\ 1 & 5 \\ 1 & 5 \\ 1 & 5 \\ 1 & 5 \\ 1 & 7 \\ 1 & 7 \\ 1 & 8 \\ \vdots & \vdots \\ 1 & 39 \\ 1 & 39 \\ 1 & 40 \end{bmatrix},   β = \begin{bmatrix} β0 \\ β1 \end{bmatrix},   y = \begin{bmatrix} 4 \\ 2 \\ 4 \\ 8 \\ 8 \\ 7 \\ 7 \\ 8 \\ \vdots \\ 138 \\ 110 \\ 134 \end{bmatrix}.    (2.7)
We learn the parameters by choosing β such that the observed training data becomes as likely as possible under the model, where p(y | X, β) is the probability density of the data y given a certain value of the parameters β. We denote the solution to this problem, the learned parameters, by β̂ = [β̂0 β̂1 · · · β̂p]ᵀ. More compactly, we write this as

β̂ = arg maxβ p(y | X, β).    (2.9)
In order to have a notion of what 'likely' means, and thereby specify p(y | X, β) mathematically, we need to make assumptions about the noise term ε. A common assumption is that ε follows a Gaussian distribution with zero mean and variance σε²,

ε ∼ N(0, σε²).    (2.10)

This implies that the conditional probability density function of the output y for a given value of the input x is given by

p(y | x, β) = N(y | β0 + β1 x1 + · · · + βp xp, σε²).    (2.11)
Furthermore, the n observed training data points are assumed to be independent realizations from this statistical model. This implies that the likelihood of the training data factorizes as

p(y | X, β) = ∏ᵢ₌₁ⁿ p(yi | xi, β).    (2.12)
Recall from (2.9) that we want to maximize the likelihood w.r.t. β. However, since (2.13) only depends on β via the sum in the exponent, and since the exponential is a monotonically increasing function, maximizing (2.13) is equivalent to minimizing

∑ᵢ₌₁ⁿ (β0 + β1 xi1 + · · · + βp xip − yi)².    (2.14)

This is the sum of the squares of the differences between each output data point yi and the model's prediction of that output, ŷi = β0 + β1 xi1 + · · · + βp xip. For this reason, minimizing (2.14) is usually referred to as least squares.
We will come back to how the values β̂0, β̂1, . . . , β̂p can be computed. Let us first just mention that it is also possible, and sometimes a very good idea, to assume that the distribution of ε is something other than a Gaussian distribution. One can, for instance, assume that ε instead has a Laplace distribution, which would yield the cost function

∑ᵢ₌₁ⁿ |β0 + β1 xi1 + · · · + βp xip − yi|.    (2.15)

It contains the sum of the absolute values of all differences (rather than their squares). The major benefit of the Gaussian assumption (2.10) is that there is a closed-form solution available for β̂0, β̂1, . . . , β̂p, whereas other assumptions on ε usually require computationally more expensive methods.
Remark 2.2 With the terminology we will introduce in the next chapter, we could refer to (2.13) as the likelihood function, which we will denote by ℓ(β).
Remark 2.3 It is not uncommon in the literature to skip the maximum likelihood motivation, and just
state (2.14) as a (somewhat arbitrary) cost function for optimization.
The least squares problem (2.14) can be written compactly in matrix form as

β̂ = arg minβ ‖Xβ − y‖²₂,    (2.16)

where ‖·‖₂ denotes the usual Euclidean vector norm, and ‖·‖²₂ its square. From a linear algebra point of view, this can be seen as the problem of finding the closest (in a Euclidean sense) vector to y in the subspace of Rⁿ spanned by the columns of X. The solution to this problem is the orthogonal projection of y onto this subspace, and the corresponding β̂ can be shown (Section 2.A) to fulfill

XᵀXβ̂ = Xᵀy.    (2.17)

Equation (2.17) is often referred to as the normal equations, and gives the solution to the least squares problem (2.14, 2.16). If XᵀX is invertible, which is often the case, β̂ has the closed form

β̂ = (XᵀX)⁻¹Xᵀy.    (2.18)

The fact that this closed-form solution exists is important, and is perhaps the reason why least squares has become very popular and widely used. As discussed, other assumptions on ε than Gaussianity lead to other problems than least squares, such as (2.15), for which no closed-form solution exists.
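As a complementary illustration, here is a minimal NumPy sketch that forms X and y for a one-input problem and solves the normal equations (2.17); the data values below are made up for illustration:

import numpy as np

# Hypothetical training data: one input (e.g. speed) and one output (e.g. distance).
x = np.array([4.0, 5.0, 7.0, 8.0, 12.0, 20.0, 30.0, 40.0])
y = np.array([2.0, 4.0, 7.0, 8.0, 15.0, 40.0, 80.0, 130.0])

# Design matrix X: a column of ones (for the intercept beta_0) and the input values.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X beta = X^T y. Solving the linear system is
# numerically preferable to forming the explicit inverse (X^T X)^{-1}.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predict the output for a new test input x_star, as in (2.2) without the noise term.
x_star = 33.0
y_star_hat = np.array([1.0, x_star]) @ beta_hat
print(beta_hat, y_star_hat)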
Time to reflect 2.1: What does it mean in practice that XT X is not invertible?
Figure 2.2: A graphical explanation of the least squares criterion: the goal is to choose the model (blue line) such
that the sum of the square (orange) of each error ε (green) is minimized. That is, the blue line is to be chosen so that
the amount of orange color is minimized. This motivates the name least squares.
Time to reflect 2.2: If the columns of X are linearly independent and p = n − 1, X spans the entire Rⁿ. That means there exists a unique solution such that y = Xβ exactly, and (2.17) reduces to β = X⁻¹y: the model fits the training data perfectly. Why is that not desired?
Example 2.3: Car stopping distances

By inserting the matrices (2.7) from Example 2.2 into the normal equations (2.17), we obtain β̂0 = −20.1 and β̂1 = 3.1. If we plot the resulting model, it looks like this:
[Plot of the data, the fitted straight-line model and the two predictions, with Speed (mph, 0–40) on the horizontal axis and Distance (feet, 0–150) on the vertical axis.]
With this model, the predicted stopping distance for x⋆ = 33 mph is ŷ⋆ = 84 feet, and for x⋆ = 45 mph it is ŷ⋆ = 121 feet.
The reason for the word 'linear' in the name 'linear regression' is that the output is modelled as a linear combination of the inputs.² We have, however, not made a clear definition of what an input is: if the speed is an input, then why could not also the kinetic energy, proportional to its square, be considered as another input? The answer is yes, it can. We can in fact make use of arbitrary nonlinear transformations of the 'original' input variables as inputs in the linear regression model.
2. And also the constant 1, corresponding to the offset β0. For this reason, affine would perhaps be a better term than linear.
2.4 Nonlinear transformations of the inputs – creating more features
Figure 2.3: A linear regression model with 2nd and 4th order polynomials in the input x, as shown in (2.20). [Two panels, each showing the training data and the fitted model curve over the input x.] (a) The maximum likelihood solution with a 2nd order polynomial in the linear regression model. As discussed, the line is no longer straight (cf. Figure 2.1). This is, however, merely an artefact of the plot: in a three-dimensional plot with each feature (here, x and x²) on a separate axis, it would still be an affine set. (b) The maximum likelihood solution with a 4th order polynomial in the linear regression model. Note that a 4th order polynomial contains 5 unknown coefficients, which roughly means that we can expect the learned model to fit 5 data points exactly (cf. Time to reflect 2.2, p = n − 1).
If we, for example, only have a one-dimensional input x, the basic linear regression model (2.2) is

y = β0 + β1 x + ε.    (2.19)

However, we can also extend the model with, for instance, x², x³, . . . , xᵖ as inputs, and thus obtain a linear regression model which is a polynomial in x,

y = β0 + β1 x + β2 x² + · · · + βp xᵖ + ε.    (2.20)
Note that this is still a linear regression model, since the unknown parameters appear in a linear fashion, with x, x², . . . , xᵖ as new inputs. The parameters β̂ are still learned in the same way, but the matrix X is different for the models (2.19) and (2.20). We will refer to the transformed inputs as features. In more complicated settings the distinction between the original inputs and the transformed features might not be as clear, and the terms feature and input can sometimes be used interchangeably.
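As an illustration of how little changes, the following NumPy sketch builds the design matrix for the polynomial model (2.20) and learns the parameters with the same least squares machinery as before; the data and the degree p are placeholders:

import numpy as np

# Placeholder one-dimensional training data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 4.1, 8.3, 15.2, 24.9])
p = 2  # polynomial degree, chosen by the user

# Each column of X is one feature: 1, x, x^2, ..., x^p, cf. (2.20).
X = np.column_stack([x**k for k in range(p + 1)])

# The parameters are still learned by least squares; only X has changed.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)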
Time to reflect 2.3: Figure 2.3 shows an example of two linear regression models with transformed
(polynomial) inputs. When studying the figure one may ask how a linear regression model can
result in a curved line? Are linear regression models not restricted to linear (or affine) straight
lines? The answer is that it depends on the plot: Figure 2.3(a) shows a two-dimensional plot with
x, y (the ‘original’ input), but a three-dimensional plot with x, x2 , y (each feature on a separate
axis) would still be affine. The same holds true also for Figure 2.3(b) but in that case we would
need a 5-dimensional plot.
Even though the model in Figure 2.3(b) is able to fit all data points exactly, it also suggests that higher-order polynomials might not always be very useful: the behavior of the model in between and outside the data points is rather peculiar, and not very well motivated by the data. Higher-order polynomials are for this reason rarely used in practice in machine learning. An alternative and much more common feature is the so-called radial basis function (RBF) kernel

Kc(x) = exp( −‖x − c‖²₂ / ℓ ),    (2.21)

i.e., a Gaussian bell centered around c. It can be used, instead of polynomials, in the linear regression model as

y = β0 + β1 Kc1(x) + β2 Kc2(x) + · · · + βp Kcp(x) + ε.    (2.22)
This model can be seen as p 'bumps' located at c1, c2, . . . , cp, respectively. Note that the locations c1, c2, . . . , cp as well as the length scale ℓ have to be decided by the user; only the parameters β0, β1, . . . , βp are learned from data in linear regression. This is illustrated in Figure 2.4. RBF kernels are in general preferred over polynomials since they have 'local' properties, meaning that a small change in one parameter mostly affects the model only locally around that kernel, whereas a small change in one parameter in a polynomial model affects the model everywhere.
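A small NumPy sketch of how the RBF kernel model (2.21)–(2.22) could be set up; the kernel centres, the length scale ℓ and the data are placeholder choices:

import numpy as np

def rbf_feature(x, c, ell):
    # The RBF kernel (2.21): a Gaussian bump of width ell centred at c.
    return np.exp(-np.abs(x - c)**2 / ell)

# Kernel centres c_1, ..., c_p and the length scale ell are chosen by the user.
centres = np.array([-6.0, -2.0, 2.0, 6.0])
ell = 4.0

x = np.linspace(-8, 8, 9)   # placeholder training inputs
y = np.sin(0.5 * x)         # placeholder training outputs

# Design matrix: an intercept column plus one RBF feature per centre, as in (2.22).
X = np.column_stack([np.ones_like(x)] + [rbf_feature(x, c, ell) for c in centres])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)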
Example 2.4: Car stopping distances
We continue with Example 2.1, but this time we also add the squared speed as a feature, i.e., the features are now x and x². This gives the new matrices (cf. (2.7))

X = \begin{bmatrix} 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 5 & 25 \\ \vdots & \vdots & \vdots \\ 1 & 39 & 1521 \\ 1 & 40 & 1600 \end{bmatrix},   β = \begin{bmatrix} β0 \\ β1 \\ β2 \end{bmatrix},   y = \begin{bmatrix} 4 \\ 2 \\ 4 \\ \vdots \\ 110 \\ 134 \end{bmatrix},    (2.23)

and when we insert them into the normal equations (2.17), the new parameter estimates are β̂0 = 1.58, β̂1 = 0.42 and β̂2 = 0.07. (Note that β̂0 and β̂1 change, compared to Example 2.3.) This new model looks like
[Plot of the data, the fitted second-order model (curve) and the two predictions, with Speed (mph, 0–40) on the horizontal axis and Distance (feet, 0–150) on the vertical axis.]
With this model, the predicted stopping distance is now ŷ⋆ = 87 feet for x⋆ = 33 mph, and ŷ⋆ = 153 feet for x⋆ = 45 mph. This can be compared to Example 2.3, which gives different predictions. Based on the data alone we cannot say that this is the 'true model', but by visually comparing this model with Example 2.3, the model with more features seems to follow the data slightly better. A systematic method to select between different features (other than just visually comparing plots) is cross-validation, see Chapter 5.
Figure 2.4: A linear regression model using RBF kernels (2.22) as features. Each kernel (dashed gray lines) is
located at c1 , c2 , c3 and c4 , respectively. When the model is learned from data, the parameters β0 , β1 , . . . , βp are
chosen such that the sum of all kernels (solid blue line) is fitted to the data in, e.g., a least squares sense.
Polynomials and RBF kernels are just two special cases, but we can of course consider any nonlinear
transformation of the inputs. To distinguish the ‘original’ inputs from the ‘new’ transformed inputs, the
term features is often used for the latter. To decide which features to use one approach is to compare
2.5 Qualitative input variables
Assume, for instance, that we have a qualitative input variable taking the two values type A and type B. We can then encode it with a dummy variable x, which is 0 for type A and 1 for type B, and use this variable in the linear regression model. This effectively gives us a linear regression model which looks like

y = β0 + β1 x + ε = { β0 + ε if type A,  β0 + β1 + ε if type B }.    (2.25)
The choice is somewhat arbitrary, and type A and B can of course be switched. Other choices, such as x = 1 or x = −1, are also possible. This approach can be generalized to qualitative input variables which take more than two values, let us say type A, B, C and D. With four different values, we create 3 = 4 − 1 dummy variables as

x1 = { 1 if type B, 0 if not type B },   x2 = { 1 if type C, 0 if not type C },   x3 = { 1 if type D, 0 if not type D }.    (2.26)
Qualitative inputs can be handled similarly in other problems and methods as well, such as logistic
regression, k-NN, deep learning, etc.
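As an illustration, a minimal NumPy sketch of the dummy-variable encoding (2.26); the variable name car_type and its values A–D are hypothetical:

import numpy as np

# A hypothetical qualitative input with the four values A, B, C and D.
car_type = np.array(["A", "B", "D", "C", "B", "A"])

# One dummy variable per value except the reference value A, cf. (2.26).
x1 = (car_type == "B").astype(float)
x2 = (car_type == "C").astype(float)
x3 = (car_type == "D").astype(float)

# These columns can be appended to the design matrix X of any quantitative
# inputs, and the model is then learned exactly as before.
X_dummy = np.column_stack([x1, x2, x3])
print(X_dummy)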
2.6 Regularization
Even though the linear regression model may at first glance (cf. Figure 2.1) seem to be a fairly rigid and inflexible model, it is not necessarily so. If more features are obtained by extending the model with nonlinear transformations as in Figures 2.3 or 2.4, or if the number of inputs p is large and the number of
data points n is small, one may experience overfitting. If considering data as consisting of ‘signal’ (the
actual information) and ‘noise’ (measurement errors, irrelevant effects, etc.), the term overfitting indicates
that the model is fitted not only to the ‘signal’ but also to the ‘noise’. An example of overfitting is given in
Example 2.5, where a linear regression model with p = 8 RBF kernels is learned from n = 9 data points.
Even though the model follows all data points very well, we can intuitively judge that the model is not
particularly useful: neither the interpolation (between the data points) nor the extrapolation (outside the
data range) appears sensible. Note that using p = n − 1 is an extreme case, but the conceptual problem
3. If the output variable is qualitative, then we have a classification (and not a regression) problem.
with overfitting is often present also in less extreme situations. Overfitting will be thoroughly discussed
later in Chapter 5.
A useful approach to handle overfitting is regularization. Regularization can be motivated by ‘keeping
the parameters β small unless the data really convinces us otherwise’, or alternatively ‘if a model with
small values of the parameters β fits the data almost as well as a model with larger parameter values,
the one with small parameter values should be preferred’. There are several ways to implement this
mathematically, which leads to slightly different solutions. We will focus on the ridge regression and
LASSO.
For linear regression, another motivation for regularization is when XᵀX is not invertible, meaning that (2.16) has no unique solution β̂. In such cases, regularization can be introduced in order to make XᵀX invertible and give (2.16) a unique solution. However, the concept of regularization extends well beyond linear regression and can be used also when working with other types of problems and models. For example, regularization-like methods are key to obtaining good performance in deep learning, as we will discuss in Section 7.4.
2.6.1 Ridge regression

In ridge regression (also known as Tikhonov regularization, L2 regularization, or weight decay) the least squares criterion (2.16) is replaced with the modified minimization problem

β̂ = arg minβ ‖Xβ − y‖²₂ + γ‖β‖²₂.    (2.28)

The value γ ≥ 0 is referred to as the regularization parameter and has to be chosen by the user. For γ = 0 we recover the original least squares problem (2.16), whereas if we let γ → ∞ we force all parameters βj to approach 0. A good choice of γ is in most cases somewhere in between, and depends on the actual problem. It can either be found by manual tuning, or in a more systematic fashion using cross-validation. It is actually possible to derive a version of the normal equations (2.17) for (2.28), namely

(XᵀX + γIp+1)β̂ = Xᵀy,    (2.29)

where Ip+1 is the identity matrix of size (p + 1) × (p + 1). If γ > 0, the matrix XᵀX + γIp+1 is always invertible, and we have the closed form solution

β̂ = (XᵀX + γIp+1)⁻¹Xᵀy.    (2.30)
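A short NumPy sketch of ridge regression via the regularized normal equations (2.29); the data is randomly generated purely for illustration:

import numpy as np

def ridge_fit(X, y, gamma):
    # Solve (X^T X + gamma * I) beta = X^T y, cf. (2.29)-(2.30).
    return np.linalg.solve(X.T @ X + gamma * np.eye(X.shape[1]), X.T @ y)

# Placeholder data: n = 9 data points and p = 8 features plus an intercept,
# i.e. the overfitting-prone p = n - 1 situation discussed above.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(9), rng.normal(size=(9, 8))])
y = rng.normal(size=9)

beta_ridge = ridge_fit(X, y, gamma=1.0)  # gamma is chosen by the user
beta_ls = ridge_fit(X, y, gamma=0.0)     # gamma = 0 recovers least squares
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ls))  # ridge gives smaller parameters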
2.6.2 LASSO
With LASSO (an abbreviation for Least Absolute Shrinkage and Selection Operator), or equivalently L1 regularization, the least squares criterion (2.16) is replaced with

β̂ = arg minβ ‖Xβ − y‖²₂ + γ‖β‖₁,    (2.31)

where ‖·‖₁ is the Manhattan norm. Contrary to ridge regression, there is no closed-form solution available for (2.31). It is, however, a convex problem which can be solved efficiently by numerical optimization. As for ridge regression, the regularization parameter γ has to be chosen by the user also in LASSO: γ = 0 gives the least squares problem and γ → ∞ gives β̂ = 0. Between these extremes, however, LASSO and ridge regression will result in different solutions: whereas ridge regression pushes all parameters
and ridge regression will result in different solutions: whereas ridge regression pushes all parameters
β0 , β1 , . . . , βp towards small values, LASSO tends to favor so-called sparse solutions where only a few
of the parameters are non-zero, and the rest are exactly zero. Thus, the LASSO solution can effectively
‘switch some of the inputs off’ by setting the corresponding parameters to zero and it can therefore be used
as an input (or feature) selection method.
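Because of the missing closed form, LASSO is solved numerically in practice. A hedged sketch using scikit-learn, assuming it is available (note that scikit-learn calls the regularization parameter alpha rather than γ and, unlike (2.28), does not penalize the intercept), illustrates the sparsity effect:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Placeholder data with a sparse 'true' parameter vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 1.0])
y = X @ beta_true + 0.1 * rng.normal(size=30)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge:", np.round(ridge.coef_, 2))  # all coefficients shrunk, but non-zero
print("lasso:", np.round(lasso.coef_, 2))  # typically several coefficients exactly zero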
Example 2.5: Least squares, ridge regression and LASSO

We consider the problem of learning a linear regression model (blue line) with p = 8 radial basis function (RBF) kernels as features from n = 9 data points (black dots). Since we have p = n − 1, the least squares solution can adapt perfectly to the training data.

(a) [Plot: the model learned with least squares. It passes through all data points, but neither the interpolation between the data points nor the extrapolation outside the data range appears sensible.]

(b) [Plot.] The same model, this time learned with ridge regression (2.28) with a certain value of γ. Despite not being perfectly adapted to the training data, this model appears to give a more sensible trade-off between fitting the data and avoiding overfitting than (a), and is probably more useful in most situations. The parameter values β̂ are now roughly evenly distributed in the range from −0.5 to 0.5.

(c) [Plot.] The same model again, this time learned with LASSO (2.31) with a certain value of γ. Again, this model is not perfectly adapted to the training data, but appears to have a more sensible trade-off between fitting the data and avoiding overfitting than (a), and is probably also more useful than (a) in most situations. In contrast to (b), however, 3 (out of 9) parameters in this model are exactly 0, and the rest are in the range from −1 to 1.
Ridge regression and LASSO are two popular special cases of regularization for linear regression. They both have in common that they modify the cost function, or optimization objective, of (2.16), and they can be seen as two instances of a more general regularization scheme containing three important elements: (i) one term which describes how well the model fits the data, (ii) one term which penalizes model complexity (large parameter values), and (iii) a trade-off parameter γ between them.
Linear regression has now been used for well over 200 years. It was first introduced independently by
Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809 when they discovered the method of least
squares. Due to its importance, linear regression is described in many textbooks on statistics and machine learning, such as Bishop (2006), Gelman et al. (2013), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012). While the basic least squares technique has been around for a long time, its
regularized versions are much younger. Ridge regression was introduced independently in statistics by
Hoerl and Kennard (1970) and in numerical analysis under the name of Tikhonov regularization. The
LASSO was first introduced by Tibshirani (1996). The recent monograph by Hastie, Tibshirani, and
Wainwright (2015) covers the development related to the use of sparse models and the LASSO.
2.A Derivation of the normal equations

In this appendix, we derive the normal equations (2.17),

XᵀXβ̂ = Xᵀy,

in different ways. We will present one derivation based on (matrix) calculus and one based on geometry and linear algebra.
No matter how (2.17) is derived, if XᵀX is invertible it (uniquely) gives the solution

β̂ = (XᵀX)⁻¹Xᵀy,

which is exactly (2.18).

For the calculus approach, let V(β) denote the least squares criterion (2.16), V(β) = ‖Xβ − y‖²₂. Its gradient with respect to β is

∂/∂β V(β) = −2Xᵀy + 2XᵀXβ.    (2.34)

Setting the gradient to zero at the minimizer β̂ gives

∂/∂β V(β̂) = 0  ⇔  −2Xᵀy + 2XᵀXβ̂ = 0  ⇔  XᵀXβ̂ = Xᵀy,    (2.35)

which is the normal equations (2.17).
For the second derivation, based on geometry, decompose y = y∥ + y⊥, where y∥ lies in the subspace of Rⁿ spanned by the columns ci of X and y⊥ is orthogonal to that subspace. Then

‖Xβ − y‖²₂ = ‖Xβ − (y⊥ + y∥)‖²₂ = ‖(Xβ − y∥) − y⊥‖²₂ ≥ ‖y⊥‖²₂.    (2.36)

This implies that if we choose β such that Xβ = y∥, the criterion ‖Xβ − y‖²₂ must have reached its minimum. Thus, our solution β̂ must be such that Xβ̂ − y is orthogonal to the (sub)space spanned by all columns ci, i.e.,

(y − Xβ̂)ᵀ cj = 0,  j = 1, . . . , p + 1    (2.38)

(remember that two vectors u, v are, by definition, orthogonal if their scalar product, uᵀv, is 0). Since the columns cj together form the matrix X, we can write this compactly as

(y − Xβ̂)ᵀ X = 0,    (2.39)

where the right hand side is the (p + 1)-dimensional zero vector. This can equivalently be written as

XᵀXβ̂ = Xᵀy,
3 The classification problem and three parametric classifiers
We will now study the classification problem. Whereas the regression problem has quantitative outputs,
classification is the situation with qualitative outputs. A method that performs classification is referred to
as a classifier. Our first classifier will be logistic regression, and we will in this chapter also introduce
the linear and quadratic discriminant analysis classifiers (LDA and QDA, respectively). More advanced
classifiers, such as classification trees, boosting and deep learning, will be introduced in the later chapters.
where y is the output (1, 2, . . . , or K) and x is the input. Note that we use p(y | x) to denote probability
masses (y qualitative) as well as probability densities (y quantitative). In words, p(y | x) describes the
probability for the output y (a class label) given that we know the input x. This probability will be a
cornerstone from now on, so we will first spend some effort to understand it well. Talking about p(y | x)
implies that we think about the class label y as a random variable. Why? Because we choose to model
the real world, from where the data originates, as involving a certain amount of randomness (cf. ε in
regression). Let us illustrate with an example:
1. In Chapter 6 we will use k = 1 and k = −1 instead.
If we are to describe voting preferences (= y, the qualitative output) among different population groups (= x, the input), we have to face the fact that not all people in a certain population group will vote for the same political party. To describe this mathematically, we can think of y as a random
variable which follows a certain probability distribution. If we knew that the vote count in the group of 45-year-old women (= x) is 13% for the cerise party, 39% for the turquoise party and 48% for the purple party, we could describe it as p(y = cerise | x) = 0.13, p(y = turquoise | x) = 0.39 and p(y = purple | x) = 0.48.
In this way, we use probabilities p(y | x) to describe the non-trivial fact that
(a) not all 45-year-old women vote for the same party, but
(b) the choice of party does not appear to be completely random among 45-year-old women either; the purple party is the most popular, and the cerise party is the least popular.
We now have a model for p(y = 1 | x) and p(y = 0 | x), which contains unknown parameters β that can
be learned from training data. That is, we have constructed a binary classifier, which is called logistic
regression.
3.2 Logistic regression
[Figure 3.1: The logistic function h(z), plotted for z between −10 and 10; it increases monotonically from 0 to 1.]
Remark 3.1 Despite its name, logistic regression is a method for classification, not regression! The
(confusing) name is due only to historical reasons.
This is the function which we would like to optimize with respect to β, cf. (3.5). For numerical reasons, it is often better to optimize the logarithm of ℓ(β) (since the logarithm is a monotone function, the maximizing argument is the same),

log ℓ(β) = ∑_{i: yi=1} ( βᵀxi − log(1 + e^{βᵀxi}) ) − ∑_{i: yi=0} log(1 + e^{βᵀxi})
         = ∑ᵢ₌₁ⁿ ( yi βᵀxi − log(1 + e^{βᵀxi}) ).    (3.8)

The simplification in the second equality relies on the chosen labeling, that yi = 0 or yi = 1, which is indeed the reason why this labeling is convenient.
A necessary condition for the maximum of log ℓ(β) is that its gradient is zero,

∇β log ℓ(β) = ∑ᵢ₌₁ⁿ xi ( yi − e^{βᵀxi} / (1 + e^{βᵀxi}) ) = 0.    (3.9)

2. We now add β to the expression p(y | x), to explicitly show its dependence also on β.
Note that this equation is vector-valued, i.e., we have a system of p + 1 equations to solve (with p + 1
unknown elements of the vector β). Contrary to the linear regression model (with Gaussian noise) in
Section 2.3.1, this maximum likelihood problem results in a nonlinear system of equations, lacking a
general closed form solution. Instead, we are forced to use a numerical solver, as discussed in Appendix B.
The standard choice is to use the Newton–Raphson algorithm (equivalent to the so-called iteratively
reweighted least squares algorithm), see e.g. Hastie, Tibshirani, and Friedman 2009, Chapter 4.4.
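To make the numerical optimization concrete, here is a minimal sketch, not of Newton–Raphson but of plain gradient ascent on the log-likelihood (3.8) using the gradient (3.9); the training data is simulated purely for illustration:

import numpy as np

# Simulated data: n = 100 points, two inputs plus an intercept column.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
beta_true = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=100) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(X.shape[1])
step = 0.1
for _ in range(5000):
    p1 = 1 / (1 + np.exp(-X @ beta))   # modelled p(y = 1 | x_i) for each i
    gradient = X.T @ (y - p1)          # the gradient in (3.9)
    beta = beta + step * gradient / len(y)

print(beta)  # should end up in the vicinity of beta_true for this simulated data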
The equation β T x = 0 parameterizes a (linear) hyperplane. Hence, the decision boundaries in logistic
regression always have the shape of a (linear) hyperplane.
We distinguish between different types of classifiers by the shape of their decision boundary: Since
logistic regression only has linear decision boundaries, it is consequently called a linear classifier.
Figure 3.2: Consider binary classification (y = 0 or 1) when the input x is scalar (horizontal axis). Once β is learned from training data (not shown), logistic regression gives us a model for p(y = 1 | x⋆) (blue) and p(y = 0 | x⋆) (red) for any test input x⋆. To turn these modeled probabilities into actual class predictions (ŷ⋆ is either 0 or 1), the class which is modeled to have the highest probability is taken as the prediction. The point(s) where the prediction changes from one class to another is called the decision boundary (dashed vertical line).
Figure 3.3: [Two panels with input x1 on the horizontal axis and x2 on the vertical axis.] (a) Logistic regression for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from different classes, and the intersection between the red and green fields is the decision boundary obtained for the logistic regression classifier learned from the training data. (b) Logistic regression for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of two classes is still linear.
where yik are the elements of the one-hot encoding vectors. We will not pursue any more details here, but
similarly to the binary case this likelihood function can also be used as objective function in numerical
optimization. The particular form of (3.16) will appear every time we use the one-hot encoding, and is
sometimes referred to as cross-entropy.
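A small NumPy sketch of the softmax parameterization and a cross-entropy objective of this kind; the logits z and the one-hot encoded labels below are placeholder values, and the normalization may differ from the exact form of (3.16):

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, probs):
    # Average negative log-likelihood of the one-hot encoded labels.
    return -np.sum(y_onehot * np.log(probs)) / len(probs)

z = np.array([[2.0, 0.5, -1.0],
              [0.1, 0.2, 0.3]])            # two data points, K = 3 classes
y_onehot = np.array([[1, 0, 0],
                     [0, 0, 1]], dtype=float)
print(cross_entropy(y_onehot, softmax(z)))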
Time to reflect 3.1: The softmax solution is actually slightly over-parameterized, compared to
binary logistic regression (3.4). That is not a problem in practice, but consider the case K = 2
and see if you can spot it!
Remark 3.2 The one-hot encoding will later be useful in deep learning. We will, however, not use it
for all multiclass methods. LDA, QDA and k-NN, for example, all use the vanilla encoding also for the
multi-class problem.
3.3 Linear and quadratic discriminant analysis (LDA & QDA)

A key tool for the LDA and QDA classifiers is Bayes' theorem,

p(y | x) = p(x | y) p(y) / ∑ₖ₌₁ᴷ p(x | k) p(k).    (3.17)

The left hand side, p(y | x), is our core interest in classification. In a practical machine learning problem,
neither the left nor the right hand side of (3.17) is known to us; we only have training data, and no one
provides us with any equations. In logistic regression, we went straight to the left hand side and modeled
that as (3.4). In LDA and QDA, instead, we focus on the right hand side, by assuming that p(x | y) has
a Gaussian distribution (no matter what the data actually looks like). Since this is now a distribution
over the input4 x, and x usually has more than one dimension, p(x | y) has to be a multivariate Gaussian
distribution with a mean vector and a covariance matrix.
Of course, we want a classifier which learns something from the training data. That is done by learning the parameters of the Gaussian distribution, the mean vector µ and the covariance matrix Σ, from the training data. In LDA the mean vector µ is assumed to be different for each class, but the covariance matrix Σ is assumed to be the same for all classes. In QDA, on the other hand, both the mean vector µ and the covariance matrix Σ are assumed to be different for each class. Since the mean vectors and the covariance matrices will be learned from data, we denote them with a hat symbol, µ̂ and Σ̂.
The right hand side of Bayes' theorem (3.17) also includes the factor p(y), which usually is unknown as well. The meaning of this term is the probability that a completely random data point has label y (without knowing its input x). As an approximation of p(y), the occurrence (proportion) of class k in the training data, denoted π̂k, is used, p(y = k) ≈ π̂k. For example, if 22% of the training data has label 1, we approximate p(y = 1) as π̂1 = 0.22, etc.
Thus, in LDA p(y | x) is modeled as

p(y = k | x⋆) = π̂k N(x⋆ | µ̂k, Σ̂) / ( ∑ⱼ₌₁ᴷ π̂j N(x⋆ | µ̂j, Σ̂) ),    (3.18)
Figure 3.4: LDA and QDA are derived by assuming that p(x | y) has a Gaussian distribution. This means that we think about the input variables as random and assume that they have a certain distribution. [Two panels in the (x1, x2) plane, each showing level curves for p(x | y = 0) and p(x | y = 1) together with training data points xi with labels yi = 0 and yi = 1.] (a) In LDA, it is assumed that the input x, for a given output y, is distributed as a Gaussian distribution. The mean is different for each class y, but the covariance is the same for all classes; the classes only differ in location. This plot shows what we assume the training data looks like when we derive LDA. (b) Also in QDA, it is assumed that the input x, for a given output y, is distributed as a Gaussian distribution, but for QDA both the mean and the covariance (the shape of the level curves) can differ between the classes. This plot shows what we assume the training data looks like when we derive QDA. In fact, when using LDA and QDA in practice, these assumptions on how the inputs x are distributed are rarely satisfied, but this is nevertheless the way we motivate the methods.
In full analogy, for QDA we have (the only difference is the covariance matrix Σ̂k)

p(y = k | x⋆) = π̂k N(x⋆ | µ̂k, Σ̂k) / ( ∑ⱼ₌₁ᴷ π̂j N(x⋆ | µ̂j, Σ̂j) ).    (3.19)
Note that we have not made any restriction on K; we can use LDA and QDA for binary as well as multi-class classification. We will now discuss some aspects in more detail.
We have derived LDA and QDA by studying Bayes' theorem (3.17). We are ultimately interested in the left hand side of (3.17), and we got there by making an assumption about the right hand side, namely that p(x | y) has a Gaussian distribution. In most practical cases that assumption does not hold in reality (or, at least, it is hard for us to verify whether it holds or not), but LDA as well as QDA turn out to be useful classifiers even when that assumption does not hold.
How do we go about it in practice, if we want to learn an LDA or QDA classifier from training data {(xi, yi)}, i = 1, . . . , n (without knowing anything about the real distribution p(x | y)), and use it to make predictions?
The class proportions are learned as

π̂k = nk / n,    (3.20a)

where nk is the number of training data samples in class k. Consequently, all nk must sum to n, and thereby ∑ₖ π̂k = 1. Further, the mean vector µk of each class is learned as

µ̂k = (1 / nk) ∑_{i: yi=k} xi,    (3.20b)

the empirical mean among all training samples of class k. For LDA, the common covariance matrix Σ for all classes is usually learned as

Σ̂ = (1 / (n − K)) ∑ₖ₌₁ᴷ ∑_{i: yi=k} (xi − µ̂k)(xi − µ̂k)ᵀ,    (3.20c)

which can be shown to be an unbiased estimate of the covariance matrix⁵. For QDA, one covariance matrix Σk has to be learned for each class k = 1, . . . , K, usually as

Σ̂k = (1 / (nk − 1)) ∑_{i: yi=k} (xi − µ̂k)(xi − µ̂k)ᵀ.    (3.20d)
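A minimal NumPy sketch of the learning step (3.20a)–(3.20c) for LDA; the training data is randomly generated purely for illustration:

import numpy as np

def lda_fit(X, y, K):
    # Learn pi_k (3.20a), mu_k (3.20b) and the shared covariance Sigma (3.20c).
    n, p = X.shape
    pis, mus = [], []
    Sigma = np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        pis.append(len(Xk) / n)               # (3.20a)
        mu_k = Xk.mean(axis=0)                # (3.20b)
        mus.append(mu_k)
        Sigma += (Xk - mu_k).T @ (Xk - mu_k)  # accumulate the sum in (3.20c)
    return np.array(pis), np.array(mus), Sigma / (n - K)

# Placeholder training data with K = 2 classes and p = 2 inputs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(20, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

pis, mus, Sigma = lda_fit(X, y, K=2)
print(pis, mus, Sigma, sep="\n")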
Remark 3.3 To derive the learning of LDA and QDA, we did not make use of the maximum likelihood
idea, in contrast to linear and logistic regression. Furthermore, learning LDA and QDA amounts to
inserting the training data into the closed-form expressions (3.20), similar to linear regression (the normal
equations), but different from logistic regression (which requires numerical optimization).
Making predictions

We summarize this in Algorithms 2 and 3, and illustrate it in Figures 3.5 and 3.6.

5. This means that if we estimate Σ̂ like this for new training data over and over again, the average would be the true covariance matrix of p(x).
Figure 3.5: An illustration of LDA for K = 3 classes, with dimension p = 1 of the input x. In the upper left panel the Gaussian model of p(x | k) is shown, parameterized by µ̂k and Σ̂. The parameters µ̂k and Σ̂, as well as π̂k, are learned from training data, not shown in the figure. (Since p = 1, we only have a scalar variance instead of a covariance matrix Σ̂.) In the upper right panel π̂k, an approximation of p(k), is shown. These are used in Bayes' theorem to compute p(k | x), shown in the bottom panel. We take the final prediction as the class which is modeled to have the highest probability, i.e., the topmost solid colored line in the bottom plot (e.g., the prediction for x⋆ = 0.7 would be ŷ = 2 (green)). The decision boundaries (vertical dotted lines in the bottom plot) are hence found where the solid colored lines intersect.
Figure 3.6: An illustration of QDA for K = 3 classes, in the same fashion as Figure 3.5. However, in contrast to LDA in Figure 3.5, the learned variance Σ̂k of p(x | k) is different for different k (upper left panel). For this reason the resulting decision boundaries (bottom panel) can be more complicated than for LDA; note for instance the small slice of ŷ = 3 (blue) in between ŷ = 1 (red) and ŷ = 2 (green) around −0.5.
Once we have learned the parameters from training data, we can compute ŷ⋆ for a test input x⋆ by inserting everything into (3.18) for each class k, and take the prediction as the class which is predicted to have the highest probability p(y | x). As it turns out, the equations (3.18) and (3.19) are simple enough that we can, using only pen and paper, say something about the decision boundary, i.e., the boundary (in the input space) where the predictions shift between different classes.
If we note that neither the logarithm nor terms independent of k change the location of the maximizing argument (arg maxₖ), we can for LDA write

ŷLDA = arg maxₖ { log π̂k + xᵀΣ̂⁻¹µ̂k − ½ µ̂kᵀΣ̂⁻¹µ̂k } = arg maxₖ δkLDA(x).

The function δkLDA(x) on the last row is sometimes referred to as the discriminant function. The points x on the boundary between two class predictions, say k = 0 and k = 1, are characterized by δ0LDA(x) = δ1LDA(x), i.e., the decision boundary between the two classes 0 and 1 can be written as the set of points x which fulfill

xᵀΣ̂⁻¹(µ̂0 − µ̂1) = log π̂1 − log π̂0 + ½ µ̂0ᵀΣ̂⁻¹µ̂0 − ½ µ̂1ᵀΣ̂⁻¹µ̂1.

From linear algebra, we know that {x : xᵀA = c}, for some matrix A and some constant c, defines a hyperplane in the x-space. Thus, the decision boundary for LDA is always linear, hence its name, linear discriminant analysis.
For QDA we can do a similar derivation,

ŷQDA = arg maxₖ { log π̂k − ½ log det Σ̂k − ½ µ̂kᵀΣ̂k⁻¹µ̂k + xᵀΣ̂k⁻¹µ̂k − ½ xᵀΣ̂k⁻¹x },    (3.25)

where the expression inside the braces is denoted δkQDA(x),
and set δ0QDA(x) = δ1QDA(x) to find the decision boundary as the set of points x for which

log π̂0 − ½ log det Σ̂0 − ½ µ̂0ᵀΣ̂0⁻¹µ̂0 + xᵀΣ̂0⁻¹µ̂0 − ½ xᵀΣ̂0⁻¹x
  = log π̂1 − ½ log det Σ̂1 − ½ µ̂1ᵀΣ̂1⁻¹µ̂1 + xᵀΣ̂1⁻¹µ̂1 − ½ xᵀΣ̂1⁻¹x

⇔ xᵀ(Σ̂0⁻¹µ̂0 − Σ̂1⁻¹µ̂1) − ½ xᵀ(Σ̂0⁻¹ − Σ̂1⁻¹)x
  = log π̂1 − log π̂0 − ½ log det Σ̂1 + ½ log det Σ̂0 − ½ µ̂1ᵀΣ̂1⁻¹µ̂1 + ½ µ̂0ᵀΣ̂0⁻¹µ̂0,    (3.26)

where the right hand side is a constant (independent of x).
This is now of the form {x : xᵀA + xᵀBx = c}, a quadratic form in x, and the decision boundary for QDA is thus always quadratic (and thereby nonlinear!), which is the reason for its name, quadratic discriminant analysis.
Figure 3.7: Examples of decision boundaries for LDA and QDA, respectively. [Four panels with input x1 on the horizontal axis and x2 on the vertical axis.] (a) LDA for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from different classes, and the intersection between the red and green fields is the decision boundary obtained for an LDA classifier learned from the training data. (b) LDA for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of two classes is still linear. (c) QDA has quadratic (i.e., nonlinear) decision boundaries, as in this example where a QDA classifier is learned from the shown training data. (d) With K = 3 classes, the decision boundaries for QDA can be more complex than with LDA, as in this case (cf. (b)). This can be compared to Figure 3.3, where the decision boundary for logistic regression (with the same training data) is shown. LDA and logistic regression both have linear decision boundaries, but they are not identical.
3.4 Bayes’ classifier — a theoretical justification for turning p(y | x) into ŷ
[Two bar plots of p(y | x): one for a problem with two classes y = 0, 1 and one with four classes y = 1, 2, 3, 4.]
Figure 3.8: Bayes' classifier: the probabilities p(y | x) are shown as the heights of the bars. Bayes' classifier says that if we want to make as few misclassifications as possible, on average, we should predict ŷ as the class which has the highest probability.
Assume that we want to design a classifier which, on the average, makes as few misclassification errors as
possible. That means that the predicted output label yb should equal the true output label y for as many test
data points as possible. If we knew the probabilities p(y | x) exactly (in logistic regression, LDA, QDA,
and all other classifiers, we only have a model—a guess—for p(y | x), we never know it exactly), then the
optimal classifier is given by

ŷ = arg maxₖ p(y = k | x⋆).    (3.27)

Or, in words, the optimal classifier predicts ŷ as the label which has the highest probability given the input x. The optimal classifier (3.27) is Bayes' classifier. This is illustrated by Figure 3.8. Let us
first show why this is optimal, and then discuss how it connects to the other classifiers.
Making, on average, as many correct predictions as possible means that we want ŷ (which is not random; we can choose it ourselves) to be as likely as possible to equal y (which is random). How likely ŷ is to equal y can be expressed mathematically using the expected value over the distribution of the random variable y, which we write as E_{y∼p(y|x)}. Using the indicator function I{·} (one when its argument is true, otherwise zero) we can write

E_{y∼p(y|x)}[ I{ŷ = y} ] = ∑ₖ₌₁ᴷ I{ŷ = k} p(y = k | x) = p(ŷ | x),    (3.28)

where we used the definition of the expected value in the first step, and ignored all terms equal to zero in the second step. In order to make this quantity as large as possible, we should select ŷ such that p(ŷ | x) is as large as possible, which is also what we claimed in (3.27). Let us now turn to how the classifiers discussed in these notes connect to Bayes' classifier:
• In linear and quadratic discriminant analysis (LDA and QDA), p(y | x) is computed by Bayes’
theorem (3.17), in which p(x | y) is assumed to be a Gaussian distribution (with mean and variance
learned from the training data) and p(y) is taken as the empirical distribution of the training data.
• In k-nearest neighbor (k-NN), p(y | x) is modeled as the empirical distribution in the k-nearest
samples in the training data.
• In tree-based methods, p(y | x) is modeled as the empirical distribution among the training data
samples in the same leaf node.
• In deep learning, p(y | x) is modeled using a deep neural network and a softmax function.
All these classifiers use different ways to model/learn/approximate p(y | x), and the default choice⁶ is thereafter to predict ŷ according to Bayes' classifier (3.27), that is, to pick the prediction as the class y which is modeled to have the highest probability p(y | x). With only two classes, this means that the prediction is taken as the class which is modeled to have probability > 0.5.
• Bayes' classifier is optimal only if the goal is to make as few misclassifications as possible. That might sound like an obvious goal, but it is not always the case! If we are to predict the health status of a patient, falsely predicting ŷ = 'well' might be much more severe than falsely predicting ŷ = 'bad' (or, perhaps, vice versa?). Sometimes the goal is therefore asymmetric, and Bayes' classifier (3.27) is not optimal for such situations.
• Bayes’ classifier is guaranteed to be optimal only if we know p(y | x) exactly. However, when
we only have an approximation of p(y | x), it is not guaranteed that (3.27) is the best thing to do
anymore.
6. Sometimes this is not very explicit in the method, but if you look carefully, you will find it.
3.5.2 Regularization

As with linear regression (Section 2.6), overfitting might be a problem if n (the number of training data samples) is not much bigger than p (the number of inputs). We will define and discuss overfitting in more detail in Chapter 5. However, regularization can be useful also in classification to avoid overfitting. A common regularization approach for logistic regression is a ridge-regression-like penalty on β, cf. (2.28). For LDA and QDA, it can be useful to regularize the covariance matrix estimates ((3.20c) and (3.20d)).
In many binary classification problems, one of the classes (say y = 1, for example the presence of a rare disease) is much less common than the other (y = 0). Such problems have two characteristic difficulties:
(i) Most of the data usually has y = 0, meaning that a classifier which always predicts ŷ = 0 might score well if we only care about the number of correct classifications (accuracy). Indeed, a medical support system which always predicts "healthy" is probably correct most of the time, but nevertheless useless.
(ii) A missed detection (predicting ŷ = 0 when in fact y = 1) might have much more severe consequences than a false detection (predicting ŷ = 1 when in fact y = 0).
For such classification problems, there is a set of analysis tools and terminology which we will introduce
now.
Ratio Name
FP/N False positive rate, Fall-out, Probability of false alarm
TN/N True negative rate, Specificity, Selectivity
TP/P True positive rate, Sensitivity, Power, Recall, Probability of detection
FN/P False negative rate, Miss rate
TP/P* Positive predictive value, Precision
FP/P* False discovery rate
TN/N* Negative predictive value
FN/N* False omission rate
P/n Prevalence
(TN+TP)/n Accuracy
Table 3.1: Common terminology related to the quantities (TN, FN, FP, TP) in the confusion matrix.
Confusion matrix

If one learns a binary classifier and evaluates it on a test dataset, a simple yet useful way to visualize the result is a confusion matrix. By separating the test data into four groups depending on y (the actual output) and ŷ (the output predicted by the classifier), we can make the following table:

                  y = 0 (negative)        y = 1 (positive)
ŷ = 0        true negatives (TN)      false negatives (FN)
ŷ = 1       false positives (FP)       true positives (TP)

Here N = TN + FP and P = FN + TP denote the total numbers of negative and positive test data points, respectively, N* = TN + FN and P* = FP + TP the corresponding totals for the predictions, and n = N + P the total number of test data points.
ROC curve
As suggested by the example above, the tuning of the threshold t in (3.30) can be crucial for the performance
in binary classification. If we want to compare different classifiers (say, logistic regression and QDA) for a
certain problem beyond the specific choice of t, the ROC curve can be useful. The abbreviation ROC means "receiver operating characteristics", a name due to its history in communications theory.
To plot an ROC curve, the true positive rate (TP/P) is drawn against the false positive rate (FP/N) for
all values of t ∈ [0, 1]. The curve typically looks as shown in Figure 3.9. An ROC curve for a perfect
classifier (always predicting the correct value with full certainty) touches the upper left corner, whereas a
classifier which only assigns random guesses gives a straight diagonal line.
A compact summary of the ROC curve is the area under the ROC curve, AUC. From Figure 3.9, we
conclude that a perfect classifier has AUC 1, whereas a classifier which only assigns random guesses has
AUC 0.5.
The thyroid is an endocrine gland in the human body. The hormones it produces influence the metabolic rate and protein synthesis, and thyroid disorders may have serious implications. We
consider the problem of detecting thyroid diseases, using the dataset provided by UCI Machine
Learning Repository (Dheeru and Karra Taniskidou 2017). The dataset contains 7200 data points,
each with 21 medical indicators as inputs (both qualitative and quantitative). It also contains the
qualitative diagnosis {normal, hyperthyroid, hypothyroid}, which we convert into the binary
problem with only {normal, not normal} as outputs. The dataset is split into a training and test
part, with 3772 and 3428 samples respectively. We train a logistic regression classifier on the
training dataset, and use it for predicting the test dataset (using the default t = 0.5), and obtain the
following confusion matrix:
                  y = normal    y = not normal
ŷ = normal              3177               237
ŷ = not normal             1                13
Most test data points are correctly predicted as normal, but a large part of the not normal data is
also falsely predicted as normal. This might indeed be undesired in the application.
To change the picture, we change the threshold to t = 0.15, and obtain new predictions with the
following confusion matrix instead:
                  y = normal    y = not normal
ŷ = normal              3067               165
ŷ = not normal           111                85
This change gives a significantly better true positive rate (85 instead of 13 patients are correctly
predicted as not normal), but this happens at the expense of a worse false positive rate (111,
instead of 1, patients are now falsely predicted as not normal). Whether it is a good trade-off
depends, of course, on the specifics of the application: which type of error has the most severe
consequences?
For this problem, only considering the total accuracy (or, equivalently, the misclassification rate) would not be very informative. In fact, the useless predictor that always predicts normal would give an accuracy of almost 93%, whereas the second confusion matrix above corresponds to an accuracy of 92%, even though it would probably be much more useful in practice.
[Figure 3.9: A typical ROC curve (solid), with the false positive rate on the horizontal axis and the true positive rate on the vertical axis; the threshold t increases along the curve. A perfect classifier touches the upper left corner, whereas a classifier making random guesses follows the diagonal.]
4 Non-parametric methods for regression and
classification: k-NN and trees
The methods (linear regression, logistic regression, LDA and QDA) we have encountered so far all have a
fixed set of parameters. The parameters are learned from the training data, and once the parameters are
learned and stored, the training data is not used anymore and could be discarded. Furthermore, all those
methods have had a fixed structure; if the amount of training data increases, the parameters can be estimated more accurately, with smaller variance, but the flexibility or expressiveness of the model does not increase; logistic regression can only describe linear decision boundaries, no matter how much training data is available.
There exists another class of methods, not relying on a fixed structure and set of parameters, but which
adapts more to the training data. Two methods in this class, which we will encounter now, are k-nearest
neighbors (k-NN) and tree-methods. They can both be used for classification as well as regression, but we
will focus our presentation on the classification problem.
4.1 k-NN
The name k-nearest neighbors (k-NN) is almost self-explanatory. To approximate p(y | x? ), the proportion of the different classes among the k training data points closest to x? is used. The value k is a user-chosen integer ≥ 1 which controls the properties of the classifier, as discussed below. Formally, we define the set R? = {i : xi is one of the k training data points closest to x? } and can then express the classifier as

p(y = j \mid x_\star) = \frac{1}{k} \sum_{i \in R_\star} \mathbb{I}\{y_i = j\} \qquad (4.1)

for j = 1, 2, . . . , K. Following Bayes’ classifier and always predicting the class which has the largest probability, k-NN simply amounts to a majority vote among the k nearest training data points. We explain this by the example below.
Note that in contrast to the previous methods that we have discussed there is no parametric model to be
trained, with which we then compute the prediction yb. Therefore, we talk about k-NN as a non-parametric
method. Instead, yb depends on the training data in a more direct fashion. The k-NN method can be
summarized in the following algorithm.
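The algorithm box itself is not reproduced here; as a substitute, the following is a minimal Python sketch of the k-NN procedure just described (illustrative only, with our own function name), applied to the data of the example below.

```python
import numpy as np

def knn_predict(X, y, x_star, k):
    """Majority vote among the k training points closest to x_star; also return class proportions."""
    dists = np.linalg.norm(X - x_star, axis=1)        # Euclidean distances to all training points
    neighbors = np.argsort(dists)[:k]                 # the set R_star of the k nearest points
    classes, counts = np.unique(y[neighbors], return_counts=True)
    probs = dict(zip(classes, counts / k))            # class proportions, as in (4.1)
    return classes[np.argmax(counts)], probs

# The training data of Example 4.1 below, and the test point x_star = [1, 2]:
X = np.array([[-1, 3], [2, 1], [-2, 2], [-1, 2], [-1, 0], [1, 1]], dtype=float)
y = np.array(["Red", "Blue", "Red", "Blue", "Blue", "Red"])
print(knn_predict(X, y, np.array([1.0, 2.0]), k=1))   # predicts Red
print(knn_predict(X, y, np.array([1.0, 2.0]), k=3))   # predicts Blue, with p(Blue | x_star) = 2/3
```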
We are given a training data set with n = 6 observations of p = 2 input variables x1 , x2 and one
(qualitative) output y, the color Red or Blue,
i x1 x2 y
1 −1 3 Red
2 2 1 Blue
3 −2 2 Red
4 −1 2 Blue
5 −1 0 Blue
6 1 1 Red
and we are interested in predicting the output for x? = [1 2]T . For this purpose, we will explore
two different k-NN classifiers, one using k = 1 and one using k = 3.
First, we compute the Euclidean distance kxi − x? k between each training data point xi and the test data point x? , and then sort the data points in ascending order of distance.
i    kxi − x? k    yi
6    √1            Red
2    √2            Blue
4    √4            Blue
1    √5            Red
5    √8            Blue
3    √9            Red

(Figure: the six training data points plotted in the x1 –x2 plane, with the test point x? marked as a black filled circle; an inner circle encloses the k = 1 nearest neighbor and an outer circle encloses the k = 3 nearest neighbors.)
Since the closest training data point to x? is the data point i = 6 (Red), it means that for k-NN
with k = 1, we get the model p(Red | x? ) = 1 and p(Blue | x? ) = 0. This gives the prediction
yb? = Red.
Further, for k = 3, the 3 nearest neighbors are i = 6 (Red), i = 2 (Blue), and i = 4 (Blue), which gives the model p(Red | x? ) = 1/3 and p(Blue | x? ) = 2/3. The prediction, which also can be seen as a majority vote among those 3 training data points, thus becomes yb? = Blue.
This is also illustrated by the figure above where the training data points xi are represented with red squares and blue triangles depending on which class they belong to. The test data point x?
is represented with a black filled circle. For k = 1 the closest training data point is identified by
the inner circle and for k = 3 the three closest points are identified by the outer circle.
In Example 4.1 we only computed a prediction for one single test data point x? . If we would shift that test point one step to the left, to x?alt = [0 2]T , the three closest training data points would still include i = 6 and i = 4, but i = 2 is now exchanged for i = 1. For k = 3 this would give the approximation p(Red | x?alt ) = 2/3 and we would predict yb = Red. In between these two test data points x? and x?alt , at [0.5 2]T , it is equally far to i = 1 as to i = 2, and this point would consequently be located at the decision boundary between the two classes. Continuing this way of reasoning we can sketch the full decision
boundaries in Example 4.1 which are displayed in Figure 4.1. Obviously, k-NN is not restricted to linear
decision boundaries and is therefore a nonlinear classification method.
(a) k = 1    (b) k = 3
Figure 4.1: Decision boundaries for the problem in Example 4.1 for the two choices of the parameter k.
4.1.2 Choosing k
The user has to decide on which k to use in k-NN and this decision has a big impact on the final classifier.
In Figure 4.2 another scenario is illustrated with p = 2 input variables, K = 3 classes and significantly
more training data samples. In the two subfigures the decision boundaries for a k-NN classifier with k = 1
and k = 11 are illustrated.
By definition, with k = 1 all training data points are classified correctly and the boundaries are more
adapted to what the training data exactly looks like (including its ’noise’ and other random effects). With
the averaging procedure that takes place for the k-NN classifier with k = 11, some training data points
end up in the wrong region and the decision boundaries are less adapted to this specific realization of
the training data. Even though k-NN with k = 1 fits all training data points perfectly, the one with
k = 11 might be preferred, since it is less prone to overfit to the training data meaning there are good
reasons to believe this model would perform better on test data. A systematic way of choosing k is to use
cross-validation, and we will discuss these aspects more in Chapter 5.
Figure 4.2: (a) Decision boundary for k-NN with k = 1 for a 3-class problem: a complex, but possibly also overfitted, decision boundary. (b) Decision boundary for k-NN with k = 11: a more rigid and less flexible decision boundary.
4.1.3 Normalization
Finally, one practical aspect crucial for k-NN is worth mentioning: the importance of normalization of the input data. Since k-NN is based on the Euclidean distances between points, it is important that these distances are a valid measure of the closeness between two data points. Imagine a training data
set with two input variables xi = [xi1 , xi2 ]T where all values of xi1 are in the range [0, 100] and the values for xi2 in the much smaller range [0, 1]. This could for example be the case if xi1 and xi2 represent different physical quantities (where the values can be quite different, depending on which unit is used). In this case the Euclidean distance between a test point x? and a training data point, kxi − x? k = \sqrt{(x_{i1} - x_{?1})^2 + (x_{i2} - x_{?2})^2}, would almost only depend on the first term (xi1 − x?1 )2 and the values of the second component xi2 would have a small impact.
One way to overcome this problem is to divide the first component by 100, creating x_{i1}^{new} = x_{i1}/100, such that both components are in the range [0, 1]. More generally, this normalization procedure for the input data can be written as

x_{ij}^{new} = \frac{x_{ij} - \min(x_{ij})}{\max(x_{ij}) - \min(x_{ij})}, \qquad \forall j = 1, . . . , p, \; i = 1, . . . , n. \qquad (4.2)
Another popular way of normalizing is by using the mean and standard deviation in the training data:
x_{ij}^{new} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}, \qquad \forall j = 1, . . . , p, \; i = 1, . . . , n, \qquad (4.3)
where x̄j and σj are the mean and standard deviation for each input variable, respectively.
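As a small illustration (not part of the original text), the two normalization schemes (4.2) and (4.3) could be implemented along the following lines in Python; in practice the normalization statistics should be computed on the training data and reused for test points.

```python
import numpy as np

def minmax_normalize(X):
    """Scale each input variable (column) to the range [0, 1], as in (4.2)."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def standardize(X):
    """Subtract the mean and divide by the standard deviation of each column, as in (4.3)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy usage: x1 spans [0, 100] and x2 spans [0, 1]; after normalization, neither
# variable dominates the Euclidean distances used by k-NN.
X = np.array([[90.0, 0.2], [10.0, 0.9], [55.0, 0.1]])
print(minmax_normalize(X))
print(standardize(X))
```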
4.2 Trees
Tree-based methods divide the input space into different regions. Within each region, p(y | x) is modeled
as the empirical distribution among the training data samples in the same region. The rules to divide the
input space can be summarized in a tree, and hence these methods are known as decision trees. Trees can
be used for both regression and classification problems. Here, our description will focus on classification
trees.
4.2.1 Basics
In a classification tree the function p(y | x) is modeled with a series of rules on the input variables
x1 , . . . , xp . These rules can be represented by a binary tree. This tree effectively divides the input space
into multiple regions and in each region a constant value for the predicted class probability p(y | x) is
assigned. We illustrate this with an example.
Consider a problem with two input variables x1 and x2 and one qualitative output y, the color red or blue. A classification tree for this problem can look like the one below. To use this tree to classify a new point x? = [x?1 , x?2 ]T we start at the top and work our way down until we reach the end of a branch. Each such final branch corresponds to a constant predicted class probability p(Red|x? ).
(Left) A classification tree: the root node splits on x2 < 3.0; its left branch leads to the leaf node R1 with p(Red|x? ) = 0, and its right branch splits on x1 < 5.0 into the leaf nodes R2 with p(Red|x? ) = 1/3 and R3 with p(Red|x? ) = 1. At each internal node, a rule of the form xj < sk indicates the left branch coming from that split, and the right branch then consequently corresponds to xj ≥ sk . This tree has two internal nodes and three leaf nodes.
(Right) The region partition corresponding to the classification tree: R1 is the region x2 < 3.0, R2 the region x2 ≥ 3.0, x1 < 5.0, and R3 the region x2 ≥ 3.0, x1 ≥ 5.0. Each region corresponds to a leaf node in the classification tree to the left, and each border between regions corresponds to a split in the tree. Each region is marked with the color representing the highest predicted class probability.
A pseudo code for classifying a test input with the tree above would simply follow the splits from the root down to a leaf.
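The original pseudo code is not reproduced here; the following minimal Python sketch (with a function name of our choosing) implements the same logic, hard-coding the two splits and the leaf probabilities of the example tree.

```python
def predict_red_probability(x):
    # x = [x1, x2]; the tree from the example above
    if x[1] < 3.0:        # first internal node: x2 < 3.0 -> leaf R1
        return 0.0
    elif x[0] < 5.0:      # second internal node: x1 < 5.0 -> leaf R2
        return 1 / 3
    else:                 # x1 >= 5.0 -> leaf R3
        return 1.0

print(predict_red_probability([2.5, 3.5]))   # 0.333..., matching the worked example below
```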
As an example, if we have x? = [2.5, 3.5]T , in the first split we would take the right branch since x?2 = 3.5 ≥ 3.0, and in the second split we would take the left branch since x?1 = 2.5 < 5.0. Consequently, for this test point we would get p(Red|x? ) = 1/3 and hence p(Blue|x? ) = 2/3. This classification tree can also be represented as splitting the input space into multiple rectangle-shaped regions. This is illustrated in the right figure above.
To set the terminology, the endpoints of the branches, R1 , R2 and R3 in the example, are called leaf nodes, and the internal splits, x2 < 3.0 and x1 < 5.0, are known as internal nodes. The lines that connect the nodes are referred to as branches. Note that in an example with more than two input variables, the region partition (the right figure in the example) is difficult to draw, but the tree works in the same way, with exactly two branches coming out of each internal node.
This example illustrates how a classification tree can be used to make predictions, but how do we learn
the tree from training data? This will be explained in the next section.
where M is the total number of regions (leaf nodes) in the tree and where I{x ∈ Rm } = 1 if x ∈ Rm and 0 otherwise. Since the probabilities should sum to 1 in each region we also have the constraint \sum_{k=1}^{K} c_{mk} = 1.
The overall goal in constructing a classification tree based on training data {xi , yi }ni=1 is to find a tree
that makes the observed training data as likely as possible. This approach is known as the maximum
likelihood method, which we also used to derive a solution to the logistic regression problem previously.
Maximizing the likelihood is equivalent to minimizing the negative logarithm of the likelihood.
Therefore, we want to find a tree T which minimizes the following expression:
Here πbmk is the proportion of training data points in region Rm that are from class k with nm being the
total number of training data points in region m. We can show1 that
-\sum_{k=1}^{K} \hat{\pi}_{mk} \log c_{mk} = \underbrace{\sum_{k=1}^{K} \hat{\pi}_{mk} \log \frac{\hat{\pi}_{mk}}{c_{mk}}}_{\geq 0} - \sum_{k=1}^{K} \hat{\pi}_{mk} \log \hat{\pi}_{mk} \geq -\sum_{k=1}^{K} \hat{\pi}_{mk} \log \hat{\pi}_{mk} , \qquad (4.7)
which is fulfilled with equality if c_{mk} = \hat{\pi}_{mk} . Hence, minimizing (4.6) with respect to c_{mk} gives c_{mk} = \hat{\pi}_{mk} . It remains to find the regions Rm , and based on the discussion above we want to select the regions in order to minimize
\min \sum_{m=1}^{M} n_m Q_m(T), \quad \text{where } Q_m(T) = -\sum_{k=1}^{K} \hat{\pi}_{mk} \log \hat{\pi}_{mk} \qquad (4.8)
1 We use the so-called log sum inequality and the two constraints \sum_{k=1}^{K} c_{mk} = 1 and \sum_{k=1}^{K} \hat{\pi}_{mk} = 1 for all m = 1, . . . , M .
The split depends on the index j of the input variable at which the split is performed and the cutpoint s.
The corresponding proportions πbmk will also depend on j and s.
\hat{\pi}_{1k}(j, s) = \frac{1}{n_1} \sum_{i: x_i \in R_1(j,s)} \mathbb{I}\{y_i = k\}, \qquad \hat{\pi}_{2k}(j, s) = \frac{1}{n_2} \sum_{i: x_i \in R_2(j,s)} \mathbb{I}\{y_i = k\}.
For each input variable we can scan through the finite number of possible splits and pick the pair (j, s) which minimizes (4.9). After that, we repeat the process to create new splits by finding the best values (j, s) for each of the new branches. We continue the process until some stopping criterion is reached, for example until no region contains more than five training data points.
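A rough Python sketch (ours, for illustration only) of one step of this greedy procedure, using the entropy criterion: for every input variable j and candidate cutpoint s, split into xj < s and xj ≥ s and pick the split with the lowest total cost n1 Q1 + n2 Q2.

```python
import numpy as np

def entropy(y):
    """Entropy criterion Q_m(T) for the labels y falling in one region."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))          # counts > 0, so 0*log 0 never occurs

def best_split(X, y):
    """Return (cost, j, s) for the split x_j < s minimizing n1*Q1 + n2*Q2."""
    best_cost, best_j, best_s = np.inf, None, None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:        # cutpoints between consecutive values
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            cost = len(left) * entropy(left) + len(right) * entropy(right)
            if cost < best_cost:
                best_cost, best_j, best_s = cost, j, s
    return best_cost, best_j, best_s
```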
The tree in Example 4.2 has been constructed based on the methodology outlined above, which we will
illustrate in the example below.
We consider the same setup as in Example 4.2 with the following dataset
(Table and plot of the n = 10 data points, with inputs x1 , x2 and output y; the first data point is x1 = 9.0, x2 = 2.0, y = Blue.)
We want to learn a classification tree, by using the entropy criterion in (4.8) and growing the tree
until there are no regions with more than five data points left.
First split: There are infinitely many possible splits we can make, but all splits which give the same partition of the data points are equivalent. Hence, in practice we only have nine different
2 If any \hat{\pi}_{mk} is equal to 0, the term 0 log 0 is taken to be zero, which is also consistent with the limit lim_{r→0+} r log r = 0.
splits to consider in this dataset. The data and these splits (dashed lines) are visualized in the figure
above.
We consider all nine splits in turn. We start with the split at x1 = 2.5 which splits the input
space into the two regions R1 = x1 < 2.5 and R2 = x1 ≥ 2.5. In region R1 we have two blue
data points and one red, in total n1 = 3 data points. The proportion of the two classes in region R1
will therefore be πb1B = 2/3 and π b1R = 1/3. The entropy is calculated as
Q_1(T) = -\hat{\pi}_{1B} \log(\hat{\pi}_{1B}) - \hat{\pi}_{1R} \log(\hat{\pi}_{1R}) = -\frac{2}{3}\log\Big(\frac{2}{3}\Big) - \frac{1}{3}\log\Big(\frac{1}{3}\Big) = 0.64. \qquad (4.10)
In region R2 we have n2 = 7 data points with the proportions \hat{\pi}_{2B} = 3/7 and \hat{\pi}_{2R} = 4/7. The entropy for this region will be

Q_2(T) = -\hat{\pi}_{2B} \log(\hat{\pi}_{2B}) - \hat{\pi}_{2R} \log(\hat{\pi}_{2R}) = -\frac{3}{7}\log\Big(\frac{3}{7}\Big) - \frac{4}{7}\log\Big(\frac{4}{7}\Big) = 0.68 \qquad (4.11)
and the total weighted entropy for this split becomes n1 Q1 (T ) + n2 Q2 (T ) = 3 · 0.64 + 7 · 0.68 = 6.68.
We compute the cost for all other splits in the same manner, and summarize it in the table below.
From the table we can read that the two splits at x2 < 3.0 and x2 < 7.0 are both equally good. We
choose to continue with x2 < 3.0.
(Two plots of the input space, x1 ∈ [0, 10] and x2 ∈ [0, 8]: to the left, the partition after the first split together with all candidate second splits (dashed lines); to the right, the chosen second split, dividing the upper region into R2 and R3 while R1 is left unchanged.)
Second split: We notice that only R2 has more than five data points. Also there is no point
splitting region R1 further since it only contains data points from the same class. In the next
step we therefore split the second region into two new regions R2 and R3 . All possible splits are
displayed above to the left (dashed lines) and we compute their cost in the same manner as before.
The best split is the one at x1 < 5.0 visualized above to the right. The final tree and partition were
displayed in Example 4.2. None of the three regions has more than five data points. Therefore, we
terminate the training.
If we want to use the tree for prediction, we get p(Red|x? ) = \hat{\pi}_{1R} = 0 if x? falls into region R1 , p(Red|x? ) = \hat{\pi}_{2R} = 1/3 if it falls into region R2 , or p(Red|x? ) = \hat{\pi}_{3R} = 1 if it falls into region R3 , in the same manner as displayed in Example 4.2.
There are other splitting criteria that can be considered instead of the entropy (4.8). One simple alternative
is the misclassification rate
Q_m(T) = 1 - \max_k \hat{\pi}_{mk}, \qquad (4.13)
which is simply the proportion of data points in region Rm which do not belong to the most common class.
This sounds like a reasonable choice since it is often the misclassification rate that we use to evaluate the final classifier. However, one drawback with it is that it does not favor pure nodes to the same extent as the entropy criterion does. By pure nodes we mean nodes where most data points belong to a single class. It is usually an advantage to favor pure nodes in the greedy procedure that we use to grow the tree, since it can lead to fewer splits in total.
For example, consider the first split in Example 4.3. If we would use the misclassification rate as splitting criterion, both the split x2 < 5.0 and the split x2 < 3.0 would provide a total misclassification rate of 0.2. However, the split at x2 < 3.0, which the entropy criterion favored, provides a pure node R1 . If we would go with the split x2 < 5.0, the misclassification after the second split would still be 0.2. If we would continue to grow the tree until no data points are misclassified, we would need three splits if we used the entropy criterion, whereas we would need five splits if we used the misclassification criterion and started with the split at x2 < 5.0.
Another common splitting criterion is the Gini index

Q_m(T) = \sum_{k=1}^{K} \hat{\pi}_{mk} (1 - \hat{\pi}_{mk}). \qquad (4.14)
Similar to the entropy criterion, the Gini index favors node purity more than the misclassification rate does.
If we consider two classes, where r is the proportion in the second class, the three criteria are

Misclassification rate: Q_m(T) = 1 - \max(r, 1 - r),
Gini index: Q_m(T) = 2 r (1 - r),
Entropy: Q_m(T) = -r \log r - (1 - r) \log(1 - r).

These functions are shown in Figure 4.3. We can see that the entropy and Gini index are quite similar.
(Plot of the misclassification rate, the Gini index and the entropy as functions of r ∈ [0, 1].)
Figure 4.3: Three splitting criteria for classification trees as a function of the proportion in class 2. The entropy criterion has been scaled such that it passes through (0.5, 0.5).
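For concreteness, a small Python sketch (ours) of the three two-class criteria above; note that the entropy here is unscaled, whereas Figure 4.3 shows a scaled version.

```python
import numpy as np

def misclassification_rate(r):
    return 1 - np.maximum(r, 1 - r)

def gini_index(r):
    return 2 * r * (1 - r)

def entropy(r):
    r = np.clip(r, 1e-12, 1 - 1e-12)        # avoid log(0); the limit 0*log 0 is 0
    return -r * np.log(r) - (1 - r) * np.log(1 - r)

r = np.linspace(0, 1, 11)                    # proportion of data points in class 2
print(misclassification_rate(r))
print(gini_index(r))
print(entropy(r))
```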
This splitting criterion can also be motivated from a maximum likelihood point of view, as we did for the classification trees. At prediction time, the mean ybm of the training outputs in each region is used as the prediction. In all other aspects the procedure to train a regression tree is the same as training a classification tree as explained above.
5 How well does a method perform?
So far we have studied different methods of how to learn models by adapting them to training data. We
hope that the models thereby will give us good predictions also when faced with new, previously unseen,
data. But can we really expect that to work? This may first sound like a trivial question, but on second
thought it is perhaps not (so) obvious anymore, and we will give it the attention it deserves in this chapter.
By doing so, we will find some interesting concepts, which will give us practical tools for evaluating and
understanding supervised machine learning methods better.
The error function E(yb, y) (for example the misclassification indicator I{yb ≠ y} in classification, or the squared error in regression) has similarities to a loss function. However, they are used differently: a loss function is used to train (or learn) a model, whereas the error function is used to analyze the performance of an already trained model.
In the end, machine learning is mostly about how a method performs when faced with an endless stream of new, unseen data. Imagine for example all real-time recordings of street views that have to be processed by a vision system in a self-driving car once it is sold to a customer, or all new patients that have to be classified by a medical diagnosis system. The performance on fresh unseen data can in mathematical terms be understood as the average of the error function: how often the classifier is right, or how well the regression method predicts. To be able to mathematically describe the endless
stream of new data, we introduce a distribution over data p(x, y). In the previous chapters, we have
mostly considered the output y as a random variable whereas the inputs x have been thought of as fixed.
Now, we have to think of also the input x as a random variable with a certain probability distribution. In
any real-world machine learning scenario p(x, y) can be extremely complicated and really hard or even
impossible to model. That is, however, not a problem since we will only use p(x, y) to reason about
machine learning methods, and the bare notion of p(x, y) will be helpful for that.
Remark 5.1 In Section 3.4 about Bayes’ classifier, we made a hypothetical argument about the optimal
classifier, if we had access to p(y | x) (which we usually do not have). The arguments in this section are
made from an even more hypothetical point of view, assuming that we do know not only p(y | x), but also
p(x), since p(x, y) = p(y | x)p(x). While this is often an unrealistic assumption, the reasoning will lead
us to useful insights about when and how we can expect machine learning to work.
Irrespective of which classification or regression method we are considering, once the model has been
trained on training data T = {xi , yi }ni=1 , it will provide us with predictions yb? for any new input x? we
give to it. We will in this chapter write yb(x; T ) as a function of x and T , like yb? ≜ yb(x? ; T ), to indicate
that the prediction (via the model) depends both on the value of the test input x? and on the training data
T used to train the model.
In the previous chapters, we have mostly discussed how a model predicts one, or a few, test inputs x? .
Let us now take that to the next level, by integrating the error function (5.1) over all possible test data
points (rather than considering only one or a few) with respect to the distribution p(x, y). We refer to this
as the expected new data error
E_{\text{new}} \triangleq E_\star\big[E(\hat{y}(x_\star; T), y_\star)\big], \qquad (5.2)

where the expectation E? is the expectation over all possible test data points with respect to the distribution (x? , y? ) ∼ p(x, y), that is,

E_\star\big[E(\hat{y}(x_\star; T), y_\star)\big] = \int E(\hat{y}(x_\star; T), y_\star)\, p(x_\star, y_\star)\, dx_\star\, dy_\star. \qquad (5.3)
We emphasize that the model (no matter whether it is linear regression, a decision tree, a neural network
or something else) is trained on a given training data set T and represented by yb(·; T ). What is happening
in equation (5.2) is an averaging over possible test data points (x? , y? ). Thus, Enew describes how well the
model generalizes from the training data T to new situations.
The expected new data error Enew tells us how well a method performs when we put it into production;
what proportions of predictions a classifier will get right, and how well a regression method will predict in
terms of average squared error. Or, in a more applied setting, what rate of false and missed detections of
pedestrians we can expect a vision system in a self-driving car to make, or how big a proportion of all
future patients a medical diagnosis system will get wrong.
The overall goal in supervised machine learning is to achieve as small Enew as possible.
Unfortunately, in practical cases we can never compute Enew to assess how well we are doing. The reason
is that p(x, y)—which we do not know in practice—is part of the definition of Enew . It seems, however, to
be a too important construction to be abandoned, just because we cannot compute it. We will instead
spend the remaining parts of this chapter trying to estimate Enew (essentially by replacing the integral with
a sum), and also understanding how Enew behaves, to better understand how we can decrease it.
Remark 5.2 Note that Enew is a property of a trained model and a specific machine learning problem.
Thus, we cannot talk about “Enew for QDA” in general, but instead we have to make more specific
statements, like “Enew for QDA on handwritten digit recognition, when QDA is trained with the MNIST
data1 ”.
1
http://yann.lecun.com/exdb/mnist/
5.2 Estimating Enew
An estimate of Enew is useful, for example, for:
• judging if the performance is satisfactory (whether Enew is small enough), or if more work should be put into the solution and/or more training data should be collected
• choosing between different methods
• choosing hyperparameters (such as k in k-NN, the regularization parameter in ridge regression or the number of hidden layers in deep learning)
• reporting the expected performance to the customer
As discussed above, we can unfortunately not compute Enew in any practical situation. We will therefore
explore some possibilities to estimate Enew , which eventually will lead us to a very useful concept known
as cross-validation.
A first candidate for estimating Enew is the average error on the training data itself,

E_{\text{train}} \triangleq \frac{1}{n} \sum_{i=1}^{n} E(\hat{y}(x_i; T), y_i), \qquad (5.4)
where {xi , yi }ni=1 is the training data T . Etrain simply describes how well a method performs on the
training data on which it was trained. In contrast to Enew , we can always compute Etrain .
We usually assume that T consists of samples from p(x, y). This assumption means that the training
data is collected under similar circumstances as the ones the learned model will be used under, which
seems reasonable. (If it was not, we would have very little reasons to believe the training data would tell
us anything useful.) When an integral is hard to compute, it can be numerically approximated with a sum
(see details in Appendix A.2). Now, the question is if the integral in Enew can be well approximated by the
sum in Etrain , like
E_{\text{new}} = \int E(\hat{y}(x; T), y)\, p(x, y)\, dx\, dy \overset{??}{\approx} \frac{1}{n} \sum_{i=1}^{n} E(\hat{y}(x_i; T), y_i) = E_{\text{train}}. \qquad (5.5)
Or, put differently: Can we expect a method to perform equally well (or badly) when faced with new,
previously unseen, data, as it did on the training data?
The answer is, unfortunately, no.
Time to reflect 5.1: Why can we not expect the performance on training data (Etrain ) to be a good
approximation for how a method will perform on new, previously unseen data (Enew )?
Equation (5.5) does not hold, and the reason is that the training data are not just any data points, but yb
depends on them since they are used for training the model. We cannot therefore expect (5.5) to hold.
(Technically, the conditions in Appendix A.2 are not fulfilled, since yb depends on T .)
As we will discuss more thoroughly later in Section 5.3.1, the average behavior of Etrain and Enew is, in
fact, typically Etrain < Enew . That means that a method usually performs worse on new, unseen data, than
on training data. The performance on training data is therefore not a good measure of Enew .
Figure 5.1: The test data set approach: If we split the available data in two sets and train the model on the training
set, we can compute Etest using the test set. The more data that are in the test data set, the less variance (better
estimate) in Etest , but the less data left for training the model. The split here is only pictorial, in practice one should
always split the data randomly.
The problem with Etrain is that the very same data points are used both to train the model and to evaluate the error function (the sum in (5.4)). A remedy is to set aside some test data {x_j , y_j }_{j=1}^{m} , which are not used for training, and then use the test data only for estimating the model performance,

E_{\text{test}} \triangleq \frac{1}{m} \sum_{j=1}^{m} E(\hat{y}(x_j; T), y_j). \qquad (5.6)
In this way, not all data will be used for training, but some data points (the test data) will be saved and
used only for computing Etest . This is illustrated by Figure 5.1.
Be aware! If you are splitting your data into a training and test set, always do it randomly! Someone
might—intentionally or unintentionally—have sorted the data set for you. If you do not split randomly,
you might end up having only one class in your training data, and another class in your test data . . .
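As an illustration (not part of the original text), here is a minimal Python sketch of the test data set approach with a random split; the train routine is a placeholder for any of the methods in these notes, and the toy classifier below is ours.

```python
import numpy as np

def holdout_error(X, y, train, test_fraction=0.3, seed=0):
    """Randomly split the data, train on one part and compute E_test on the other."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))              # always split the data randomly!
    m = int(test_fraction * len(y))
    test_idx, train_idx = idx[:m], idx[m:]
    model = train(X[train_idx], y[train_idx])  # train only on the training part
    y_hat = model(X[test_idx])                 # predictions for the held-out test data
    return np.mean(y_hat != y[test_idx])       # E_test as in (5.6), misclassification error

# Toy usage with a (deliberately bad) classifier that always predicts the majority class.
def train_majority(X, y):
    values, counts = np.unique(y, return_counts=True)
    majority = values[np.argmax(counts)]
    return lambda X_new: np.full(len(X_new), majority)

X = np.random.default_rng(1).normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
print(holdout_error(X, y, train_majority))     # roughly 0.5 for this balanced toy problem
```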
As long as m ≥ 1, it can be shown that Etest is an unbiased estimate of Enew (meaning that if the entire
procedure is repeated multiple times, the average value of Etest would be Enew ). That is reassuring, but
it does not tell us how close Etest will be to Enew in a single experiment. However, the variance of Etest
decreases when the size of test data m increases; a small variance of Etest means that we can expect it to
be close to Enew . Thus, if we make the test data set big enough, Etest will be close to Enew . On the other hand, the amount of data available is usually limited in real machine learning problems, meaning that the more data points we put into the test set, the fewer data points are left for training. Typically, the more training
data, the smaller Enew (which we will discuss later in Section 5.3). Achieving a small Enew is our ultimate
goal. We are therefore faced with the following dilemma: the better we want to know Enew (more test data
gives less variance in Etest ), the worse we have to make it (less training data increases Enew ). That is not
very satisfying.
One could suggest the following two-step procedure, to circumvent the situation:
(i) Split the available data in one training and one test set, train the model on the training data and
compute Etest using test data (as in Figure 5.1).
(ii) Train the model again, this time using the entire data set.
By such a procedure, we get both a value of Etest and a model trained using the entire data set. That is not
bad, but not perfect either. Why? To achieve small variance in Etest , we have to put lots of data in the test
data set. That means the model trained in step (i) will quite possibly be very different from the model
trained in step (ii). And Etest from step (i) is an estimate of Enew for the model from step (i), not the model
in step (ii). Consequently, if we use Etest from step (i) to, e.g., select a hyperparameter in step (i), we can
not be sure it was a wise choice for the model trained in step (ii). This procedure is, however, not very far
from cross-validation that we will present next.
(Illustration: for ` = 1, . . . , c, batch ` is used as validation data and the remaining batches as training data, giving the errors E_{\text{val}}^{(1)}, . . . , E_{\text{val}}^{(c)}.)
Figure 5.2: Illustration of c-fold cross-validation. The data is split into c batches of similar sizes. When looping over ` = 1, 2, . . . , c, batch ` is held out as validation data, and the model is trained on the remaining c − 1 data batches. Each time, the trained model is used to compute the average error E_{\text{val}}^{(\ell)} for the validation data. The final model is trained using all available data, and the estimate of Enew for that model is E_{\text{val}}, the average of all E_{\text{val}}^{(\ell)}.
The idea of cross-validation is simply to repeat the test data set approach (using a small test data set)
multiple times with a different test data set each time, in the following way:
(i) split the data set into c batches of similar size (see Figure 5.2), and let ` = 1
(ii) take batch ` as validation data, and the remaining batches as training data
(iii) train the model on the training data, and compute E_{\text{val}}^{(\ell)} as the average error on the validation data (analogously to (5.6))
(iv) if ` < c, let ` ← ` + 1 and return to (ii); otherwise, compute the average

E_{\text{val}} \triangleq \frac{1}{c} \sum_{\ell=1}^{c} E_{\text{val}}^{(\ell)} \qquad (5.7)

(v) train the model again, this time using all available data points
More precisely, this procedure is known as c-fold cross-validation, and illustrated in Figure 5.2.
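A minimal Python sketch (ours, with illustrative names) of this c-fold cross-validation procedure; train stands for any training routine that returns a prediction function.

```python
import numpy as np

def cross_validation(X, y, train, c=10, seed=0):
    """c-fold cross-validation: returns (E_val, final model trained on all data)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                    # permute randomly before batching!
    batches = np.array_split(idx, c)                 # c batches of similar size
    errors = []
    for ell in range(c):
        val_idx = batches[ell]                       # batch ell is held out as validation data
        train_idx = np.concatenate([b for i, b in enumerate(batches) if i != ell])
        model = train(X[train_idx], y[train_idx])    # intermediate model
        errors.append(np.mean(model(X[val_idx]) != y[val_idx]))   # E_val^(ell)
    return np.mean(errors), train(X, y)              # E_val as in (5.7), and the final model
```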
With c-fold cross-validation, we get a model which is trained on all data, as well as an approximation of
Enew for that model, namely Eval . Whereas Etest (Section 5.2.2) was an unbiased estimate of Enew (at the cost of setting aside test data), Eval is only approximately unbiased.
turns out to often be a sufficiently good approximation, and is commonly used in practice. Let us try to
understand how c-fold cross-validation works.
First, we have to distinguish between the final model, which is trained on all data in step (v), and the
intermediate models which are trained on all except a 1/c fraction of the data in step (iii). The key in
c-fold cross-validation is that if c is large enough, the intermediate models are quite similar to the final
model (since they are trained on almost the same data set, only a fraction 1/c of the data is missing).
Furthermore, each intermediate E_{\text{val}}^{(\ell)} is by construction an unbiased estimate of Enew for its corresponding intermediate model. Since the intermediate and final models are similar, E_{\text{val}}^{(\ell)} is approximately also an unbiased estimate of Enew for the final model. Since the validation sets are small (only 1/c of all available data), the variance of E_{\text{val}}^{(\ell)} is high, but when averaging over multiple high-variance estimates E_{\text{val}}^{(1)}, E_{\text{val}}^{(2)}, . . . , E_{\text{val}}^{(c)}, the final estimate E_{\text{val}} (5.7) does not suffer from high variance.
We usually talk about training (or learning) as something that is done once. However, in c-fold
cross-validation the training is repeated c (or even c + 1) times. A common value for c is 10, but you
may of course try different values. For methods such as linear regression, the actual training (solving
the normal equations) is usually done within milliseconds on modern computers, and doing it an extra c
times is usually not really a problem. If one is working with computationally heavy methods, such as
certain deep neural networks, it is perhaps less appealing to increase the computational load by a factor of
c + 1. However, if one wants to get a good estimate of Enew , some version of cross-validation is most
often needed.
We now have a method for estimating Enew for a model trained on all available training data. A typical
use of cross-validation is to select different types of hyperparameters, such as k in k-NN or a regularization
parameter.
Be aware! For the same reason as with the test data approach, it is important to always split the data
randomly for cross-validation to work! A simple solution is to first randomly permute the entire data set,
and thereafter split it into batches.
5.3 Understanding Enew
To reason about the typical behavior, we consider the averages of Enew and Etrain over different training data sets, Ēnew ≜ ET [Enew ] and Ētrain ≜ ET [Etrain ]. Here, ET denotes the expected value when the training data set T = {xi , yi }ni=1 (of a fixed size n) is
drawn from p(x, y). Thus Ēnew is the average Enew if we would train the model multiple times on different
training data sets, and similarly for Ētrain . The point of introducing these, as it turns out, is that we can
say more about the average behavior of Enew and Etrain , than we can say about Enew and Etrain when the
model is trained on one specific training data set T . Even though we in practical problems very seldom
encounter Ēnew (the training data is usually fixed), the insights we gain from studying Ēnew are still useful.
Put in words, this means that on average, a method usually performs worse on new, unseen data than on training data. A method's ability to perform well on unseen data after being trained on training data can be understood as the method's ability to generalize from training data. The difference Enew − Etrain is accordingly called the generalization error2 .
The generalization error thereby gives a connection between the performance on training data and the
performance ‘in production’ on new, previously unseen data. It can therefore be interesting to understand
how big (or small) the generalization error is.
The size of the generalization error depends on the method and the problem. Concerning the method, one can typically say that the more the model has adapted to the training data, the larger the generalization error. A theoretical study of how much a model adapts to training data can be done using the so-called VC dimension, eventually leading to probabilistic bounds on the generalization error. Unfortunately those bounds are usually rather conservative, and we will not pursue that formal approach any further.3 Instead, we only use the vague term model complexity, by which we mean the ability of a method to adapt to complicated patterns in the training data, and reason about what we see in practice. A model with high
complexity (such as a neural network) can describe very complicated relationships, whereas a model with
low complexity (such as LDA) is less flexible in what functions it can describe. For parametric methods,
the model complexity is related to the number of parameters that are trained. Flexible non-parametric
methods (such as trees with many leaf nodes or k-NN with small k) have higher model complexity than
parametric methods with few parameters, etc. Techniques such as regularization, early stopping and
dropout (for neural networks) effectively decrease the model complexity.
2
Sometimes Enew is called generalization error; not in this text. In our terminology we do not distinguish between the
generalization error for a model trained on a certain training data set, and its training-data averaged counterpart.
3
If you are interested, a good book is Abu-Mostafa, Magdon-Ismail, and Lin 2012.
(Plot: Ētrain and Ēnew as functions of model complexity, with the underfit region to the left and the overfit region to the right; the gap between the two curves is the generalization error.)
Figure 5.3: Behavior of Ētrain and Ēnew for many supervised machine learning methods, as a function of model
complexity. We have not made a formal definition of complexity, but a rough proxy is the number of parameters that
are learned from the data. The difference between the two curves is the generalization error. In general, one can
expect Ētrain to decrease as the model complexity increases, whereas Ēnew typically has a U-shape. If the model
is so complex that Ēnew is larger than it had been with a less complex model, the term overfit is commonly used.
Somewhat less common is the term underfit used for the opposite situation. The level of model complexity which
gives the minimum Ēnew (at the dotted line) would in a consistent terminology perhaps be called a balanced fit. A
method with a balanced fit is usually desirable, but often hard to find since we know neither Ēnew nor Enew in practice.
Typically, higher model complexity implies a larger generalization error. Furthermore, Ētrain usually decreases as the model complexity increases, whereas Ēnew attains a minimum for some intermediate model complexity value: too small and too high model complexity both raise Ēnew . This is illustrated in Figure 5.3. The region where Ēnew is larger than its minimum due to too high model complexity is commonly referred to as overfit. The other region (where Ēnew is larger than its minimum due to too small model complexity) is sometimes referred to as underfit. In a consistent terminology, the point where Ēnew attains its minimum could be referred to as a balanced fit. Since the goal is to minimize Ēnew , we are interested in finding this point. We also illustrate this by Example 5.1.
Remark 5.3 This and the next section discuss the usual behavior of Ēnew , Ētrain and the generalization
error. We use the term ‘usually’ because there are so many supervised machine learning methods and
problems that it is almost impossible to make any claim that is always true for all possible situations.
Pathological counter-examples may exist. One should also keep in mind that claims about Ētrain and Ēnew
are about the average behavior, which hopefully is clear in Example 5.1.
(Example 5.1: scatter plot of the simulated data, two classes plotted in the x1 –x2 plane, both inputs roughly in the range [−1, 1].)
We generate n = 100 samples as training data, and learn three classifiers: a logistic regression
classifier, a QDA classifier and a k-NN classifier with k = 2. If we are to rank these methods
in model complexity order, logistic regression is simpler than QDA (logistic regression is a
linear classifier, whereas QDA is more general), and QDA is simpler than k-NN (since k-NN is
non-parametric and can have rather complicated decision boundaries). We plot their decision
boundaries, together with the training data:
(Three panels, showing the training data together with the decision boundary of each classifier: logistic regression, QDA, and k-NN with k = 2.)
For each of these three classifiers, we can compute Etrain by simply counting the fraction of
training data points that are on the wrong side of the decision boundary. From left to right, we
get Etrain = 0.17, 0.16, 0.11. Since we are in a simulated example, we can also access Enew
(or rather estimate it numerically by simulating a lot of test data), and from left to right we get
Enew = 0.22, 0.15, 0.24. This pattern resembles Figure 5.3, except for the fact that Enew is smaller
than Etrain for QDA. Is this unexpected? Not really, what we have discussed in the main text is the
average Ēnew and Ētrain , not the situation with Enew and Etrain for one particular set of training
data. We therefore repeat this experiment 100 times, and compute the average Ēnew and Ētrain over
those 100 experiments:
Logistic regression QDA k-NN with k = 2
Ētrain 0.17 0.14 0.10
Ēnew 0.18 0.15 0.19
This follows Figure 5.3 well: The generalization error (difference between Ēnew and Ētrain ) is
positive and increases with model complexity, Ētrain decreases with model complexity, and Ēnew
has its minimum for QDA. This suggests that k-NN with k = 2 suffers from overfitting for this
problem, whereas logistic regression is a case of underfitting.
(Two panels: error as a function of the number of training data points n, showing Ētrain and Ēnew for a simpler model (left) and a more complex model (right).)
Figure 5.4: Typical relationship between Ēnew , Ētrain and the number of data points n in the training data set. The
generalization error (difference between Ēnew and Ētrain ) decreases, at the same time as Ētrain increases. Typically, a
more complex model (right panel) will for large enough n attain a smaller Ēnew than a simpler model (left panel)
would on the same problem (the axes of the figures are comparable). However, the generalization error is typically
larger for a more complex model, in particular when there is little training data n.
The previous section and Figure 5.3 are concerned about the relationship between Ēnew , Ētrain , the
generalization error (their difference) and the model complexity. Yet another important aspect is the
size of the training data set, n. Intuitively, one may expect that the more training data, the better the
possibilities to learn how to generalize. Yet again, we do not make a formal derivation, but we can in
general expect that the more training data, the smaller the generalization error. On the other hand, Ētrain
typically increases as n increases, since most models are not able to fit all training data perfectly if there
are too many of them. A typical behavior of Ētrain and Ēnew is sketched in Figure 5.4.
We will now introduce another decomposition of Enew into the terms known as bias and variance (which
we can affect by our choice of method) as well as an unavoidable component of irreducible noise. This
decomposition is most natural in the regression setting, but the intuition carries over also to classification.
We first make the assumption that the true relationship between input and output can be described as some (possibly very complicated) function f (x) plus independent noise ε with zero mean and variance σ²,

y = f(x) + ε.

Since we have made no restriction on f , this is not a very restrictive assumption, and we can expect it to describe reality well.
In our notation, yb(x; T ) represents the model when it is trained on training data T . We now also
introduce the average trained model
g(x) \triangleq E_T[\hat{y}(x; T)]. \qquad (5.12)
As before, ET denotes the expected value over training data drawn from p(x, y). Thus, g(x) is the
(hypothetical) average model we would achieve, if we could marginalize all random effects associated
with the training data.
(Plot: the squared bias, the variance and the irreducible error as functions of model complexity, summing to Ēnew ; underfit region to the left, overfit region to the right.)
Figure 5.5: The bias-variance decomposition of Ēnew (cf. Figure 5.3). The bias typically decreases with model
complexity; the more complicated the model is, the less systematic errors in the predictions. The variance, on the
other hand, typically increases as the model complexity grows; the more complex the model is, the more it will adapt
to peculiarities that by chance happened to occur in the particular training data set that was used. The irreducible
error is always constant. In order to achieve a small Enew , one has to trade between bias and variance (for example by using another model, or by using regularization as in Example 5.2) to avoid over- and underfit.
We are now ready to rewrite Ēnew , the average expected new data error, as
\bar{E}_{\text{new}} = E_T\Big[E_\star\big[(\hat{y}(x_\star; T) - y_\star)^2\big]\Big] = E_\star\Big[E_T\big[(\hat{y}(x_\star; T) - f(x_\star) - \varepsilon)^2\big]\Big]
= E_\star\Big[E_T\big[\hat{y}(x_\star; T)^2\big] - 2\, E_T\big[\hat{y}(x_\star; T)\big] f(x_\star) + f(x_\star)^2 + \sigma^2\Big]
= E_\star\Big[\underbrace{E_T\big[\hat{y}(x_\star; T)^2\big] - g(x_\star)^2}_{E_T[(\hat{y}(x_\star; T) - g(x_\star))^2]} + \underbrace{g(x_\star)^2 - 2 g(x_\star) f(x_\star) + f(x_\star)^2}_{(g(x_\star) - f(x_\star))^2} + \sigma^2\Big]. \qquad (5.13)
The irreducible error is simply an effect of the assumed intrinsic stochasticity of the problem – it is not
possible to predict ε since it is truly random. We will hence leave the irreducible error as it is, and focus
on the bias and variance terms to further understand how Ēnew is affected by our choice of method; there
are interesting situations where one can decrease Ēnew by trading bias for variance or vice versa.
We continue with the (never properly defined) notion of model complexity. High model complexity means
that the model is able to express more complicated functions, implying that the bias term is small. On
the other hand, the more complex a model is, the more it will adapt to training data T —not only to the
interesting patterns, but also to the actual data points and noise that happened to be in that realization of
the training data. Had there been another realization of T , the trained model could have looked (very) different. Exactly such ‘sensitivity’ to training data is described by the variance term. In summary, if
everything else is fixed and only the model complexity increases, the variance also increases, but the bias
decreases. The optimal model complexity (smallest Enew ) is therefore usually ‘somewhere in the middle’,
where the model has a good trade-off between bias and variance. This is illustrated by Figure 5.5.
One should remember that the model complexity is not simply the number of parameters in the model,
but rather a measure of how much the model adapts to complicated patterns in the training data. We
introduced regularization in Section 2.6 as a method to counteract overfit, or effectively decreasing the
model complexity, without changing the number of parameters in the model. Regularization therefore
gives a tool for changing the model complexity in a continuous fashion, which opens up for fine-tuning of
this bias-variance trade-off. This is further explored in Example 5.2.
Example 5.2: Regularization—trading bias for variance
Let us consider a simulated regression example. We let p(x) and p(y | x) be defined as x ∼ U[0, 1] and

y = 5 − 2x + x^3 + ε,   ε ∼ N (0, 1).

We let the training data consist of only n = 10 samples. We now try to model the data using linear regression with a 4th order polynomial, as

y = β_0 + β_1 x + β_2 x^2 + β_3 x^3 + β_4 x^4 + ε,
where we assume ε to have a Gaussian distribution (which happens to be true in this example),
so that we will end up with the normal equations. Since the model contains the true model and
least squares would not introduce any systematic errors, the bias term (5.14) would be exactly
zero. However, learning 5 parameters from only 10 data points will almost inevitably lead to very
high variance and overfit, so we decide to train the model with a regularized method, namely
Ridge Regression. Using regularization means that we will trade the unbiasedness (regularization
introduces systematic bias in how the model is trained) for smaller variance. Two examples of what it could look like, for different regularization parameters, are shown:
(Two panels, for γ = 10 (left) and γ = 0.001 (right), each showing the n = 10 data points, the trained model and the true model.)
The dots are the n = 10 data points, the solid line is the trained model and the dashed line is
the true model. In the case with γ = 0.001, the plot suggests overfit, whereas γ = 10 seems to
be a case of underfit. It is clear how regularization affects the model complexity: with a small
regularization (in this case γ = 0.001), the model is prone to adapt to the noise in the training data.
The effect would be even more severe with no regularization (γ = 0). Heavier regularization (in
this case γ = 10) effectively prevents the model from adapting well to the training data (it pushes
the parameters, including β0 , towards 0).
Let us understand this in terms of bias and variance. In the low-regularized case, the trained
model (solid line) will look very different each time, depending on what x-values and noise happen
to be in the training data: a high variance. However, if one would repeat the experiment many
times with different training data, the average model will probably be relatively close to the true
model: a low bias. The completely opposite situation is found in the highly regularized case: the
variance is low (the model will be quite similar each time, no matter what realization of the training
data it is trained on), and the bias is high (the predictions from the model will systematically be
closer to zero than the true model).
Since we are in a simulated environment, we can repeat the experiment multiple times, and
thereby compute the bias and variance terms (or rather numerically estimate them, since we can
simulate as much training and test data as we want). We plot them in the very same style as
Figures 5.3 and 5.5 (note the reversed x-axis: a smaller regularization parameter corresponds to a
higher model complexity). For this problem, the optimal value of γ would have been about 0.7
(since Ēnew attains its minimum there).
(Plot: Ēnew , Ētrain , the squared bias, the variance and the irreducible error as functions of the regularization parameter, from γ = 10^1 down to γ = 10^{-3}.)
If this had been a real problem with a fixed data set, we could of course not have made this plot.
Instead, one would have to rely on cross-validation for estimating Enew for that particular data set
(and not its average Ēnew ).
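To make the simulation idea concrete, the following Python sketch (ours; the constants are illustrative and the setup mirrors, but is not copied from, the example above) estimates the squared bias and variance terms numerically for ridge regression with a 4th order polynomial.

```python
import numpy as np

def true_f(x):
    return 5 - 2 * x + x**3

def poly_features(x, order=4):
    return np.vander(x, order + 1, increasing=True)     # columns 1, x, x^2, x^3, x^4

def fit_ridge(x, y, gamma):
    Phi = poly_features(x)
    A = Phi.T @ Phi + gamma * np.eye(Phi.shape[1])       # regularized normal equations
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
for gamma in (10.0, 0.7, 0.001):
    preds = []
    for _ in range(1000):                                # many independent training data sets
        x = rng.uniform(0, 1, 10)
        y = true_f(x) + rng.normal(0, 1, 10)
        preds.append(poly_features(x_test) @ fit_ridge(x, y, gamma))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)   # squared bias term
    variance = np.mean(preds.var(axis=0))                         # variance term
    print(gamma, bias2, variance)       # heavier regularization: more bias, less variance
```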
The bias-variance trade-off and its relation to the size n of training data
In the first place, the bias term is a property of the model rather than of the training data set, and we may
think4 of the bias term as independent of the number of data points n in the training data. The variance
term, on the other hand, varies highly with n. As we know, Ēnew typically decreases as n increases, and
essentially the entire decline in Ēnew is because of the decline in the variance. Intuitively, the more data,
the more information about the parameters, meaning less variance. This is summarized by Figure 5.6.
(Plot: the squared bias, the variance and the irreducible error as functions of the number n of training data points.)
Figure 5.6: The typical relationship between bias, variance and the size n of the training data set (cf. Figure 5.4).
The bias is (approximately) constant, whereas the variance decreases as the size of the training data set increases.
4
Indeed, the average model g might be different if we are averaging over an infinite number of models each trained with n = 2
or n = 100 000 data points. That effect is, however, conceptually not very interesting here, and we will not treat it further.
6 Ensemble methods
In the previous chapters we have introduced some fundamental methods for machine learning. In this
chapter we will introduce techniques of a slightly different flavor, referred to as ensemble methods. These
methods are based on the idea of combining the predictions from many so called base models. They can
therefore be seen as a type of meta-algorithms, in the sense that they are methods composed of other
methods.
We start in Section 6.1 by introducing a general technique referred to as bootstrap aggregating, or
bagging for short. The idea behind bagging is to train multiple models of the same type in parallel, but on
slightly different “versions” of the training data. By averaging the predictions of the resulting ensemble of
models it is possible to reduce the variance compared to using only a single model. This idea is extended
in Section 6.2, resulting in a powerful off-the-shelf method called random forests. Random forests make
use of classification or regression trees as base models. Each tree is randomly perturbed in a certain way
which opens up for additional variance reduction. Finally, in Section 6.3 we derive an alternative ensemble
method known as boosting. Boosting is different from bagging and random forests, since its base models
are learned sequentially, one after the other, so that each model tries to correct for the mistakes made by
the previous ones. By taking a weighted average of the predictions made by the base models, it is possible
to turn the ensemble of “weak” models into one “strong” model.
6.1 Bagging
As discussed in the previous chapter, a central concept in machine learning is the bias–variance trade-off.
Roughly speaking, the more flexible a model is, the lower its bias will be. That is, a flexible model
is capable of representing complicated input–output relationships. This is of course beneficial if the
actual relationship between inputs and outputs is complicated, as is often the case in machine learning
applications. We have seen examples of highly flexible models, not least in Chapter 4 when we discussed
non-parametric models. For instance, k-NN with a small value of k, or a classification tree that is grown
deep, are capable of expressing very complicated decision boundaries. However, as previously discussed
there is a price to pay when using a very flexible model: fitting it based on a finite number of training data
points can result in over-fitting1 , or equivalently, high model variance.
Motivated by this issue, we can ask the following question:
Given access to a low-bias, high-variance method, is there a way to reduce the variance while
keeping the bias low?
In this section we will introduce a general technique for accomplishing this, referred to as bootstrap
aggregating, or bagging.
Consider B identically distributed random variables z1 , . . . , zB , each with mean µ and variance σ 2 , and with pairwise correlation ρ between any two of them. By computing the mean and the variance of the average of these variables we get

E\Big[\frac{1}{B} \sum_{b=1}^{B} z_b\Big] = \mu, \qquad (6.1a)

\text{Var}\Big[\frac{1}{B} \sum_{b=1}^{B} z_b\Big] = \frac{1-\rho}{B}\,\sigma^2 + \rho\,\sigma^2. \qquad (6.1b)
The first equation tells us that the mean is unaltered by averaging a number of identically distributed
random variables. Furthermore, the second equation tells us that the variance is reduced by averaging, as
long as the correlation ρ < 1. Indeed, the first term in the variance expression can be made arbitrarily
small by increasing B. We also note that the smaller ρ is, the larger the possible reduction in variance.
To see how this relates to the question posed above, we note that the bias of a prediction model is
tightly connected to its mean value. Consequently, by averaging the predictions from several identically
distributed models, each with a low bias, the bias remains low (cf. (6.1a)) and the variance is reduced (cf.
(6.1b)). How then can we construct such a collection—or ensemble—of prediction models? To answer
this question we will start with an unpractical assumption, which will later be relaxed, namely that we have
access to B independent training data sets3 , each of size n. Let these data sets be denoted by T 1 , . . . , T B .
We can then train a separate low-bias model, such as a deep classification or regression tree, on each
separate data set.
For concreteness we consider the regression setting. Each of the B regression models can be used
to predict the output for some test input x? . Since (by our unpractical assumption) the B training data
sets are independent, this will result in B independent and identically distributed predictions yb?1 , . . . , yb?B .
Taking the average of these predictions we get,
\hat{y}_\star^{\text{avg}} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_\star^b. \qquad (6.2)
In the classification setting we would instead use a majority vote among the B ensemble members.
The average prediction in (6.2) has the same bias as each of the individual yb?b ’s, but its variance is
reduced by a factor B. Indeed, in this case we have ρ = 0 in the expression (6.1b) since the individual
data sets were assumed to be independent. However, as mentioned above this is not a realistic assumption,
so in order to use this variance reduction technique in practice we need to do something differently. To
this end we will make use of a trick known as data bootstrapping.
The bootstrapping method is stated in algorithm 5 and illustrated in Example 6.1 below. Note that
the sampling is done with replacement, meaning that the resulting bootstrapped data set may contain
multiple copies of some of the original training data points, whereas other data points are not included in
the bootstrap sample.
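A minimal Python sketch (ours, not the algorithm box from the notes) of this bootstrap step, i.e. sampling n data points with replacement from the original training data:

```python
import numpy as np

def bootstrap(X, y, rng):
    """Draw a bootstrapped data set: n indices sampled uniformly with replacement."""
    n = len(y)
    idx = rng.integers(0, n, size=n)     # duplicates allowed; some points are left out
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)         # ten toy data points
y = np.arange(10)
X_boot, y_boot = bootstrap(X, y, rng)
print(np.sort(y_boot))                   # some indices appear several times, others not at all
```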
Consider the same training data set as used in Example 4.2, consisting of n = 10 data points with
a two-dimensional input x = (x1 , x2 ) and a binary output y ∈ {Blue, Red}. The data set is shown
below.
(Table and plot of the n = 10 data points, with index, inputs x1 , x2 and output y; the first data point is x1 = 9.0, x2 = 2.0, y = Blue.)
Running algorithm 5, we sample n = 10 indices with replacement; suppose that the first two sampled indices are 2 and 10. Thus, (x̃1 , ỹ1 ) = (x2 , y2 ), (x̃2 , ỹ2 ) = (x10 , y10 ), etc. We end up with the following data set, where the numbers in parentheses indicate that there are multiple copies of some of the original data points in the bootstrapped data.

Bootstrapped data, T̃ = \{\tilde{x}_i , \tilde{y}_i\}_{i=1}^{10}
69
6 Ensemble methods
10
Index e1
x e2
x ye
2 1.0 4.0 Blue 8
x2
9 9.0 8.0 Red 4
(2)
2 1.0 4.0 Blue
5 1.0 2.0 Blue (2)
10 9.0 6.0 Red
2
By running algorithm 5 repeatedly B times we obtain B identically distributed bootstrapped data sets $\tilde{\mathcal{T}}^1, \ldots, \tilde{\mathcal{T}}^B$. We can then use the bootstrapped data sets to train B low-bias regression (or classification) models and average their predictions, $\tilde{y}_\star^1, \ldots, \tilde{y}_\star^B$, analogously to (6.2):

\hat{y}_\star^{\text{bag}} = \frac{1}{B}\sum_{b=1}^{B} \tilde{y}_\star^{b}. \qquad (6.3)
The averaged predictions are in this case not independent since the bootstrapped data sets all come from
the same original data $\mathcal{T}$. However, the hope is that the predictions are sufficiently uncorrelated (ρ is small
enough) to result in a meaningful variance reduction. Experience has shown that this is indeed often the
case. We illustrate the bagging method in Example 6.2.
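As a sketch of how the bagging procedure can be implemented, the snippet below trains B regression trees on bootstrapped data sets and averages their predictions as in (6.3). It uses scikit-learn's DecisionTreeRegressor as the low-bias base model (any other low-bias model could be used); the function names are our own.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def train_bagging(X, y, B=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        ensemble = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)              # bootstrap indices, with replacement
            tree = DecisionTreeRegressor(min_samples_leaf=3)
            tree.fit(X[idx], y[idx])                      # train a deep (low-bias) tree on the bootstrapped data
            ensemble.append(tree)
        return ensemble

    def predict_bagging(ensemble, X_star):
        # average the B predictions, cf. (6.3)
        return np.mean([tree.predict(X_star) for tree in ensemble], axis=0)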
Example 6.2: Bagging

Consider a toy regression problem, where the 5-dimensional input is multivariate Gaussian, $x \sim \mathcal{N}(0, \Sigma)$, with

\Sigma = \begin{bmatrix} 1 & 0.98 & 0.98 & 0.98 & 0.98 \\ 0.98 & 1 & 0.98 & 0.98 & 0.98 \\ 0.98 & 0.98 & 1 & 0.98 & 0.98 \\ 0.98 & 0.98 & 0.98 & 1 & 0.98 \\ 0.98 & 0.98 & 0.98 & 0.98 & 1 \end{bmatrix}

and the output is $y = x_1^2 + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 1)$. We observe a training data set consisting of n = 30 inputs and outputs. Due to the strong correlations between the input variables^a, they are all informative about the value of the output, despite the fact that the output only depends directly on the first input variable, $x_1$.
We start by training a single regression tree, branching until there are at most 3 observations per leaf node, resulting in the model shown below (the values in the leaf nodes are the constant predictions within each region).
To apply the bagging method, we need to generate bootstrapped data sets. Running algorithm 5
once, we obtain a set of sampled indices, shown by the histogram below. Note that some data points
are sampled multiple times, whereas other data points are not included at all in the bootstrapped
data.
[Histogram of the sampled indices 1–30: the frequency of each index in the bootstrapped data set.]
Training another regression tree on the bootstrapped data gives a slightly different model:
This procedure can then be repeated to generate an arbitrary number of bootstrap data sets and
regression models, each one possibly different from the ones before due to the randomness in the
bootstrap procedureb . Below we show the first 12 models that we obtain.
[Figure: the first 12 regression trees obtained from the first 12 bootstrapped data sets, each one slightly different due to the randomness of the bootstrap.]
To test the performance of the bagging method for this example, we simulate a large test data set
consisting of ntest = 2000 samples from the true data distribution (which is known since we are
working with a toy model). We can then compute the root mean squared error (RMSE) for varying
number of ensemble members B. The results are shown below:
[Plot: test error (RMSE) versus the number of ensemble members B, for a single tree and for bagging.]
Two things are worth noting in the figure above: First, the bagging method does improve the RMSE
over a single tree as a result of variance reduction (here, we get roughly an 18 % reduction in
RMSE, but this is of course problem dependent). Second, the performance of the bagging method
improves as we increase the number of ensemble members, but saturates at some point beyond
which no further improvement (nor degradation) is noticeable.
^a Recall that the values on the diagonal of the covariance matrix are the marginal variances of the elements $x_i$, and the off-diagonal values are covariances. Specifically, since all marginal variances in this example are 1, any pair of variables $x_i$ and $x_j$ for $i \neq j$ has a correlation of 0.98.
^b With n training data points there are in total $n^n/n!$ unique bootstrap data sets.
At first glance, one might think that a bagging model (6.3) becomes more "complex" as the number of ensemble members B increases, and that we therefore run a risk of overfitting for large B. However, from the example above this does not appear to be the case. Indeed, the test RMSE plateaus at some value and does not start to increase even for large values of B. This is in fact the expected (and intended) behavior. Using more ensemble members does not make the model more flexible, but on the contrary,
reduces its overfitting (or equivalently, its variance). This can be seen by considering the limiting behavior
as B → ∞, in which case the bagging model becomes
\hat{y}_\star^{\text{bag}} = \frac{1}{B}\sum_{b=1}^{B} \tilde{y}_\star^{b} \;\xrightarrow{B\to\infty}\; \mathbb{E}\big[\tilde{y}_\star^{b} \mid \mathcal{T}\big], \qquad (6.4)
where the expectation is w.r.t. the randomness of the bootstrapping algorithm. Since the right hand side of
this expression does not depend on B it can be thought of as a regression model with “limited flexibility”,
and the flexibility of the bagging model will therefore not increase as B becomes large. With this in mind,
in practice the choice of B is mainly guided by computational constraints. The larger B the better, but
increasing B when there is no further reduction in test error is computationally wasteful.
6.2 Random forests

Random forests improve on bagging by also decorrelating the ensemble members: when growing each tree, only a random subset of m out of the p input variables are considered as splitting candidates at each split. Common rules of thumb are m ≈ √p for classification problems and m ≈ p/3 for regression problems (values rounded down to closest integer). A more systematic way of selecting m, however, is by
some type of cross-validation. We illustrate the random forest regression model in the example below.
Example 6.3: Random forests
We continue Example 6.2 and apply random forests to the same data. The test root mean squared
errors for different number of ensemble members are shown in the figure below.
[Plot: test error (RMSE) versus the number of ensemble members B for the random forest.]
In this example, the decorrelation effect of the random input selection used in random forests results in an additional 13 % drop in RMSE compared to bagging.
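In practice one rarely implements random forests from scratch. For instance, scikit-learn's RandomForestRegressor exposes the number of randomly selected split candidates through its max_features argument; the snippet below is only a sketch with hypothetical variable names (X_train, y_train, X_test).

    from sklearn.ensemble import RandomForestRegressor

    # max_features controls m, the number of input variables considered at each split;
    # the value 1/3 roughly corresponds to the m = p/3 rule of thumb for regression.
    forest = RandomForestRegressor(n_estimators=100, max_features=1/3, min_samples_leaf=3)
    # forest.fit(X_train, y_train)
    # y_hat = forest.predict(X_test)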
6.3 Boosting
Boosting is built on the idea that even a simple (or weak) high-bias regression or classification model
often can capture some of the relationship between the inputs and the output. Thus, by training multiple
weak models, each describing part of the input-output relationship, it might be possible to combine the
predictions of these models into an overall better prediction. Hence, the intention is to turn an ensemble of
weak models into one strong model.
Boosting shares some similarities with bagging, as discussed above. Both are ensemble methods, in
the sense that they are based on combining the predictions from multiple models (an “ensemble”). Both
bagging and boosting can also be viewed as meta-algorithms, in the sense that they can be used to combine
essentially any regression or classification algorithm—they are algorithms built on top of other algorithms.
However, there are also important differences between boosting and bagging which we will discuss below.
The first key difference is in the construction of the ensemble. In bagging we construct B models in
parallel. These models are random (based on randomly bootstrapped datasets) but they are identically
distributed. Consequently, there is no reason to trust one model more than another, and the final prediction
of the bagged model is based on a plain average or majority vote of the individual predictions of the
ensemble members.
Boosting, on the other hand, uses a sequential construction of the ensemble members. Informally, this
is done in such a way that each model tries to correct the mistakes made by the previous one. This is
accomplished by modifying the training data set at each iteration in order to put more emphasis on the
data points for which the model (so far) has performed poorly. In the subsequent sections we will see
exactly how this is done for a specific boosting algorithm known as AdaBoost (Freund and Schapire 1996).
First, however, we consider a simple example to illustrate the idea.
We consider a toy binary classification problem with two input variables, x1 and x2 . The training
data consists of n = 10 data points, 5 from each of the two classes. We use a decision stump, a
classification tree of depth one, as a simple (weak) base classifier. A decision stump means that we
select one of the input variables, x1 or x2 , and split the input space into two half spaces, in order to
minimize the training error. This results in a decision boundary that is perpendicular to one of the
axes. The left panel below shows the training data, illustrated by red crosses and blue dots for the
two classes, respectively. The colored regions show the decision boundary for a decision stump $\hat{y}^1(x)$ trained on this data.
[Figure 6.1 panels: $\hat{y}^1(x)$ (iteration b = 1), $\hat{y}^2(x)$ (iteration b = 2), $\hat{y}^3(x)$ (iteration b = 3), and the boosted model $\hat{y}_{\text{boost}}(x)$; axes $x_1$ and $x_2$.]
Figure 6.1: Three iterations of AdaBoost for the toy problem. Training data points for the two classes are shown by red crosses and blue dots, respectively, scaled according to the weights. The colored regions show the three base classifiers (decision stumps), $\hat{y}^1$, $\hat{y}^2$, and $\hat{y}^3$.
The model $\hat{y}^1(x)$ misclassifies three data points (red crosses falling in the blue region), which are encircled in the figure. To improve the performance of the classifier we want to find a model that can distinguish these three points from the blue class. To this end, we train another decision stump, $\hat{y}^2(x)$, on the same data. To put emphasis on the three misclassified points, however, we assign weights $\{w_i^2\}_{i=1}^{n}$ to the data. Points correctly classified by $\hat{y}^1(x)$ are down-weighted, whereas the three points misclassified by $\hat{y}^1(x)$ are up-weighted. This is illustrated in the second panel of Figure 6.1, where the marker sizes have been scaled according to the weights. The classifier $\hat{y}^2(x)$ is then found by minimizing the weighted misclassification error, $\frac{1}{n}\sum_{i=1}^{n} w_i^2 \mathbb{I}\{y_i \neq \hat{y}^2(x_i)\}$, resulting in the decision boundary shown in the second panel. This procedure is repeated for a third and final iteration: we update the weights based on the hits and misses of $\hat{y}^2(x)$ and train a third decision stump $\hat{y}^3(x)$ shown in the third panel. The final classifier, $\hat{y}_{\text{boost}}(x)$, is then taken as a combination of the three decision stumps. Its (nonlinear) decision boundaries are shown in the right panel.
6.3.2 Binary classification, margins, and exponential loss

Throughout this section we consider binary classification with the labels encoded as y ∈ {−1, +1}, and classifiers of the form $\hat{y}(x) = \text{sign}\{C(x)\}$ for some real-valued function C(x). For a data point (x, y) we define the margin of the classifier as

y \cdot C(x). \qquad (6.5)
It follows that if y and C(x) have the same sign, that is if the classification is correct, then the margin
is positive. Analogously, for an incorrect classification y and C(x) will have different signs and the
margin is negative. More specifically, since y is either −1 or +1, the margin is simply |C(x)| for correct
classifications and −|C(x)| for incorrect classifications. The margin can thus be viewed as a measure of
certainty in a prediction, where values with small margin in some sense (not necessarily Euclidean!) are
close to the decision boundary. The margin plays a similar role for binary classification as the residual
y − f (x) does for regression.
Loss functions for classification can be defined in terms of the margin, by assigning small loss to positive
margins and large loss to negative margins. One such loss function is the exponential loss, defined as

L(y, C(x)) = \exp\{-y \cdot C(x)\}. \qquad (6.6)

Figure 6.2 illustrates the exponential loss and compares it against the misclassification loss, which is simply $\mathbb{I}\{y \cdot C(x) < 0\}$.

[Figure 6.2: the exponential loss and the misclassification loss as functions of the margin y · C(x).]
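The margin and the two loss functions are straightforward to evaluate numerically; the small NumPy helper functions below (with our own naming) can be used to reproduce the shape of the curves in Figure 6.2.

    import numpy as np

    def margin(y, c):
        return y * c                                  # y in {-1, +1}, c = C(x)

    def exponential_loss(y, c):
        return np.exp(-margin(y, c))                  # cf. (6.6)

    def misclassification_loss(y, c):
        return np.where(margin(y, c) < 0, 1.0, 0.0)   # I{y * C(x) < 0}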
Remark 6.1 The misclassification loss is often used to evaluate the performance of a classifier (in particular if interest only lies in the number of correct and incorrect classifications). However, it is typically not suitable to use directly during training of the model. The reason is that it is discontinuous, which is problematic in a numerical minimization of the training loss. The exponential loss function, on the other
hand, is both convex and (infinitely many times) differentiable. These are nice properties to have when
optimizing the training loss. In fact, the exponential loss can in this way be seen as a convenient proxy for
misclassification loss during training. Other loss functions of interest are discussed in Section 6.3.5.
6.3.3 AdaBoost
We are now ready to derive a practical boosting method, the AdaBoost (Adaptive Boosting) algorithm
proposed by Freund and Schapire (1996). AdaBoost was the first successful practical implementation of
the boosting idea and led the way for its popularity. Freund and Schapire were awarded the prestigious
Gödel Prize in 2003 for their algorithm.
Recall from the discussion above that boosting attempts to construct a sequence of B (weak) classifiers
yb1 (x), yb2 (x), . . . , ybB (x). Any classification model can in principle be used to construct these so
called base classifiers—shallow classification trees are common in practice (see Section 6.3.4 for further
discussion). The individual predictions of the B ensemble members are then combined into a final
prediction. However, all ensemble members are not treated equally. Instead, we assign some positive coefficients $\{\alpha^b\}_{b=1}^{B}$ and construct the boosted classifier using a weighted majority vote according to

\hat{y}^{B}_{\text{boost}}(x) = \text{sign}\Big\{\sum_{b=1}^{B} \alpha^b \hat{y}^b(x)\Big\}. \qquad (6.7)
Note that each ensemble member votes either −1 or +1. The output from the boosted classifier is +1 if
the weighted sum of the individual votes is positive and −1 if it is negative.
How, then, do we learn the individual ensemble members and their coefficients? The answer to this
question depends on the specific boosting method that is used. For AdaBoost this is done by greedily
minimizing the exponential loss of the boosted classifier at each iteration. Note that we can write the boosted classifier after b iterations as $\hat{y}^{b}_{\text{boost}}(x) = \text{sign}\{C^b(x)\}$ where $C^b(x) = \sum_{j=1}^{b} \alpha^j \hat{y}^j(x)$. Furthermore, we can express the function $C^b(x)$ sequentially as

C^b(x) = C^{b-1}(x) + \alpha^b \hat{y}^b(x), \qquad (6.8)

initialized with $C^0(x) \equiv 0$. Since we construct the ensemble members sequentially, when at iteration b of
the procedure the function $C^{b-1}(x)$ is known and fixed. Thus, what remains to be learned at iteration b is the coefficient $\alpha^b$ and the ensemble member $\hat{y}^b(x)$. This is done by minimizing the exponential loss of the training data,

(\alpha^b, \hat{y}^b) = \arg\min_{(\alpha,\hat{y})} \sum_{i=1}^{n} L\big(y_i, C^b(x_i)\big) \qquad (6.9a)
\phantom{(\alpha^b, \hat{y}^b)} = \arg\min_{(\alpha,\hat{y})} \sum_{i=1}^{n} \exp\big\{-y_i \big(C^{b-1}(x_i) + \alpha \hat{y}(x_i)\big)\big\} \qquad (6.9b)
\phantom{(\alpha^b, \hat{y}^b)} = \arg\min_{(\alpha,\hat{y})} \sum_{i=1}^{n} w_i^b \exp\big(-y_i \alpha \hat{y}(x_i)\big) \qquad (6.9c)
where for the first equality we have used the definition of the exponential loss function (6.6) and the sequential structure of the boosted classifier (6.8). For the second equality we have defined the quantities

w_i^b \overset{\text{def}}{=} \exp\big\{-y_i C^{b-1}(x_i)\big\}, \quad i = 1, \ldots, n, \qquad (6.10)

which can be interpreted as weights for the individual data points in the training data set. Note that these weights are independent of $\alpha$ and $\hat{y}$ and can thus be viewed as constants when solving the optimization problem (6.9c) at iteration b of the boosting procedure (the fact that we keep previous coefficients and ensemble members fixed is what makes the optimization "greedy").
To solve (6.9) we start by writing the objective function as

\sum_{i=1}^{n} w_i^b \exp\big(-y_i \alpha \hat{y}(x_i)\big) = e^{-\alpha} \underbrace{\sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i = \hat{y}(x_i)\}}_{=W_c} + e^{\alpha} \underbrace{\sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i \neq \hat{y}(x_i)\}}_{=W_e}, \qquad (6.11)
where we have used the indicator function to split the sum into two sums: the first ranging over all training data points correctly classified by $\hat{y}$ and the second ranging over all points erroneously classified by $\hat{y}$. For notational simplicity we have defined $W_c$ and $W_e$ as the sums of weights of correctly classified and erroneously classified data points, respectively. Furthermore, let $W = W_c + W_e$ be the total weight sum, $W = \sum_{i=1}^{n} w_i^b$.
Minimizing (6.11) is done in two stages, first w.r.t. $\hat{y}$ and then w.r.t. $\alpha$. This is possible since the minimizing argument in $\hat{y}$ turns out to be independent of the actual value of $\alpha > 0$. To see this, note that we can write the objective function (6.11) as

e^{-\alpha} W_c + e^{\alpha} W_e = e^{-\alpha}(W - W_e) + e^{\alpha} W_e = e^{-\alpha} W + \big(e^{\alpha} - e^{-\alpha}\big) W_e. \qquad (6.12)

Since the total weight sum $W$ is independent of $\hat{y}$ and since $e^{\alpha} - e^{-\alpha} > 0$ for any $\alpha > 0$, minimizing this expression w.r.t. $\hat{y}$ is equivalent to minimizing $W_e$ w.r.t. $\hat{y}$. That is,
\hat{y}^b = \arg\min_{\hat{y}} \sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i \neq \hat{y}(x_i)\}. \qquad (6.13)
In words, the bth ensemble member should be trained by minimizing the weighted misclassification loss,
where each data point (xi , yi ) is assigned a weight wib . The intuition for these weights is that, at iteration b,
we should focus our attention on the data points previously misclassified in order to “correct the mistakes”
made by the ensemble of the first b − 1 classifiers.
How the problem (6.13) is solved in practice depends on the choice of base classifier that we use, i.e. on the specific restrictions that we put on the function $\hat{y}$ (for example a shallow classification tree). However,
since (6.13) is essentially a standard classification objective it can be solved, at least approximately, by
standard learning algorithms. Incorporating the weights in the objective function is straightforward for
most base classifiers, since it simply boils down to weighting the individual terms of the loss function
used when training the base classifier.
When the bth ensemble member, $\hat{y}^b(x)$, has been learned by solving (6.13) it remains to compute its coefficient $\alpha^b$. Recall that this is done by minimizing the objective function (6.12). Differentiating this expression w.r.t. $\alpha$ and setting the derivative to zero we get the equation

-e^{-\alpha} W + \big(e^{\alpha} + e^{-\alpha}\big) W_e = 0
\iff W = \big(e^{2\alpha} + 1\big) W_e \qquad (6.14)
\iff \alpha = \frac{1}{2}\log\Big(\frac{W}{W_e} - 1\Big).
Thus, by defining

E_{\text{train}}^{b} \overset{\text{def}}{=} \frac{W_e}{W} = \sum_{i=1}^{n} \frac{w_i^b}{\sum_{j=1}^{n} w_j^b}\, \mathbb{I}\{y_i \neq \hat{y}^b(x_i)\} \qquad (6.15)

to be the weighted misclassification error for the bth classifier, we can express the optimal value for its coefficient as

\alpha^b = \frac{1}{2}\log\Big(\frac{1 - E_{\text{train}}^{b}}{E_{\text{train}}^{b}}\Big). \qquad (6.16)
This completes the derivation of the AdaBoost algorithm, which is summarized in algorithm 6. In the
algorithm we exploit the fact that the weights (6.10) can be computed recursively by using the expression
(6.8); see line 2. Furthermore, we have added an explicit weight normalization (line 2) which is convenient
in practice and which does not affect the derivation of the method above.
Remark 6.2 One detail worth commenting is that the derivation of the AdaBoost procedure assumes that all coefficients $\{\alpha^b\}_{b=1}^{B}$ are positive. To see that this is indeed the case when the coefficients are computed according to (6.16), note that the function $\log((1-x)/x)$ is positive for any $0 < x < 0.5$. Thus, $\alpha^b$ will be positive as long as the weighted training error $E_{\text{train}}^{b}$ for the bth classifier is less than 0.5. That is, the classifier just has to be slightly better than a coin flip, which is essentially always the case in practice.
Algorithm 6: AdaBoost
1. Assign weights $w_i^1 = 1/n$ to all data points.
2. For b = 1 to B
(a) Train a weak classifier $\hat{y}^b(x)$ on the weighted training data $\{(x_i, y_i, w_i^b)\}_{i=1}^{n}$.
(b) Compute the weighted training error $E_{\text{train}}^{b} = \sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i \neq \hat{y}^b(x_i)\}$, cf. (6.15).
(c) Compute the coefficient $\alpha^b = \frac{1}{2}\log\big((1 - E_{\text{train}}^{b})/E_{\text{train}}^{b}\big)$, cf. (6.16).
(d) Update and normalize the weights, $w_i^{b+1} \propto w_i^b \exp\big(-\alpha^b y_i \hat{y}^b(x_i)\big)$, such that $\sum_{i=1}^{n} w_i^{b+1} = 1$.
3. Output $\hat{y}^{B}_{\text{boost}}(x) = \text{sign}\big\{\sum_{b=1}^{B} \alpha^b \hat{y}^b(x)\big\}$.
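A minimal Python sketch of algorithm 6, using depth-one classification trees (decision stumps) from scikit-learn as weak classifiers, could look as follows. It assumes labels encoded as y ∈ {−1, +1}; the function names are our own.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_train(X, y, B=50):
        n = len(y)
        w = np.full(n, 1.0 / n)                        # step 1: uniform initial weights
        stumps, alphas = [], []
        for _ in range(B):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)           # step 2(a): weighted weak classifier
            pred = stump.predict(X)
            err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # step 2(b), cf. (6.15)
            alpha = 0.5 * np.log((1 - err) / err)      # step 2(c), cf. (6.16)
            w = w * np.exp(-alpha * y * pred)          # step 2(d): update ...
            w = w / np.sum(w)                          # ... and normalize the weights
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X_star):
        votes = sum(a * s.predict(X_star) for a, s in zip(alphas, stumps))
        return np.sign(votes)                          # weighted majority vote, cf. (6.7)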
In the method discussed above we have assumed that each base classifier outputs a discrete class
prediction, ybb (x) ∈ {−1, 1}. However, many classification models used in practice are in fact based on
estimating the conditional class probabilities p(y | x) as discussed in Section 3.4.1. Hence, it is possible to
instead let each base model output a real number and use these numbers as the basis for the “vote”. This
extension of algorithm 6 is discussed by Friedman, Hastie, and Tibshirani (2000) and is referred to as Real
AdaBoost.
[Figure 6.3: the exponential loss, hinge loss, binomial deviance, Huber-like loss, and misclassification loss as functions of the margin y · C(x).]
In contrast to bagging, using too many base models can result in overfitting. It has been observed in practice, however, that this overfitting
often occurs slowly and the performance tends to be rather insensitive to the choice of B. Nevertheless, it
is a good practice to select B in some systematic way, for instance using so called early stopping (this is
common also in the context of training neural networks; see Section 7.4.4).
We provide pseudo-code for one instance of a gradient boosting method in algorithm 7. As can be seen
from the algorithm, the key step involves fitting a base model to the negative gradient of the loss function.
This can be understood via the intuitive interpretation of boosting, that each base model should try to
correct the mistakes made by the ensemble thus far. The negative gradient of the loss function gives an
indication of in which “direction” the model should be updated in order to reduce the loss.
Algorithm 7: Gradient boosting
1. Initialize $C^0(x) \equiv 0$.
2. For b = 1 to B
(a) Compute the negative gradient of the loss function,
$g_i^b = -\dfrac{\partial L(y_i, c)}{\partial c}\Big|_{c = C^{b-1}(x_i)}, \quad i = 1, \ldots, n.$
(b) Train a base regression model $\hat{f}^b(x)$ to fit the gradient values,
$\hat{f}^b = \arg\min_{f} \sum_{i=1}^{n} \big(f(x_i) - g_i^b\big)^2.$
(c) Update the ensemble, $C^b(x) = C^{b-1}(x) + \gamma \hat{f}^b(x)$.
3. Output $\hat{y}^{B}_{\text{boost}}(x) = \text{sign}\{C^B(x)\}$.
While presented for classification in algorithm 7, gradient boosting can also be used for regression with
minor modifications. In fact, an interesting aspect of the algorithm presented here is that the base models
fbb (x) are found by solving a regression problem despite the fact that the algorithm produces a classifier.
The reason for this is that the negative gradient values {gib }ni=1 are quantitative variables, even if the data
{yi }ni=1 is qualitative. Here we have considered fitting a base model to these negative gradient values by
minimizing a square loss criterion.
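To make the procedure concrete, the sketch below implements a simple version of algorithm 7 for binary classification (y ∈ {−1, +1}) using the exponential loss, whose negative gradient is y·exp(−y·c), with regression trees from scikit-learn as base models. A fixed step size gamma is used instead of a line search, as a simplification; the names and default values are our own.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_train(X, y, B=100, gamma=0.1, max_depth=2):
        C = np.zeros(len(y))                      # C^0(x_i) = 0 for all training points
        base_models = []
        for _ in range(B):
            g = y * np.exp(-y * C)                # negative gradient of exp(-y c) w.r.t. c
            f = DecisionTreeRegressor(max_depth=max_depth)
            f.fit(X, g)                           # fit a regression tree to the gradient values
            C = C + gamma * f.predict(X)          # update the ensemble
            base_models.append(f)
        return base_models

    def gradient_boost_predict(base_models, X_star, gamma=0.1):
        C = sum(gamma * f.predict(X_star) for f in base_models)
        return np.sign(C)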
The value γ used in the algorithm (line 2(c)) is a tuning parameter which plays a similar role to the
step size in ordinary gradient descent. In practice this is usually found by line search (see Appendix B),
often combined with a type of regularization via shrinkage (Friedman 2001). When using trees as base
models—as is common in practice—optimizing the step size can be done jointly with finding the terminal
node values, resulting in a more efficient implementation (Friedman 2001).
As mentioned above, gradient boosting requires a certain amount of smoothness in the loss function.
A minimal requirement is that it is almost everywhere differentiable, so that it is possible to compute
the gradient of the loss function. However, some implementations of gradient boosting require stronger
conditions, such as second order differentiability. The binomial deviance (see Figure 6.3) is in this respect
a “safe choice” which is infinitely differentiable and strongly convex, while still enjoying good statistical
properties. As a consequence, binomial deviance is one of the most commonly used loss functions in
practice.
7 Neural networks and deep learning
Neural networks can be used for both regression and classification, and they can be seen as an extension of
linear regression and logistic regression, respectively. Traditionally neural networks with one so-called
hidden layer have been used and analysed, and several success stories came in the 1980s and early 1990s.
In the 2000s it was, however, realized that deep neural networks with several hidden layers, or simply deep
learning, are even more powerful. With the combination of new software, hardware, parallel algorithms
for training and a lot of training data, deep learning has made a major contribution to machine learning.
Deep learning has excelled in many applications, including image classification, speech recognition and
language translation. New applications, analysis, and algorithmic developments to deep learning are
published literally every day.
We will start in Section 7.1 by generalizing linear regression to a two-layer neural network (i.e., a neural
network with one hidden layer), and then generalize it further to a deep neural network. We thereafter
leave regression and look at the classification setting in Section 7.2. In Section 7.3 we present a special
neural network tailored for images, and finally we look into some of the details on how to train neural
networks in Section 7.4.
7.1 Neural networks for regression

We consider a nonlinear regression model on the form

y = f(x_1, \ldots, x_p; \theta) + \varepsilon, \qquad (7.1)

where $\varepsilon$ is an error term and the function f is parametrized by θ. Such a nonlinear function can be parametrized in many ways. In a neural network, the strategy is to use several layers of linear regression models and nonlinear activation functions. We will explain this carefully in turn below. For the model description it will be convenient to define z as the output without the noise term $\varepsilon$,
z = β0 1 + β1 x1 + β2 x2 + · · · + βp xp , (7.3)
which is shown in Figure 7.1a. Each input variable xj is represented with a node and each parameter βj
with a link. Furthermore, the output z is described as the sum of all terms βj xj . Note that we use 1 as the
input variable corresponding to the offset term β0 .
To describe nonlinear relationships between $x = [1\ x_1\ x_2\ \ldots\ x_p]^\mathsf{T}$ and z we introduce a nonlinear scalar function called the activation function $\sigma: \mathbb{R} \to \mathbb{R}$. The linear regression model (7.3) is now modified into a generalized linear regression model where the linear combination of the inputs is squashed through the (scalar) activation function

z = \sigma\big(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\big). \qquad (7.4)
This extension to the generalized linear regression model is visualized in Figure 7.1b.
Figure 7.1: Graphical illustration of a linear regression model (Figure 7.1a), and a generalized linear regression
model (Figure 7.1b). In Figure 7.1a, the output z is described as the sum of all terms β0 and {βj xj }pj=1 , as in (7.3).
In Figure 7.1b, the circle denotes addition and also transformation through the activation function σ, as in (7.4).
[Figure 7.2 panels: (a) the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$; (b) the rectified linear unit (ReLU) $\sigma(x) = \max(0, x)$.]
Figure 7.2: Two common activation functions used in neural networks. The logistic (or sigmoid) function
(Figure 7.2a), and the rectified linear unit (Figure 7.2b).
Common choices for activation function are the logistic function and the rectified linear unit (ReLU).
These are illustrated in Figure 7.2a and Figure 7.2b, respectively. The logistic (or sigmoid) function has
already been used in the context of logistic regression (Section 3.2). The logistic function is affine close to
x = 0 and saturates at 0 and 1 as x decreases or increases. The ReLU is even simpler. The function is the
identity function for positive inputs and equal to zero for negative inputs.
The logistic function used to be the standard choice of activation function in neural networks for many
years, whereas the ReLU has gained in popularity (despite its simplicity!) during recent years and it is
now the standard choice in most neural network models.
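Written out in code, the two activation functions are only one line each (a small NumPy sketch):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))   # saturates at 0 and 1

    def relu(x):
        return np.maximum(0.0, x)         # identity for positive inputs, zero otherwise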
The generalized linear regression model (7.4) is very simple and is itself not capable of describing
very complicated relationships between the input x and the output z. Therefore, we make two further
extensions to increase the generality of the model: We will first make use of several generalized linear
regression models to build a layer (which will lead us to the two-layer neural network) and then stack these
layers in a sequential construction (which will result in a deep neural network, or simply deep learning).
In (7.4), the output is constructed by one scalar regression model. To increase its flexibility and turn it into
a two-layer neural network, we instead let the output be a sum of M such generalized linear regression
models, each of which has its own parameters. The parameters for the ith regression model are $\beta_{0i}, \ldots, \beta_{pi}$ and we denote its output by $h_i$,

h_i = \sigma\big(\beta_{0i} + \beta_{1i} x_1 + \beta_{2i} x_2 + \cdots + \beta_{pi} x_p\big), \quad i = 1, \ldots, M. \qquad (7.5)

These intermediate outputs $h_i$ are so-called hidden units, since they are not the output of the whole model. The M different hidden units $\{h_i\}_{i=1}^{M}$ instead act as input variables to an additional linear regression model
z = β0 + β1 h1 + β2 h2 + · · · + βM hM . (7.6)
To distinguish the parameters in (7.5) and (7.6) we add the superscripts (1) and (2), respectively. The equations describing this two-layer neural network (or equivalently, a neural network with one layer of hidden units) are then (7.5) and (7.6) with these superscripts added.
Figure 7.3: A two-layer neural network, or equivalently, a neural network with one intermediate layer of hidden
units.
Extending the graphical illustration from Figure 7.1, this model can be depicted as a graph with two layers of links (illustrated using arrows), see Figure 7.3. As before, each link has a parameter associated with it. Note that we include an offset term not only in the input layer, but also in the hidden layer. Using matrix notation, the two-layer neural network can be written compactly as

h = \sigma\big(W^{(1)\mathsf{T}} x + b^{(1)\mathsf{T}}\big), \qquad (7.9a)
z = W^{(2)\mathsf{T}} h + b^{(2)\mathsf{T}}, \qquad (7.9b)

where we have also stacked the components in x and h as $x = [x_1, \ldots, x_p]^\mathsf{T}$ and $h = [h_1, \ldots, h_M]^\mathsf{T}$. The activation function σ acts element-wise. The two weight matrices and the two offset vectors1 are the parameters of the model.
1 The word "bias" is often used for the offset vector in the neural network literature, but this is really just a model parameter and not a bias in the statistical sense. To avoid confusion we refer to it as an offset instead.
By this we have described a nonlinear regression model on the form $y = f(x; \theta) + \varepsilon$ where $f(x; \theta) = z$
according to above. Note that the predicted output z in (7.9b) depends on all the parameters in θ even
though it is not explicitly stated in the notation.
The two-layer neural network is a useful model on its own, and a lot of research and analysis has been
done for it. However, the real descriptive power of a neural network is realized when we stack multiple
such layers of generalized linear regression models, and thereby achieve a deep neural network. Deep
neural networks can model complicated relationships (such as the one between an image and its class),
and are among the state-of-the-art methods in machine learning as of today.
We enumerate the layers with index l. Each layer is parametrized with a weight matrix W(l) and an
offset vector b(l) , as for the two-layer case. For example, W(1) and b(1) belong to layer l = 1, W(2) and
b(2) belong to layer l = 2, and so forth. We also have multiple layers of hidden units denoted by $h^{(l)}$. Each such layer consists of $M_l$ hidden units $h^{(l)} = [h_1^{(l)}, \ldots, h_{M_l}^{(l)}]^\mathsf{T}$, where the dimensions $M_1, M_2, \ldots$ can be different for different layers.
Each layer maps a hidden layer $h^{(l-1)}$ to the next hidden layer $h^{(l)}$ as

h^{(l)} = \sigma\big(W^{(l)\mathsf{T}} h^{(l-1)} + b^{(l)\mathsf{T}}\big).
This means that the layers are stacked such that the output of the first layer h(1) (the first layer of hidden
units) is the input to the second layer, the output of the second layer h(2) (the second layer of hidden units)
is the input to the third layer, etc. By stacking multiple layers we have constructed a deep neural network.
A deep neural network of L layers can mathematically be described as (cf. (7.9))

h^{(1)} = \sigma\big(W^{(1)\mathsf{T}} x + b^{(1)\mathsf{T}}\big), \quad h^{(2)} = \sigma\big(W^{(2)\mathsf{T}} h^{(1)} + b^{(2)\mathsf{T}}\big), \quad \ldots, \quad z = W^{(L)\mathsf{T}} h^{(L-1)} + b^{(L)\mathsf{T}}. \qquad (7.12)
Analogously to the parametric models presented earlier (e.g. linear regression and logistic regression) we
need to learn all the parameters in order to use the model. For deep neural networks the parameters are
\theta = \big[\text{vec}(W^{(1)})^\mathsf{T}\ \ \text{vec}(W^{(2)})^\mathsf{T}\ \cdots\ \text{vec}(W^{(L)})^\mathsf{T}\ \ b^{(1)\mathsf{T}}\ \ b^{(2)\mathsf{T}}\ \cdots\ b^{(L)\mathsf{T}}\big]^\mathsf{T}. \qquad (7.13)
Figure 7.4: A deep neural network with L layers. Each layer is parameterized with W(l) and b(l) .
The wider and deeper the network is, the more parameters there are. Practical deep neural networks can
easily have in the order of millions of parameters and these models are therefore also extremely flexible.
Hence, some mechanism to avoid overfitting is needed. Regularization such as ridge regression is common
(cf. Section 2.6), but there are also other techniques specific to deep learning; see further Section 7.4.4.
Furthermore, the more parameters there are, the more computational power is needed to train the model.
As before, the training data $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$ consists of n samples of the input x and the output y. For a regression problem we typically start with maximum likelihood and assume Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, and thereby obtain the squared error loss function as in Section 2.3.1,

\hat{\theta} = \arg\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, \theta) \quad \text{where} \quad L(x_i, y_i, \theta) = \|y_i - f(x_i; \theta)\|^2 = \|y_i - z_i\|^2. \qquad (7.14)
This problem can be solved with numerical optimization, and more precisely stochastic gradient descent.
This is described in more detail in Section 7.4.
From the model, the parameters θ, and the inputs $\{x_i\}_{i=1}^{n}$ we can compute the predicted outputs $\{z_i\}_{i=1}^{n}$ using the model $z_i = f(x_i; \theta)$. For example, for the two-layer neural network presented in Section 7.1.2 we have

h_i^\mathsf{T} = \sigma\big(x_i^\mathsf{T} W^{(1)} + b^{(1)}\big), \qquad (7.15a)
z_i^\mathsf{T} = h_i^\mathsf{T} W^{(2)} + b^{(2)}. \qquad (7.15b)
In (7.15) the equations are transposed in comparison to the model in (7.9). This is a small trick such that
we easily can extend (7.15) to include multiple data points i. Similar to (2.4) we stack all data points in
matrices, where each data point represents one row,

Y = \begin{bmatrix} y_1^\mathsf{T} \\ \vdots \\ y_n^\mathsf{T} \end{bmatrix}, \quad X = \begin{bmatrix} x_1^\mathsf{T} \\ \vdots \\ x_n^\mathsf{T} \end{bmatrix}, \quad Z = \begin{bmatrix} z_1^\mathsf{T} \\ \vdots \\ z_n^\mathsf{T} \end{bmatrix}, \quad \text{and} \quad H = \begin{bmatrix} h_1^\mathsf{T} \\ \vdots \\ h_n^\mathsf{T} \end{bmatrix}. \qquad (7.16)
We can then write (7.15) as
H = \sigma\big(X W^{(1)} + b^{(1)}\big), \qquad (7.17a)
Z = H W^{(2)} + b^{(2)}, \qquad (7.17b)
where we also have stacked the predicted output and the hidden units in matrices. This is also how the
model would be implemented in code. In Tensorflow, which will be used in the laboratory work in the
course, it can be written as
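(a minimal sketch with placeholder tensor names and sizes; the exact code used in the laboratory work may differ):

    import tensorflow as tf

    n, p, M = 4, 3, 5                                   # example sizes (placeholders)
    X  = tf.random.normal([n, p])                       # inputs, one data point per row
    W1 = tf.Variable(tf.random.normal([p, M]) * 0.1)    # weights, layer 1
    b1 = tf.Variable(tf.zeros([M]))                     # offsets, layer 1
    W2 = tf.Variable(tf.random.normal([M, 1]) * 0.1)    # weights, layer 2
    b2 = tf.Variable(tf.zeros([1]))                     # offset, layer 2

    H = tf.math.sigmoid(tf.matmul(X, W1) + b1)          # hidden units, cf. (7.17a)
    Z = tf.matmul(H, W2) + b2                           # predicted outputs, cf. (7.17b)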
Figure 7.5: A deep neural network with L layers for classification. The only difference to regression (Figure 7.4) is
the softmax transformation after layer L.
Note that in (7.17) the offset vectors b(1) and b(2) are added and broadcast to each row. See more details
regarding implementation of a neural network in the instructions for the laboratory work.
7.2 Neural networks for classification

Neural networks can also be used for classification, where we have qualitative outputs y ∈ {1, . . . , K}
instead of quantitative. In Section 3.2 we extended linear regression to logistic regression by adding the
logistic function to the output. In the same manner we can extend the neural network presented in the last
section to a neural network for classification. In doing this extension, we will use the multi-class version
of logistic regression presented in Section 3.2.3, and more specifically the softmax parametrization given
in (3.13), repeated here for convenience
\text{softmax}(z) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \begin{bmatrix} e^{z_1} & \cdots & e^{z_K} \end{bmatrix}^\mathsf{T}. \qquad (7.18)
The softmax function acts as an additional activation function on the final layer of the neural network.
In addition to the regression network in (7.12) we add the softmax function at the end of the network as

\vdots
z = W^{(L)\mathsf{T}} h^{(L-1)} + b^{(L)\mathsf{T}}, \qquad (7.19a)
\big[p(1 \,|\, x_i)\ \ p(2 \,|\, x_i)\ \ \ldots\ \ p(K \,|\, x_i)\big]^\mathsf{T} = \text{softmax}(z). \qquad (7.19b)
The softmax function thus maps the output of the last layer z1 , . . . , zK to the modeled class probabilities
p(1 | xi ), . . . , p(K | xi ), see also Figure 7.5 for a graphical illustration. The inputs to the softmax function,
i.e. the variables z1 , . . . , zK , are referred to as logits.
Note that the softmax function does not come as a layer with additional parameters, it only transforms
the previous output [z1 . . . zK ]T ∈ RK to the modeled probabilities [p(1 | xi ) . . . p(K | xi )]T ∈ [0, 1]K .
Also note that by construction of the softmax function, these values will sum to 1 regardless of the values
of $[z_1, \ldots, z_K]$ (otherwise they would not be probabilities).
To train the classification network we use the cross-entropy loss function,

L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log p(k \,|\, x_i; \theta), \qquad (7.20)

where $y_{ik} = 1$ if data point i belongs to class k and $y_{ik} = 0$ otherwise (a so-called one-hot encoding of the output). The cross-entropy is close to its minimum if the predicted probability $p(k \,|\, x_i; \theta)$ is close to 1 for the k for which $y_{ik} = 1$. For example, if the ith data point belongs to class k = 2 out of in total K = 3 classes we have $y_i = [0\ \ 1\ \ 0]^\mathsf{T}$. Assume that we have a set of parameters of the network denoted $\theta_A$,
and with these parameters we predict p(1 | xi ; θ A ) = 0.1, p(2 | xi ; θ A ) = 0.8 and p(3 | xi ; θ A ) = 0.1
meaning that we are quite sure that data point i actually belongs to class k = 2. This would generate a
low cross-entropy L(xi , yi , θ A ) = −(0 · log 0.1 + 1 · log 0.8 + 0 · log 0.1) = 0.22. If we instead predict
p(1 | xi ; θ B ) = 0.8, p(2 | xi ; θ B ) = 0.1 and p(3 | xi ; θ B ) = 0.1, the cross-entropy would be much higher
L(xi , yi , θ B ) = −(0 · log 0.8 + 1 · log 0.1 + 0 · log 0.1) = 2.30. For this case, we would indeed prefer
the parameters θ A over θ B . This is summarized in Figure 7.6.
Computing the loss function explicitly via the logarithm could lead to numerical problems when
p(k | xi ; θ) is close to zero since log(x) → −∞ as x → 0. This can be avoided since the logarithm in the
cross-entropy loss function (7.20) can “undo” the exponential in the softmax function (7.18),
L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log p(k \,|\, x_i; \theta) = -\sum_{k=1}^{K} y_{ik} \log\big[\text{softmax}(z_i)\big]_k
= -\sum_{k=1}^{K} y_{ik} \Big\{ z_{ik} - \log \sum_{j=1}^{K} e^{z_{ij}} \Big\}, \qquad (7.21)
= -\sum_{k=1}^{K} y_{ik} \Big\{ z_{ik} - \max_j z_{ij} - \log \sum_{j=1}^{K} e^{z_{ij} - \max_j z_{ij}} \Big\}, \qquad (7.22)

where $z_{ik}$ are the logits.
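A NumPy sketch of the numerically stable evaluation in (7.22), together with the example from the text, could look as follows (the function name is our own):

    import numpy as np

    def cross_entropy_from_logits(z, y_onehot):
        """Cross-entropy computed directly from the logits, cf. (7.22)."""
        z_shift = z - np.max(z)                                  # subtract max_j z_j for stability
        log_probs = z_shift - np.log(np.sum(np.exp(z_shift)))    # log of the softmax probabilities
        return -np.sum(y_onehot * log_probs)

    y = np.array([0.0, 1.0, 0.0])                 # data point belonging to class k = 2 of K = 3
    z_A = np.log(np.array([0.1, 0.8, 0.1]))       # logits giving the probabilities (0.1, 0.8, 0.1)
    print(cross_entropy_from_logits(z_A, y))      # approximately 0.22, as in the text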
7.3 Convolutional neural networks
Figure 7.7: Data representation of a grayscale image with 6 × 6 pixels. Each pixel is represented with a number
describing the grayscale color. We denote the whole image as X (a matrix), and each pixel value is an input variable
xj,k (element in the matrix X).
Figure 7.8: An illustration of the interactions in a convolutional layer: Each hidden unit (circle) is only dependent
on the pixels in a small region of the image (red boxes), here of size 3 × 3 pixels. The location of the hidden
unit corresponds to the location of the region in the image: if we move to a hidden unit one step to the right, the
corresponding region also moves one step to the right, compare Figure 7.8a and Figure 7.8b. Furthermore, the nine
parameters β1,1 , β1,2 , . . . , β3,3 are the same for all hidden units in the layer.
(1) (1) (1)
Figure 7.9: An illustration of zero-padding, which is used when the region is partly outside the image. With zero-padding, the size of the image can be preserved in the following layer.
A dense layer connecting every pixel to every hidden unit would, however, provide too much flexibility for images and we might not be able to capture the patterns of real importance,
and hence not generalize and perform well on unseen data. Instead, a convolutional layer appears to exploit
the structure present in images to find a more efficiently parameterized model. In contrast to a dense layer,
a convolutional layer leverages two important concepts – sparse interactions and parameter sharing – to
achieve such a parametrization.
Sparse interactions
With sparse interactions we mean that most of the parameters in a dense layer are forced to be equal to
zero. More specifically, a hidden unit in a convolutional layer only depends on the pixels in a small region
of the image and not on all pixels. In Figure 7.8 this region is of size 3 × 3. The position of the region
is related to the position of the hidden unit in its matrix topology. If we move to the hidden unit one
step to the right, the corresponding region in the image also moves one step to the right, as displayed by
comparing Figure 7.8a and Figure 7.8b. For the hidden units on the border, the corresponding region is
partly located outside the image. For these border cases, we typically use zero-padding where the missing
pixels are replaced with zeros. Zero-padding is illustrated in Figure 7.9.
Parameter sharing
In a dense layer each link between an input variable and a hidden unit has its own unique parameter. With
parameter sharing we instead let the same parameter be present in multiple places in the network. In a
convolutional layer the set of parameters for the different hidden units are all the same. For example, in
Figure 7.8a we use the same set of parameters to map the 3 × 3 region of pixels to the hidden unit as we do
in Figure 7.8b. Instead of learning separate sets of parameters for every position we only learn one such set
of a few parameters, and use it for all links between the input layer and the hidden units. We call this set of
parameters a kernel. The mapping between the input variables and the hidden units can be interpreted as a
convolution between the input variables and the kernel, hence the name convolutional neural network.
The sparse interactions and parameter sharing in a convolutional layer makes the CNN fairly invariant to
translations of objects in the image. If the parameters in the kernel are sensitive to a certain detail (such as
a corner, an edge, etc.) a hidden unit will react to this detail (or not) regardless of where in the image that
detail is present! Furthermore, a convolutional layer uses a lot fewer parameters than the corresponding
dense layer. In Figure 7.8 only 3 · 3 + 1 = 10 parameters are required (including the offset parameter). If
we had used a dense layer (36 + 1) · 36 = 1332 parameters would have been needed! Another way of
interpreting this is: with the same amount of parameters, a convolutional layer can encode more properties
of an image than a dense layer.
Figure 7.10: A convolutional layer with stride [2,2] and kernel of size 3 × 3.
The kernel does not have to be applied to every single pixel, but can instead be applied to, say, every second pixel. If we apply the kernel to every second pixel both row-wise and column-wise, the
hidden units will only have half as many rows and half as many columns. For a 6 × 6 image we get 3 × 3
hidden units. This concept is illustrated in Figure 7.10.
The stride controls how many pixels the kernel shifts over the image at each step. In Figure 7.8 the
stride is [1,1] since the kernel moves by one pixel both row- and column-wise. In Figure 7.10 the stride is
[2,2] since it moves by two pixels row- and column-wise. Note that the convolutional layer in Figure 7.10
still requires 10 parameters, as the convolutional layer in Figure 7.8 does. Another way of condensing the
information after a convolutional layer is by subsampling the data, so-called pooling. The interested reader can read further about pooling in Goodfellow, Bengio, and Courville 2016.
The networks presented in Figure 7.8 and 7.10 only have 10 parameters each. Even though this
parameterization comes with a lot of advantages, one kernel is probably not sufficient to encode all
interesting properties of the images in our data set. To extend the network, we add multiple kernels, each with their own set of kernel parameters. Each kernel produces its own set of hidden units—a so-called channel—using the same convolution operation as explained in Section 7.3.2. Hence, each layer of hidden units in a CNN is organized into a tensor with the dimensions (rows × columns × channels).
In Figure 7.11, the first layer of hidden units has four channels and that hidden layer consequently has
dimension 6 × 6 × 4.
When we continue to stack convolutional layers, each kernel depends not only on one channel, but on
all the channels in the previous layer. This is displayed in the second convolutional layer in Figure 7.11.
As a consequence, each kernel is a tensor of dimension (kernel rows × kernel columns × input channels).
For example, each kernel in the second convolutional layer in Figure 7.11 is of size 3 × 3 × 4. If we collect all kernel parameters in one weight tensor W, that tensor will be of dimension (kernel rows ×
kernel columns × input channels × output channels). In the second convolutional layer in Figure 7.11, the
corresponding weight matrix W(2) is a tensor of dimension 3 × 3 × 4 × 6. With multiple kernels in each
convolutional layer, each of them can be sensitive to different features in the image, such as certain edges,
lines or circles enabling a rich representation of the images in our training data.
A full CNN architecture consists of multiple convolutional layers. Typically, we decrease the number of
rows and columns in the hidden layers as we proceed through the network, but instead increase the number
of channels to enable the network to encode more high level features. After a few convolutional layers we
usually end a CNN with one or more dense layers. If we consider an image classification task, we place a
softmax layer at the very end to get outputs in the range [0,1]. The loss function when training a CNN will
be the same as in the regression and classification networks explained earlier, depending on which type of
problem we have at hand. In Figure 7.11 a small example of a full CNN architecture is displayed.
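As a sketch, a CNN with the same structure as in Figure 7.11 could be written with the Keras API in TensorFlow as below; the dense-layer width M3 and the number of classes K are placeholders, and the activation functions are our own choice.

    import tensorflow as tf

    K_classes, M3 = 10, 32                           # placeholders for K and M3

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(6, 6, 1)),      # 6 x 6 grayscale image
        tf.keras.layers.Conv2D(4, kernel_size=3, strides=1, padding="same",
                               activation="relu"),   # -> 6 x 6 x 4 (four kernels, stride [1,1])
        tf.keras.layers.Conv2D(6, kernel_size=3, strides=2, padding="same",
                               activation="relu"),   # -> 3 x 3 x 6 (six kernels, stride [2,2])
        tf.keras.layers.Flatten(),                   # -> 54 hidden units
        tf.keras.layers.Dense(M3, activation="relu"),
        tf.keras.layers.Dense(K_classes, activation="softmax"),   # class probabilities
    ])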
Figure 7.11: A full CNN architecture for classification of grayscale 6 × 6 images. In the first convolutional layer
four kernels, each of size 3 × 3, produce a hidden layer with four channels. The first channel (at the bottom) is visualized in red and the fourth (at the top) in blue. We use the stride [1,1] which maintains the number of rows and
columns. In the second convolutional layer, six kernels of size 3 × 3 × 4 and the stride [2,2] are used. They produce
a hidden layer with 3 rows, 3 columns and 6 channels. After the two convolutional layers follows a dense layer
where all 3 · 3 · 6 = 54 hidden units in the second hidden layer are densely connected to the third layer of hidden
units where all links have their unique parameters. We add an additional dense layer mapping down to the K logits.
The network ends with a softmax function to provide predicted class probabilities as output.
7.4 Training a neural network

To use a neural network for prediction we need to find an estimate $\hat{\theta}$ of the parameters $\theta$. To do that we solve an optimization problem on the form
\hat{\theta} = \arg\min_{\theta} J(\theta) \quad \text{where} \quad J(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, \theta). \qquad (7.23)
We denote J(θ) as the cost function and L(xi , yi , θ) as the loss function. The functional form of the loss function depends on whether we have a regression or a classification problem at hand, see e.g. (7.14) and (7.20).
These optimization problems can not be solved in closed form, so numerical optimization has to be
used. In Appendix B, an introduction to numerical optimization is provided. In all numerical optimization
algorithms the parameters are updated in an iterative manner. In deep learning we typically use various
versions of gradient descent:
1. Pick an initialization $\theta_0$.
2. Update the parameters as $\theta_{t+1} = \theta_t - \gamma \nabla_\theta J(\theta_t)$ for $t = 1, 2, \ldots$ (7.24)
3. Terminate when some criterion is fulfilled, and take the last $\theta_t$ as $\hat{\theta}$.
In many applications of deep learning we cannot afford to compute the exact gradient ∇θ J(θ t ) at each
iteration. Instead we use approximations, which are explained in Section 7.4.2. In Section 7.4.3 strategies for how to tune the learning rate γ are presented, and in Section 7.4.4 a popular regularization method called
dropout is described. First, however, a few words on how to initialize the training.
7.4.1 Initialization
The previous optimization problems (LASSO (2.31), logistic regression (3.8)) that we have encountered
have all been convex. This means that we can guarantee global convergence regardless of what initialization
θ 0 we use. In contrast, the cost function for training a neural network is usually non-convex. This means
that the training is sensitive to the value of the initial parameters. Typically, we initialize all the parameters
to small random numbers such that we ensure that the different hidden units encode different aspects of
the data. If the ReLU activation functions are used, offset elements b0 are typically initialized to a small
positive value such that it operates in the non-negative range of the ReLU.
Many problems that are addressed with deep learning contain more than a million training data points n,
and the design of the neural network is typically made such that θ has more than a million elements. This
provides a computational challenge.
A crucial component is the computation of the gradient required in the optimization routine (7.24)
\nabla_\theta J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta L(x_i, y_i, \theta). \qquad (7.25)
If the number of data points n is big, this operation is costly. However, we can often assume that the data set is redundant, meaning that many of the data points are similar. Then the gradient based on the first half of the data set, $\nabla_\theta J(\theta) \approx \frac{2}{n}\sum_{i=1}^{n/2} \nabla_\theta L(x_i, y_i, \theta)$, is almost identical to the gradient based on the second half of the data set, $\nabla_\theta J(\theta) \approx \frac{2}{n}\sum_{i=n/2+1}^{n} \nabla_\theta L(x_i, y_i, \theta)$. Consequently, it is a waste of time to compute the gradient based on the whole data set. Instead, we could compute the gradient based on the first half of the data set, update the parameters, and then get the gradient for the new parameters based on the second half of the data,

\theta_{t+1} = \theta_t - \gamma \frac{2}{n}\sum_{i=1}^{n/2} \nabla_\theta L(x_i, y_i, \theta_t), \qquad \theta_{t+2} = \theta_{t+1} - \gamma \frac{2}{n}\sum_{i=n/2+1}^{n} \nabla_\theta L(x_i, y_i, \theta_{t+1}). \qquad (7.26)
These two steps would only require roughly half the computational time in comparison to if we had used
the whole data set for each gradient computation.
The extreme version of this strategy would be to use only a single data point each time when computing
the gradient. However, most commonly when training a deep neural network we do something in between,
using more than one data point but not all data points when computing the gradient. We use a smaller set
of training data called a mini-batch. Typically, a mini-batch contains nb = 10, nb = 100 or nb = 1000
data points.
One important aspect when using mini-batches is that the different mini-batches are balanced and
representative for the whole data set. For example, if we have a big training data set with a few different
classes and the data set is sorted by class (i.e. samples belonging to class k = 1 come first, and so on), a mini-batch with the first nb samples would only include one class and hence not give a good approximation
of the gradient for the full data set.
For this reason, we prefer to draw nb training data points at random from the training data to form a
mini-batch. One implementation of this is to first randomly shuffle the training data, before dividing it into
mini-batches in an ordered manner. One complete pass through the training data is called an epoch. When we have completed one epoch, we do another random reshuffling of the training data and do another pass through the data set. We call this procedure stochastic gradient descent or mini-batch gradient descent. A
pseudo algorithm is presented in Algorithm 8.
Since the neural network model is a composition of multiple layers, the gradient of the loss function with respect to all the parameters, $\nabla_\theta L(x_i, y_i, \theta)\big|_{\theta=\theta_t}$, can be computed analytically and efficiently by applying the chain rule of differentiation. This is called back-propagation and is not described further here.
The interested reader can find more in, for example, Goodfellow, Bengio, and Courville 2016, Section 6.5.
Algorithm 8: Mini-batch gradient descent
1. Initialize the parameters $\theta_1$ and set $t \leftarrow 1$.
2. For i = 1 to E
   a) Randomly shuffle the training data $\{(x_i, y_i)\}_{i=1}^{n}$.
   b) For j = 1 to $n/n_b$
      i. Approximate the gradient of the loss function using the mini-batch $\{(x_i, y_i)\}_{i=(j-1)n_b+1}^{j n_b}$,
         $\hat{g}_t = \frac{1}{n_b}\sum_{i=(j-1)n_b+1}^{j n_b} \nabla_\theta L(x_i, y_i, \theta)\big|_{\theta=\theta_t}$.
      ii. Do a gradient step $\theta_{t+1} = \theta_t - \gamma \hat{g}_t$.
      iii. Update the iteration index $t \leftarrow t + 1$.
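A NumPy sketch of Algorithm 8 for a generic model is given below; grad_L is assumed to be a user-supplied function returning the average gradient over a mini-batch, and all names are our own.

    import numpy as np

    def minibatch_gradient_descent(grad_L, theta, X, y, E=10, n_b=100, gamma=0.01, seed=0):
        rng = np.random.default_rng(seed)
        n = len(y)
        for _ in range(E):                            # one pass through the data = one epoch
            perm = rng.permutation(n)                 # randomly shuffle the training data
            for start in range(0, n, n_b):
                idx = perm[start:start + n_b]         # indices of the current mini-batch
                g_hat = grad_L(X[idx], y[idx], theta) # approximate gradient
                theta = theta - gamma * g_hat         # gradient step
        return theta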
An important tuning parameter for (stochastic) gradient descent is the learning rate γ. The learning rate γ
decides the length of the gradient step that we take at each iteration. If we use a too low learning rate, the
estimate θ t from one iteration to the next will not change much and the learning will progress more slowly than necessary. This is illustrated in Figure 7.12a for a small optimization problem with only one parameter θ.
In contrast, with a too big learning rate, the estimate will pass the optimum and never converge since the
step is too long, see Figure 7.12b. For a learning rate which is neither too low nor too high, convergence is achieved in a reasonable number of iterations. A good strategy to find a good learning rate is:
• if the error keeps getting worse or oscillates widely, reduce the learning rate;
• if the error is decreasing fairly consistently but slowly, increase the learning rate.
Convergence with gradient descent can be achieved with a constant learning rate since the gradient
itself approaches zero when we reach the optimum, hence also the gradient step γ∇θ J(θ)|θ=θt . However,
this is not true for stochastic gradient descent since the gradient ĝt is only an approximation of the true
gradient ∇θ J(θ)|θ=θt , and ĝt will not necessarily approach 0 as J(θ) approaches its minimum. As a
consequence, we will make too large updates as we start approaching the optimum and the stochastic
gradient algorithm will not converge. In practice, we instead adjust the learning rate. We start with a fairly
high learning rate and then decay the learning rate to a certain level. This can, for example, be achieved by
the rule
\gamma_t = \gamma_{\min} + (\gamma_{\max} - \gamma_{\min}) e^{-t/\tau}. \qquad (7.27)
Here the learning rate starts at γmax and goes to γmin as t → ∞. How to pick the parameters γmin , γmax
and τ is more of an art than a science. As a rule of thumb, γmin can be chosen approximately as 1% of
γmax . The parameter τ depends on the size of the data set and the complexity of the problem, but should
be chosen such that multiple epochs have passed before we reach γmin . The strategy to pick γmax can be
the same as for normal gradient descent explained above.
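The decay rule (7.27) is a one-line function; the numbers below are hypothetical and follow the rules of thumb above.

    import numpy as np

    def learning_rate(t, gamma_max=0.1, gamma_min=0.001, tau=2000.0):
        """Exponentially decaying learning rate, cf. (7.27)."""
        return gamma_min + (gamma_max - gamma_min) * np.exp(-t / tau)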
Under certain regularity conditions and if the so-called Robbins–Monro condition holds, $\sum_{t=1}^{\infty} \gamma_t = \infty$ and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$, then stochastic gradient descent converges almost surely to a local minimum. However, to be able to satisfy the Robbins–Monro condition we need $\gamma_t \to 0$ as $t \to \infty$. In practice this is typically
not the case and we instead let the learning rate approach a non-zero level γmin > 0 by using a scheme like
the one in (7.27). This has been found to work better in practice in many situations, despite sacrificing the
theoretical convergence of the algorithm.
[Figure 7.12 panels: (a) low learning rate γ = 0.05, (b) high learning rate γ = 1.2, (c) good learning rate γ = 0.3.]
Figure 7.12: Optimization using gradient descent of a cost function J(θ) where θ is a scalar parameter. In the
different subfigures we use a too low learning rate (a), a too high learning rate (b), and a good learning rate (c).
7.4.4 Dropout
Like all models presented in this course, neural network models can suffer from overfitting if we have a
too flexible model in relation to the complexity of the data. Bagging (James et al. 2013, Chapter 8.2) is
one way to reduce the variance and by that also the overfitting of the model. In bagging we train an entire
ensemble of models. We train all models (ensemble members) on a different data set each, which has been
bootstrapped (sampled with replacement) from the original training data set. To make a prediction, we
first make one prediction with each model (ensemble member), and then average over all models to obtain
the final prediction.
Bagging is also applicable to neural networks. However, it comes with some practical problems; a large
neural network model usually takes quite some time to train and it also has quite some parameters to store.
To train not just one but an entire ensemble of many large neural networks would thus be very costly, both
in terms of runtime and memory. Dropout (Srivastava et al. 2014) is a bagging-like technique that allows
us to combine many neural networks without the need to train them separately. The trick is to let the
different models share parameters with each other, which reduces the computational cost and memory
requirement.
Ensemble of sub-networks
Consider a neural network like the one in Figure 7.13a. In dropout we construct the equivalent to an
ensemble member by randomly removing some of the hidden units. We say that we drop the units, hence
the name dropout. By this we achieve a sub-network of our original network. Two such sub-networks
are displayed in Figure 7.13b. We randomly sample with a pre-defined probability which units to drop,
and the collection of dropped units in one sub-network is independent from the collection of dropped
units in another sub-network. When a unit is removed, also all its incoming and outgoing connections are
removed. Not only hidden units can be dropped, but also input variables.
Since all sub-networks are constructed from the very same original network, the different sub-networks share some
parameters with each other. For example, in Figure 7.13b the parameter β55^(1) is present in both sub-networks.
The fact that they share parameters allows us to train the ensemble of sub-networks in an
efficient manner.
To train with dropout we use the mini-batch gradient descent algorithm described in Algorithm 8. In
each gradient step a mini-batch of data is used to compute an approximation of the gradient, as before.
However, instead of computing the gradient for the full network, we generate a random sub-network by
Figure 7.13: A neural network with two hidden layers (a), and two sub-networks with dropped units (b). The
collection of units that have been dropped is independent between the two sub-networks.
randomly dropping units as described above. We compute the gradient for that sub-network as if the
dropped units were not present and then do a gradient step. This gradient step only updates the parameters
present in the sub-network. The parameters that are not present are left untouched. In the next gradient
step we grab another mini-batch of data, remove another randomly selected collection of units and update
the parameters present in that sub-network. We proceed in this manner until some terminal condition is
fulfilled.
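To make the procedure concrete, the sketch below (our own simplified illustration: one hidden layer with tanh units, squared-error loss, dropout only on the hidden units, not code from these notes) draws a new dropout mask for each mini-batch and back-propagates through the resulting sub-network, so that only parameters present in the sub-network are updated.

import numpy as np

rng = np.random.default_rng(0)

def dropout_sgd_step(X, Y, W1, b1, W2, b2, keep_prob, lr):
    """One mini-batch gradient step with dropout on the hidden units (a sketch)."""
    mask = (rng.random(W1.shape[0]) < keep_prob).astype(float)  # 1 = keep unit, 0 = drop unit
    Z1 = X @ W1.T + b1                      # (batch, hidden)
    H = np.tanh(Z1) * mask                  # dropped units output exactly zero
    Y_hat = H @ W2 + b2
    err = Y_hat - Y                         # (batch,)
    m = len(Y)
    # Back-propagate through the sub-network; dropped units get zero gradient
    # (the factor 1/m of the mean squared-error loss is applied in dW1, db1 below).
    dW2 = H.T @ err / m
    db2 = err.mean()
    dH = np.outer(err, W2) * mask
    dZ1 = dH * (1.0 - np.tanh(Z1) ** 2)
    dW1 = dZ1.T @ X / m
    db1 = dZ1.mean(axis=0)
    # Only parameters connected to kept units receive a non-zero update.
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2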
Dropout vs bagging
This procedure to generate an ensemble of models differs from bagging in a few ways:
• In bagging all models are independent in the sense that they have their own parameters. In dropout
the different models (the sub-networks) share parameters.
• In bagging each model is trained until convergence. In dropout each sub-network is only trained for
a single gradient step. However, since they share parameters, all models are updated also when
the other sub-networks are trained.
• Similar to bagging, in dropout we train each model on a data set that has been randomly selected
from our training data. However, in bagging we usually do it on a bootstrapped version of the whole
data set whereas in dropout each model is trained on a randomly selected mini-batch of data.
Even though dropout differs from bagging in some aspects, it has empirically been shown to enjoy similar
properties as bagging in terms of avoiding overfitting and reducing the variance of the model.
After we have trained the sub-networks, we want to make a prediction based on an unseen input data
point x? . In bagging we evaluate all the different models in the ensemble and combine their results. This
would be infeasible in dropout due to the very large (combinatorial) number of possible sub-networks.
Figure 7.14: The network used for prediction after being trained with dropout. All units and links are present (no
dropout), but the weights going out from a certain unit are multiplied by the probability of that unit being included
during training. This compensates for the fact that some of them were dropped during training. Here all units
have been kept with probability r during training (and dropped with probability 1 − r).
However, there is a simple trick to approximately achieve the same result. Instead of evaluating all possible
sub-networks we simply evaluate the full network containing all the parameters. To compensate for the
fact that the model was trained with dropout, we multiply each estimated parameter going out from a
unit with the probability of that unit being included during training. This ensures that the expected value
of the input to a unit is the same during training and testing, as during training only a fraction of the
incoming links were active. For instance, if we during training kept each unit with probability p
in all layers, then during testing we multiply all estimated parameters by p before we make a prediction
based on the network. This is illustrated in Figure 7.14. This procedure of approximating the average over all
ensemble members has been shown to work surprisingly well in practice even though there is not yet any
solid theoretical argument for the accuracy of this approximation.
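Continuing the same simplified single-hidden-layer sketch from above, prediction after dropout training then uses the full network with the outgoing weights of the hidden units scaled by the keep probability.

import numpy as np

def dropout_predict(X, W1, b1, W2, b2, keep_prob):
    """Predict with the full network, scaling weights as in Figure 7.14 (a sketch)."""
    H = np.tanh(X @ W1.T + b1)
    # The weights going out from the hidden units are multiplied by the keep probability,
    # so that the expected input to the output unit matches what was seen during training.
    return H @ (keep_prob * W2) + b2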
As a way to reduce the variance and avoid overfitting, dropout can be seen as a regularization method.
There are plenty of other regularization methods for neural networks including parameter penalties (like
we did in ridge regression and LASSO in Section 2.6.1 and 2.6.2), early stopping (you stop the training
before the parameters have converged, and thereby avoid overfitting) and various sparse representations
(for example CNNs can be seen as a regularization method where most parameters are forced to be zero) to
mention a few. Since its invention, dropout has become one of the most popular regularization techniques
due to its simplicity, computationally cheap training and testing procedure and its good performance. In
fact, a good practice when designing a neural network is often to extend the network until you see that it
starts overfitting, extend it a bit more, and then add a regularization method like dropout to avoid that overfitting.
7.5 Perspective and further reading
In areas such as image recognition, these deep models are now the dominant methods in use and they reach almost human
performance on some specific tasks (LeCun, Bengio, and Hinton 2015). Recent advances based on deep
neural networks have generated algorithms that can learn how to play computer games based on pixel
information only (Mnih et al. 2015), and automatically understand the situation in images for automatic
caption generation (Xu et al. 2015).
A fairly recent and accessible introduction and overview of deep learning is provided by LeCun, Bengio,
and Hinton (2015), and a recent textbook by Goodfellow, Bengio, and Courville (2016).
A Probability theory
A random variable z with a uniform distribution on the interval [0, 3] has the pdf p(z) = 1/3 for z ∈ [0, 3],
and otherwise p(z) = 0. Note that pmfs are upper bounded by 1, whereas a pdf can possibly take values
larger than 1. However, it holds for pdfs that they always integrate¹ to 1: ∫ p(z) dz = 1.
A common probability distribution is the Gaussian (or Normal) distribution, whose density is defined as
p(z) = N(z | µ, σ²) = (1/(σ√(2π))) exp(−(z − µ)²/(2σ²)), (A.2)
where we have made use of exp to denote the exponential function, exp(x) = e^x. We also use the notation
z ∼ N(µ, σ²) to say that z has a Gaussian distribution with parameters µ and σ² (i.e., its probability
density function is given by (A.2)). The symbol ∼ reads ‘distributed according to’.
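As a small numerical illustration of (A.2), and of the fact that a pdf may take values larger than 1 but still integrates to one, we can evaluate a narrow Gaussian on a grid and approximate its integral with a Riemann sum (the parameter values below are only illustrative).

import numpy as np

def gaussian_pdf(z, mu, sigma2):
    # Equation (A.2)
    return np.exp(-(z - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

mu, sigma2 = 0.0, 0.1 ** 2          # a narrow Gaussian: the density exceeds 1 near mu
z = np.linspace(-1, 1, 200001)
p = gaussian_pdf(z, mu, sigma2)
print(p.max())                       # about 3.99, i.e. larger than 1
print(p.sum() * (z[1] - z[0]))       # approximately 1: the density integrates to one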
The expected value or mean of the random variable z is given by
E[z] = ∫ z p(z) dz. (A.3)
We can also compute the expected value of some arbitrary function g(z) applied to z as
E[g(z)] = ∫ g(z) p(z) dz. (A.4)
For a scalar random variable with mean µ = E[z] the variance is defined as
Var[z] = E[(z − µ)²] = ∫ (z − µ)² p(z) dz. (A.5)
The variance measures the ‘spread’ of the distribution, i.e. how far a set of random numbers drawn from
the distribution are spread out from their mean. The variance is always non-negative. For the Gaussian
distribution (A.2) the mean and variance are given by the parameters µ and σ², respectively.
Now, consider two random variables z1 and z2 (both of which could be vectors). An important property
of pairs of random variables is that of independence. The variables z1 and z2 are said to be independent
¹ For notational convenience, when the integration is over the whole domain of z we simply write ∫.
Figure A.1: Illustration of a two-dimensional joint probability distribution p(z1, z2) (the surface) and its two
marginal distributions p(z1) and p(z2) (the black lines). We also illustrate the conditional distribution p(z1 | z2 = γ)
(the red line), which is the distribution of the random variable z1 conditioned on the observation z2 = γ (γ = 1.5 in
the plot).
if the joint pdf factorizes according to p(z1 , z2 ) = p(z1 )p(z2 ). Furthermore, for independent random
variables the expected value of any separable function factorizes as E[g1 (z1 )g2 (z2 )] = E[g1 (z1 )]E[g2 (z2 )].
From the joint probability density function we can deduce both its two marginal densities p(z1 ) and
p(z2 ) using marginalization, as well as the so called conditional probability density function p(z2 | z1 )
using conditioning. These two concepts will be explained below.
A.1.1 Marginalization
Consider a multivariate random variable z which is composed of two components z1 and z2 , which
could be either scalars or vectors, as z = [z1T , z2T ]T . If we know the (joint) probability density function
p(z) = p(z1 , z2 ), but are interested only in the marginal distribution for z1 , we can obtain the density
p(z1 ) by marginalization
p(z1) = ∫ p(z1, z2) dz2. (A.6)
The other marginal p(z2) is obtained analogously by integrating over z1 instead. In Figure A.1 a joint
two-dimensional density p(z1, z2) is illustrated along with its marginal densities p(z1) and p(z2).
A.1.2 Conditioning
Consider again the multivariate random variable z which can be partitioned in two parts z = [z1T , z2T ]T .
We can now define the conditional distribution of z1 , conditioned on having observed a value z2 = z2 , as
p(z1 | z2) = p(z1, z2) / p(z2). (A.7)
If we instead have observed a value of z1 = z1 and want to use that to find the conditional distribution of
z2 given z1 = z1 , it can be done analogously. In Figure A.1 a joint two-dimensional probability density
function p(z1 , z2 ) is illustrated along with a conditional probability density function p(z1 | z2 ).
From (A.7) it follows that the joint probability density function p(z1, z2) can be factorized into the
product of a marginal times a conditional,
p(z1, z2) = p(z1 | z2) p(z2) = p(z2 | z1) p(z1). (A.8)
If we use this factorization for the numerator of the right-hand side in (A.7) we end up with the
relationship
p(z1 | z2) = p(z2 | z1) p(z1) / p(z2). (A.9)
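The following sketch illustrates (A.6) and (A.7) numerically by discretizing a (made-up, purely illustrative) two-dimensional joint density on a grid and computing a marginal and a conditional as normalized sums.

import numpy as np

# Discretize a joint density p(z1, z2) on a grid (an arbitrary correlated Gaussian, for illustration).
z1 = np.linspace(-4, 4, 400)
z2 = np.linspace(-4, 4, 400)
dz = z1[1] - z1[0]
Z1, Z2 = np.meshgrid(z1, z2, indexing="ij")
joint = np.exp(-(Z1 ** 2 - 1.2 * Z1 * Z2 + Z2 ** 2))   # unnormalized
joint /= joint.sum() * dz * dz                          # normalize so it integrates to 1

# Marginalization (A.6): integrate out z2 (here: a sum over the grid).
p_z1 = joint.sum(axis=1) * dz
p_z2 = joint.sum(axis=0) * dz
# Conditioning (A.7): p(z1 | z2 = gamma) = p(z1, z2 = gamma) / p(z2 = gamma).
k = np.argmin(np.abs(z2 - 1.5))                         # gamma = 1.5, as in Figure A.1
p_z1_given_z2 = joint[:, k] / p_z2[k]
print(p_z1.sum() * dz, p_z1_given_z2.sum() * dz)        # both approximately 1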
A.2 Approximating an integral with a sum
An expected value of the form E[g(z)] = ∫ g(z) p(z) dz can be approximated by a sample average,
E[g(z)] ≈ (1/M) Σ_{i=1}^{M} g(zi),
if each zi is drawn independently from zi ∼ p(z). This is called Monte Carlo integration. The approximate
equality becomes exact with probability one in the limit as the number of samples M → ∞.
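As a small example, we can estimate E[z²] for z ∼ N(1, 2²) by Monte Carlo integration and compare with the exact value µ² + σ² = 5 (the choice of g and of the distribution is ours, only for illustration).

import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0

# Estimate E[g(z)] with g(z) = z^2 and z ~ N(mu, sigma^2) by a sample average.
for M in [100, 10_000, 1_000_000]:
    z = rng.normal(mu, sigma, size=M)
    print(M, np.mean(z ** 2))

# Exact value for comparison: E[z^2] = mu^2 + sigma^2 = 5.
print("exact:", mu ** 2 + sigma ** 2)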
B Unconstrained numerical optimization
Given a function L(θ), the optimization problem is about finding the value of the variable θ for which the
function L(θ) is either minimized or maximized. To be precise, it will here be formulated as finding the
value θ̂ that minimizes¹ the function L(θ) according to
θ̂ = arg minθ L(θ), (B.1)
where the vector θ is allowed to be anywhere in Rn , motivating the name unconstrained optimization. The
function L(θ) will be referred to as the cost function2 , with the motivation that the minimization problem
in (B.1) is striving to minimize some cost. We will make the assumption that the cost function L(θ) is
continuously differentiable on Rn . If there are requirements on θ (e.g. that its components θ have to satisfy
a certain equation g(θ) = 0) the problem is instead referred to as a constrained optimization problem.
The unconstrained optimization problem (B.1) is ever-present across the sciences and engineering, since
it allows us to find the best—in some sense—solution to a particular problem. One example of this arises
when we are searching for the parameters in a linear regression problem by finding the parameters that
make the available measurements as likely as possible by maximizing the likelihood function. For a linear
model with Gaussian noise, this resulted in a least squares problem, for which there is an explicit expression
(the normal equations, (2.17)) describing the solution. However, for most optimization problems that
we face there are no explicit solutions available, forcing us to use approximate numerical methods in
solving these problems. We have seen several concrete examples of this kind, for example the optimization
problems arising in deep learning and logistic regression. This appendix provides a brief introduction to
the practical area of unconstrained numerical optimization.
The key in assembling a working optimization algorithm is to build a simple and useful model of the
complicated cost function L(θ) around the current value for θ. The model is often local in the sense that
it is only valid in a neighbourhood of this value. The idea is then to exploit this model to select a new
value for θ that corresponds to a smaller value for the cost function L(θ). The procedure is then repeated,
which explains why most numerical optimization algorithms are of iterative nature. There are of course
many different ways in which this can be done, but they all share a few key parts which we outline below.
Note that we only aim to provide the overall strategies underlying practical unconstrained optimization
algorithms, for precise details we refer to the many textbooks available on the subject, some of which are
referenced towards the end.
be another increment d1 that we can add to θ1 such that L(θ1 + d1 ) < L(θ1 ). This procedure is repeated
until it is no longer possible to find an increment that decreases the value of the objective function. We have
then found a local minimizer. Most of the algorithms capable of solving (B.1) are iterative procedures of
this kind. Before moving on, let us mention that the increment d is often resolved into two parts according
to
d = γp. (B.2)
Here, the scalar and positive parameter γ is commonly referred to as the step length and the vector p ∈ Rn
is referred to as the search direction. The intuition is that the algorithm is searching for the solution by
moving in the search direction, and how far it moves in this direction is controlled by the step length.
The above development does of course lead to several questions, where the most pertinent are the
following:
1. How can we compute a useful search direction p?
2. How big steps should we make, i.e. what is a good value of the step length γ?
3. How do we determine when we have reached a local minimizer, and stop searching for new
directions?
Throughout the rest of this section we will briefly discuss these questions and finally we will assemble the
general form of an algorithm that is often used for unconstrained minimization.
A straightforward way of finding a general characterization of all search directions p resulting in a
decrease in the value of the cost function, i.e. directions p such that
L(θ + p) < L(θ), (B.3)
is to build a local model of the cost function around the point θ. One model of this kind is provided by
Taylor's theorem, which builds a local polynomial approximation of a function around some point of
interest. A linear approximation of the cost function L(θ) around the point θ is given by
L(θ + p) ≈ L(θ) + pT ∇L(θ). (B.4)
By inserting the linear approximation (B.4) of the objective function into (B.3) we can provide a more
precise formulation of how to find a search direction p such that L(θ + p) < L(θ) by asking for which p it
holds that L(θ) + pT ∇L(θ) < L(θ), which can be further simplified into
pT ∇L(θ) < 0. (B.5)
Inspired by the inequality above we choose a generic description of the search direction according to
p = −V ∇L(θ), V ≻ 0, (B.6)
where we have introduced some extra flexibility via the positive definite scaling matrix V . The inspiration
came from the fact that by inserting (B.6) into (B.5) we obtain
pT ∇L(θ) = −∇L(θ)T V ∇L(θ) = −‖∇L(θ)‖²V < 0, (B.7)
where the last inequality follows from the positivity of the squared weighted two-norm, which is defined
as ‖a‖²W = aT W a. This shows that p = −V ∇L(θ) will indeed result in a search direction that decreases
the value of the objective function. We refer to such a search direction as a descent direction.
The strategy summarized in Algorithm 9 is referred to as line search. Note that we have now introduced
the subscript t to clearly show the iterative nature. The algorithm searches along the line defined by starting at
the current iterate θt and then moving along the search direction pt. The decision of how far to move
along this line is made by simply minimizing the cost function along the line,
γt = arg minγ>0 L(θt + γpt). (B.8)
Algorithm 9: Line search
1. Set t ← 1 and initialize θ1.
2. while the stopping criterion is not satisfied do
a) Compute a search direction pt.
b) Compute a step length γt according to (B.8).
c) Update θt+1 = θt + γt pt and set t ← t + 1.
3. end while
Note that this is a one-dimensional optimization problem, and hence simpler to deal with compared to the
original problem. The step length γt that is selected in (B.8) controls how far to move along the current
search direction pt . It is sufficient to solve this problem approximately in order to find an acceptable step
length, since as long as L(θt + γt pt ) < L(θt ) it is not crucial to find the global minimizer for (B.8).
There are several different indicators that can be used in designing a suitable stopping criterion for row 2
in Algorithm 9. The task of the stopping criterion is to control when to stop the iterations. Since we know
that the gradient is zero at a stationary point, it is useful to investigate when the gradient is close to zero.
Another indicator is to keep an eye on the size of the increments between adjacent iterates, i.e. when θt+1
is close to θt.
In the so-called trust region strategy the order of step 2a and step 2b in Algorithm 9 is simply reversed,
i.e. we first decide how far to step and then we choose in which direction to move.
B.2 Commonly used search directions
B.2.1 Steepest descent
A first idea for choosing a search direction is to write the scalar product³ between the search direction and the gradient as
pT ∇L(θt) = ‖p‖ ‖∇L(θt)‖ cos(ϕ), (B.9)
where ϕ denotes the angle between the two vectors p and ∇L(θt). Since we are only interested in finding
the direction we can without loss of generality fix the length of p, implying that the scalar product pT ∇L(θt)
is made as small as possible by selecting ϕ = π, corresponding to
p = −∇L(θt ). (B.10)
Recall that the gradient vector at a point is the direction of maximum rate of change of the function at that
point. This explains why the search direction suggested in (B.10) is referred to as the steepest descent
direction.
³ The scalar (or dot) product of two vectors a and b is defined as aT b = ‖a‖‖b‖ cos(ϕ), where ‖a‖ denotes the length
(magnitude) of the vector a and ϕ denotes the angle between a and b.
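A minimal sketch of the line-search strategy with the steepest descent direction (B.10) is given below; the crude backtracking rule for the step length and the quadratic test function are our own illustrative choices, not prescribed by the text.

import numpy as np

def steepest_descent(L, grad, theta0, tol=1e-6, max_iter=1000):
    """Minimize L by moving along p = -grad(L) with a backtracking line search (a sketch)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:          # stopping criterion: gradient close to zero
            break
        p = -g                               # steepest descent direction, eq. (B.10)
        gamma = 1.0
        # Crude backtracking: shrink the step until the cost decreases (approximates (B.8)).
        while L(theta + gamma * p) >= L(theta) and gamma > 1e-12:
            gamma *= 0.5
        theta = theta + gamma * p
    return theta

# Example: a simple quadratic cost (illustrative only).
L = lambda th: 0.5 * th[0] ** 2 + 2.0 * th[1] ** 2
grad = lambda th: np.array([th[0], 4.0 * th[1]])
print(steepest_descent(L, grad, [3.0, -2.0]))   # approaches the minimizer at the origin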
Sometimes, the use of the steepest descent direction can be very slow. The reason for this is that there is
more information available about the cost function that the algorithm can make use of, which brings us to
the Newton and the quasi-Newton directions described below. They make use of additional information
about the local geometry of the cost function by employing a more descriptive local model.
B.2.2 Newton
The Newton direction is derived from a quadratic model of the cost function around the current iterate,
m(θt, pt) = L(θt) + gtT pt + (1/2) ptT Ht pt, (B.11)
where gt = ∇L(θ)|θ=θt denotes the cost function gradient and Ht = ∇2 L(θ)|θ=θt denotes the Hessian,
both evaluated at the current iterate θt . The idea behind the Newton direction is to select the search
direction that minimizes the quadratic model in (B.11), which is obtained by setting its derivative
∂m(θt, pt)/∂pt = gt + Ht pt (B.12)
to zero, resulting in
pt = −Ht⁻¹ gt. (B.13)
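For comparison with the steepest descent sketch above, a single Newton step (B.13) can be computed by solving the linear system Ht pt = −gt rather than forming the inverse explicitly; the test function below is again only an illustrative choice of ours.

import numpy as np

def newton_step(grad, hess, theta):
    """One Newton step: p_t = -H_t^{-1} g_t, eq. (B.13), computed via a linear solve."""
    g = grad(theta)
    H = hess(theta)
    p = np.linalg.solve(H, -g)
    return theta + p

# On a quadratic cost the Newton step reaches the minimizer in a single iteration.
grad = lambda th: np.array([th[0], 4.0 * th[1]])
hess = lambda th: np.diag([1.0, 4.0])
print(newton_step(grad, hess, np.array([3.0, -2.0])))   # [0. 0.]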
It is often too difficult or too expensive to compute the Hessian, which has motivated the development of
search directions employing an approximation of the Hessian. The generic name for these are quasi-Newton
directions.
B.2.3 Quasi-Newton
The quasi-Newton direction makes use of a local quadratic model m(θt , pt ) of the cost function according
to (B.11), similarly to what was done in finding the Newton direction. However, rather than assuming that
the Hessian is available, the Hessian will now instead be learned from the information that is available in
the cost function values and its gradients.
Let us first denote the line segment connecting two adjacent iterates θt and θt+1 by
rt(τ) = θt + τ(θt+1 − θt), τ ∈ [0, 1]. (B.14)
By the fundamental theorem of calculus, the difference between the gradients at these two iterates can then
be written as an integral of the Hessian along this segment,
yt = ( ∫₀¹ ∇2L(rt(τ)) dτ ) st, (B.17)
where we have defined yt = gt+1 − gt and st = θt+1 − θt. An interpretation of the above equation is
that the difference between two consecutive gradients yt is given by integrating the Hessian times st for
points θ along the line segment rt (τ ) defined in (B.14). The approximation underlying quasi-Newton
methods is now to assume that this integral can be described by a constant matrix Bt+1 , resulting in the
following approximation
yt = Bt+1 st (B.18)
of the integral (B.17), which is sometimes referred to as the secant condition or the quasi-Newton equation.
The secant condition above is still not enough to determine the matrix Bt+1 , since even though we
know that Bt+1 is symmetric there are still too many degrees of freedom available. This is solved using
regularization and Bt+1 is selected as the solution to
minB ‖B − Bt‖W subject to B = BT and B st = yt, (B.19)
for some weighting matrix W. Depending on which weighting matrix is used we obtain different
algorithms. The most common quasi-Newton algorithms are referred to as BFGS (named after Broyden,
Fletcher, Goldfarb and Shanno), DFP (named after Davidon, Fletcher and Powell) and Broyden’s method.
The resulting Hessian approximation Bt+1 is then used in place of the true Hessian.
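As an illustration of the secant condition, the standard BFGS update (one of the methods named above; the explicit formula is not derived in these notes, so take it as a stated standard result) constructs Bt+1 from Bt, st and yt in such a way that (B.18) holds by construction.

import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation; the result satisfies B_next @ s == y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Quick check of the secant condition (B.18) on random data (illustrative only).
rng = np.random.default_rng(0)
B = np.eye(3)
s = rng.normal(size=3)
y = s + 0.1 * rng.normal(size=3)           # mimics y_t = g_{t+1} - g_t for a nearby Hessian
B_next = bfgs_update(B, s, y)
print(np.allclose(B_next @ s, y))           # True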
Bibliography
Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin (2012). Learning From Data. A short
course. AMLbook.com.
Barber, David (2012). Bayesian reasoning and machine learning. Cambridge University Press.
Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
Bottou, L., F. E. Curtis, and J. Nocedal (2017). Optimization methods for large-scale machine learning.
Tech. rep. arXiv:1606.04838v2.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge, UK: Cambridge University
Press.
Breiman, Leo (Oct. 2001). “Random Forests”. In: Machine Learning 45.1, pp. 5–32. issn: 1573-0565. doi:
10.1023/A:1010933404324. url: https://doi.org/10.1023/A:1010933404324.
Deisenroth, M. P., A. Faisal, and C. S. Ong (2019). Mathematics for machine learning. Cambridge
University Press.
Dheeru, Dua and Efi Karra Taniskidou (2017). UCI Machine Learning Repository. url: http://archive.
ics.uci.edu/ml.
Efron, Bradley and Trevor Hastie (2016). Computer age statistical inference. Cambridge University Press.
Ezekiel, Mordecai and Karl A. Fox (1959). Methods of Correlation and Regression Analysis. John Wiley
& Sons, Inc.
Freund, Yoav and Robert E. Schapire (1996). “Experiments with a new boosting algorithm”. In: Proceedings
of the 13th International Conference on Machine Learning (ICML).
Friedman, Jerome (2001). “Greedy function approximation: A gradient boosting machine”. In: Annals of
Statistics 29.5, pp. 1189–1232.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2000). “Additive logistic regression: a statistical
view of boosting (with discussion)”. In: The Annals of Statistics 28.2, pp. 337–407.
Gelman, Andrew et al. (2013). Bayesian data analysis. 3rd ed. CRC Press.
Ghahramani, Zoubin (May 2015). “Probabilistic machine learning and artificial intelligence”. In: Nature
521.7553, pp. 452–459.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. http://www.deeplearningbook.
org. MIT Press.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The elements of statistical learning. Data
mining, inference, and prediction. 2nd ed. Springer.
Hastie, Trevor, Robert Tibshirani, and Martin J. Wainwright (2015). Statistical learning with sparsity: the
Lasso and generalizations. CRC Press.
Hoerl, Arthur E. and Robert W. Kennard (1970). “Ridge regression: biased estimation for nonorthogonal
problems”. In: Technometrics 12.1, pp. 55–67.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). An introduction to statistical
learning. With applications in R. Springer.
Jordan, M. I. and T. M. Mitchell (2015). “Machine learning: trends, perspectives, and prospects”. In:
Science 349.6245, pp. 255–260.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521, pp. 436–444.
LeCun, Yann, Bernhard Boser, et al. (1990). “Handwritten Digit Recognition with a Back-Propagation
Network”. In: Advances in Neural Information Processing Systems (NIPS), pp. 396–404.
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University
Press.
Mason, Llew, Jonathan Baxter, Peter Bartlett, and Marcus Frean (1999). “Boosting Algorithms as Gradient
Descent”. In: Proceedings of the 12th International Conference on Neural Information Processing
Systems (NIPS).
McCulloch, Warren S and Walter Pitts (1943). “A logical calculus of the ideas immanent in nervous
activity”. In: The bulletin of mathematical biophysics 5.4, pp. 115–133.
Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature
518.7540, pp. 529–533.
Murphy, Kevin P. (2012). Machine learning – a probabilistic perspective. MIT Press.
Nocedal, J. and S. J. Wright (2006). Numerical Optimization. 2nd ed. Springer Series in Operations
Research. New York, USA: Springer.
Srivastava, Nitish et al. (2014). “Dropout: A simple way to prevent neural networks from overfitting”. In:
The Journal of Machine Learning Research 15.1, pp. 1929–1958.
Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the LASSO”. In: Journal of the Royal
Statistical Society (Series B) 58.1, pp. 267–288.
Wills, A. G. (2017). “Real-time optimisation for embedded systems”. Lecture notes.
Xu, Kelvin et al. (2015). “Show, attend and tell: Neural image caption generation with visual attention”.
In: Proceedings of the International Conference on Machine Learning (ICML).