Supervised Machine Learning


Lecture notes for the Statistical Machine Learning course

Andreas Lindholm, Niklas Wahlström,


Fredrik Lindsten, Thomas B. Schön

Version: March 12, 2019

Department of Information Technology, Uppsala University



0.1 About these lecture notes
These lecture notes are written for the course Statistical Machine Learning 1RT700, given at the Department
of Information Technology, Uppsala University, spring semester 2019. They will eventually be turned into
a textbook, and we are very interested in all types of comments from you, our dear reader. Please send your
comments to [email protected]. Everyone who contributes with many useful comments will get
a free copy of the book.
During the course, updated versions of these lecture notes will be released. The major changes are
noted below in the changelog:

Date Comments
2019-01-18 Initial version. Chapter 6 missing.
2019-01-23 Typos corrected, mainly in Chapter 2 and 5. Section 2.6.3 added.
2019-01-28 Typos corrected, mainly in Chapter 3.
2019-02-07 Chapter 6 added.
2019-03-04 Typos corrected.
2019-03-11 Typos (incl. eq. (3.25)) corrected.
2019-03-12 Typos corrected.

Contents

0.1 About these lecture notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1 Introduction 7
1.1 What is machine learning all about? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Regression and classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Overview of these lecture notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 The regression problem and linear regression 11


2.1 The regression problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 The linear regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Describe relationships — classical statistics . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Predicting future outputs — machine learning . . . . . . . . . . . . . . . . . . . 12
2.3 Learning the model from training data . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.2 Least squares and the normal equations . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Nonlinear transformations of the inputs – creating more features . . . . . . . . . . . . . 16
2.5 Qualitative input variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6.1 Ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.2 LASSO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.3 General cost function regularization . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.A Derivation of the normal equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.A.1 A calculus approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.A.2 A linear algebra approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 The classification problem and three parametric classifiers 25


3.1 The classification problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.1 Learning the logistic regression model from training data . . . . . . . . . . . . . 27
3.2.2 Decision boundaries for logistic regression . . . . . . . . . . . . . . . . . . . . 28
3.2.3 Logistic regression for more than two classes . . . . . . . . . . . . . . . . . . . 30
3.3 Linear and quadratic discriminant analysis (LDA & QDA) . . . . . . . . . . . . . . . . 31
3.3.1 Using Gaussian approximations in Bayes’ theorem . . . . . . . . . . . . . . . . 31
3.3.2 Using LDA and QDA in practice . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Bayes’ classifier — a theoretical justification for turning p(y | x) into ŷ . . . . . . . . . . 38
3.4.1 Bayes’ classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.2 Optimality of Bayes’ classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.3 Bayes’ classifier in practice: useless, but a source of inspiration . . . . . . . . . 39
3.4.4 Is it always good to predict according to Bayes’ classifier? . . . . . . . . . . . . 39
3.5 More on classification and classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.1 Linear and nonlinear classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.3 Evaluating binary classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


4 Non-parametric methods for regression and classification: k-NN and trees 43


4.1 k-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Decision boundaries for k-NN . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.2 Choosing k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Training a classification tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Other splitting criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.4 Regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5 How well does a method perform? 53


5.1 Expected new data error Enew : performance in production . . . . . . . . . . . . . . . . 53
5.2 Estimating Enew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Etrain ≉ Enew : We cannot estimate Enew from training data . . . . . . . . . . . . 55
5.2.2 Etest ≈ Enew : We can estimate Enew from test data . . . . . . . . . . . . . . . . 55
5.2.3 Cross-validation: Eval ≈ Enew without setting aside test data . . . . . . . . . . . 56
5.3 Understanding Enew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.1 Enew = Etrain + generalization error . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.2 Enew = bias2 + variance + irreducible error . . . . . . . . . . . . . . . . . . . . 62

6 Ensemble methods 67
6.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.1 Variance reduction by averaging . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.2 The bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.1 The conceptual idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.2 Binary classification, margins, and exponential loss . . . . . . . . . . . . . . . . 75
6.3.3 AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3.4 Boosting vs. bagging: base models and ensemble size . . . . . . . . . . . . . . 79
6.3.5 Robust loss functions and gradient boosting . . . . . . . . . . . . . . . . . . . . 80
6.A Classification loss functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

7 Neural networks and deep learning 83


7.1 Neural networks for regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.1 Generalized linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
7.1.2 Two-layer neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.1.3 Matrix notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
7.1.4 Deep neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.1.5 Learning the network from data . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.2 Neural networks for classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2.1 Learning classification networks from data . . . . . . . . . . . . . . . . . . . . 89
7.3 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
7.3.1 Data representation of an image . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.2 The convolutional layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.3.3 Condensing information with strides . . . . . . . . . . . . . . . . . . . . . . . . 91
7.3.4 Multiple channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.3.5 Full CNN architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.4 Training a neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.1 Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.4.2 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94


7.4.3 Learning rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


7.4.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.5 Perspective and further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

A Probability theory 101


A.1 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
A.1.1 Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
A.1.2 Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
A.2 Approximating an integral with a sum . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

B Unconstrained numerical optimization 105


B.1 A general iterative solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
B.2 Commonly used search directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
B.2.1 Steepest descent direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
B.2.2 Newton direction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
B.2.3 Quasi-Newton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
B.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Bibliography 111

1 Introduction

1.1 What is machine learning all about?


Machine learning gives computers the ability to learn without being explicitly programmed for the task at
hand. The learning happens when data is combined with mathematical models, for example by finding
suitable values of unknown variables in the model. The most basic example of learning could be that of
fitting a straight line to data, but machine learning usually deals with much more flexible models than
straight lines. The point of doing this is that the result can be used to draw conclusions about new data
that was not used in learning the model. If we learn a model from a data set of 1000 puppy images, the
model might — if it is wisely chosen — be able to tell whether another image (not among the 1000 used
for learning) depicts a puppy or not. That is known as generalization.

The science of machine learning is about learning models that generalize well.

These lecture notes are exclusively about supervised learning, which refers to the problem where
the data is of the form {xi , yi }ni=1 , where xi denotes inputs1 and yi denotes outputs2 . In other words,
in supervised learning we have labeled data in the sense that each data point has an input xi and an
output yi which explicitly explains ”what we see in the data”. For example, to check for signs of heart
disease, medical doctors make use of a so-called electrocardiogram (ECG), which is a test that measures
the electrical activity of the heart via electrodes placed on the skin of the patient’s chest, arms and legs.
Based on these readings, a skilled medical doctor can then make a diagnosis. In this example the ECG
measurements constitute the input x and the diagnosis provided by the medical doctor constitutes the
output y. If we have access to a large enough pool of labeled data of this kind (where we have both the
ECG reading x and the diagnosis y) we can use supervised machine learning to learn a model for the
relationship between x and y. Once the model is learned, it can be used to diagnose new ECG readings,
for which we do not (yet) know the diagnosis y. This is called a prediction, and we use ŷ to denote it. If
the model is making good predictions (close to the true y) also for ECGs not in the training data, we have
a model which generalizes well.
One of the most challenging problems with supervised learning is that it requires labeled data, i.e.
both the inputs and the corresponding outputs {xi , yi }ni=1 . This is challenging because the process of
labeling data is often expensive and sometimes also difficult or even impossible, since it requires humans
to interpret the input and provide the correct output. The situation is made even worse by the fact that
most of the state-of-the-art methods require a lot of data to perform well. This situation has motivated the
development of unsupervised learning methods, which only require the input data {xi }ni=1 , i.e. so-called
unlabeled data. An important subproblem is that of clustering, where data is automatically organized
into different groups based on some notion of similarity. There is also an increasingly important middle
ground referred to as semi-supervised learning, where we make use of both labeled and unlabeled data.
The reason is that we often have access to a lot of unlabeled data, but only a small amount of labeled
data. However, this small amount of labeled data might still prove highly valuable when used together
with the much larger set of unlabeled data.
In the area of reinforcement learning, another branch of machine learning, we do not only want to make
use of measured data in order to be able to predict something or understand a given situation, but instead
1. Some common synonyms used for the input variable include feature, predictor, regressor, covariate, explanatory variable, controlled variable and independent variable.
2. Some common synonyms used for the output variable include response, regressand, label, explained variable, predicted variable and dependent variable.


we want to develop a system that can learn how to take actions in the real world. The most common
approach is to learn these actions by trying to maximize some kind of reward encouraging the desired
state of the environment. The area of reinforcement learning has very strong ties to control theory. Finally,
we mention the emerging area of causal learning, where the aim is to tackle the much harder problem of
learning cause and effect relationships. This is very different from the other facets of machine learning
briefly introduced above, where it was sufficient to learn associations/correlations between the data. In
causal learning the aim is to move beyond learning correlations and instead to learn causal relations.

1.2 Regression and classification


A useful categorization of supervised machine learning algorithms is obtained by differentiating with
respect to the type—quantitative or qualitative—of the output variable involved in the problem. Let us first
have a look at when a variable in general is to be considered as quantitative or qualitative, respectively.
See Table 1.1 for a few examples.

Table 1.1: Examples of quantitative and qualitative variables.


Variable type                               | Example                                   | Handle as
Numeric (continuous)                        | 32.23 km/h, 12.50 km/h, 42.85 km/h        | Quantitative
Numeric (discrete) with natural ordering    | 0 children, 1 child, 2 children           | Quantitative
Numeric (discrete) without natural ordering | 1 = Sweden, 2 = Denmark, 3 = Norway       | Qualitative
Text (not numeric)                          | Uppsala University, KTH, Lund University  | Qualitative

Depending on whether the output of a problem is quantitative or qualitative, we refer to the problem as
either regression or classification.

Regression means the output is quantitative, and classification means the output is qualitative.

This means that whether a problem is about regression or classification depends only on its output. The
input can be either quantitative or qualitative in both cases.
The distinction between quantitative and qualitative, and thereby between regression and classification,
is however somewhat arbitrary, and there is not always a clear answer: one could for instance argue
that having no children is something qualitatively different than having children, and use the qualitative
output “children: yes/no”, instead of “0, 1 or 2 children”, and thereby turn a regression problem into a
classification problem, for example.


1.3 Overview of these lecture notes


The following sketch gives an idea of how the chapters are connected.

[Sketch of chapter dependencies. Boxes: Chapter 1: Introduction; Chapter 2: The regression problem and linear regression; Chapter 3: The classification problem and three parametric classifiers; Chapter 4: Non-parametric methods for regression and classification: k-NN and trees; Chapter 5: How well does a method perform?; Chapter 6: Ensemble methods; Chapter 7: Neural networks and deep learning. Arrows between the boxes are marked as either ‘needed’ or ‘recommended’.]

1.4 Further reading


There are by now quite a few extensive textbooks available on the topic of machine learning which
introduce the area in slightly different ways compared to what we do in this book. The book by Hastie,
Tibshirani, and Friedman (2009) introduces the area of statistical machine learning in a mathematically
solid and accessible manner. A few years later the authors released a new version of their book which is
mathematically significantly lighter (James et al. 2013). They still do a very nice job of conveying the
main ideas. These books do not venture long into the world of Bayesian methods. However, there are
several complementary books doing a good job of covering Bayesian methods as well, see e.g. Barber
(2012), Bishop (2006) and Murphy (2012). MacKay (2003) provided a rather early account drawing interesting
and useful connections to information theory. It is still very much worth looking into. Finally, we mention
the work of Efron and Hastie (2016), where the authors take a constructive historical approach to the
development of this new area, covering the revolution in data analysis that emerged with computers. A
contemporary introduction to the mathematics of machine learning is provided by Deisenroth, Faisal, and
Ong (2019). Two relatively recent papers introducing the area are Ghahramani (2015) and Jordan
and Mitchell (2015).
The scientific field of machine learning is extremely vibrant and active at the moment. The two leading
conferences within the area are the International Conference on Machine Learning (ICML) and the
Conference on Neural Information Processing Systems (NeurIPS). Both are held on a yearly basis, and
all the new research presented at these two conferences is freely available via their websites (icml.cc
and neurips.cc). Two additional conferences in the area are the International Conference on Artificial
Intelligence and Statistics (AISTATS) and the International Conference on Learning Representations
(ICLR). The leading journals in the area are the Journal of Machine Learning Research (JMLR) and the
IEEE Transactions on Pattern Analysis and Machine Intelligence. There is also quite a lot of relevant
work published within statistical journals, in particular within the area of computational statistics.

2 The regression problem and linear regression
The first problem we will study is the regression problem. Regression is one of the two main problems
that we cover in these notes. (The other one is classification). The first method we will encounter is linear
regression, which is one (of many) solutions to the regression problem. Despite the relative simplicity
of linear regression, it is surprisingly useful and it also constitutes an important building block for more
advanced methods (such as deep learning, Chapter 7).

2.1 The regression problem


Regression refers to the problem of learning the relationships between some (qualitative or quantitative1 )
input variables x = [x1 x2 . . . xp ]T and a quantitative output variable y. In mathematical terms, regression
is about learning a model f
y = f (x) + ε, (2.1)
where ε is a noise/error term which describes everything that cannot be captured by the model. With our
statistical perspective, we consider ε to be a random variable that is independent of x and has a mean
value of zero.
Throughout this chapter, we will use the dataset introduced in Example 2.1 with car stopping distances
to illustrate regression. In a sentence, the problem is to learn a regression model which can tell what
distance is needed for a car to come to a full stop, given its current speed.
Example 2.1: Car stopping distances

Ezekiel and Fox (1959) present a dataset with 62 observations of the distance needed for various cars at
different initial speeds to brake to a full stop.ᵃ The dataset has the following two variables:
- Speed: The speed of the car when the brake signal is given.
- Distance: The distance traveled after the signal is given until the car has reached a full stop.
We decide to interpret Speed as the input variable x, and Distance as the output variable y.

[Plot: the 62 data points, with Speed (mph) on the horizontal axis (0 to 40) and Distance (feet) on the vertical axis (0 to 150).]

Our goal is to use linear regression to estimate (that is, to predict) how long the stopping distance would
be if the initial speed were 33 mph or 45 mph (two speeds at which no data has been recorded).
a. The dataset is somewhat dated, so the conclusions are perhaps not applicable to modern cars. We believe, however, that the reader is capable of pretending that the data comes from her/his own favorite example instead.

1. We will start with quantitative input variables, and discuss qualitative input variables later in 2.5.


2.2 The linear regression model


The linear regression model describes the output variable y (a scalar) as an affine combination of the input
variables x1 , x2 , . . . , xp (each a scalar) plus a noise term ε,

y = β0 + β1 x1 + β2 x2 + · · · + βp xp + ε = f(x; β) + ε.    (2.2)

We refer to the coefficients β0 , β1 , . . . βp as the parameters in the model, and we sometimes refer to β0
specifically as the intercept term. The noise term ε accounts for non-systematic, i.e., random, errors
between the data and the model. The noise is assumed to have mean zero and to be independent of x.
Machine learning is about training, or learning, models from data. Hence, the main part of this chapter will
be devoted to how to learn the parameters β0 , β1 , . . . , βp from some training dataset T = {(xi , yi )}ni=1 .
Before we dig into the details in Section 2.3, let us just briefly start by discussing the purpose of using
linear regression. The linear regression model can namely be used for, at least, two different purposes:
to describe relationships in the data by interpreting the parameters β = [β0 β1 . . . βp ]T , and to predict
future outputs for inputs that we have not yet seen.

Remark 2.1 It is possible to formulate the model also for multiple outputs y1 , y2 , . . . , see the exercises.
This is commonly referred to as multivariate linear regression.

2.2.1 Describe relationships — classical statistics


An often posed question in sciences such as medicine and sociology, is to determine whether there is a
correlation between some variables or not (‘do you live longer if you only eat sea food?’, etc.). Such
questions can be addressed by studying the parameters β in the linear regression model, after the parameter
has been learned from data. The most common question is perhaps whether it can be indicated that some
correlation between two variables x1 and y exists, which can be done with the following reasoning: If
β1 = 0, it would indicate that there is no correlation between y and x1 (unless the other inputs also depend
on x1 ). By estimating β1 together with a confidence interval (describing the uncertainty of the estimate),
one can rule out (with a certain significance level) that x1 and y are uncorrelated if 0 is not contained in
the confidence interval for β1 . The conclusion is then instead that some correlation is likely to be present
between x1 and y. This type of reasoning is referred to as hypothesis testing and it constitutes an important
branch of classical statistics. However, we shall mainly be concerned with another purpose of the linear
regression model, namely to make predictions.

2.2.2 Predicting future outputs — machine learning


In machine learning, the emphasis is rather on predicting some (not yet seen) output ŷ⋆ for some new
input x⋆ = [x⋆1 x⋆2 . . . x⋆p]^T. To make a prediction for a test input x⋆, we insert it into the model (2.2).
Since ε (by assumption) has mean value zero, we take the prediction as

ŷ⋆ = β0 + β1 x⋆1 + β2 x⋆2 + · · · + βp x⋆p.    (2.3)

We use the hat symbol on ŷ⋆ to indicate that it is a prediction, our best guess. If we were able to somehow
observe the actual output from x⋆, we would denote it by y⋆ (without a hat).


[Figure 2.1 here: three data points, the learned line, the errors ε, and the prediction at a test input.]

Figure 2.1: Linear regression with p = 1: The black dots represent n = 3 data points, from which a linear regression
model (blue line) is learned. The model does not fit the data perfectly, but there is a remaining error/noise ε (green).
The model can be used to predict (red cross) the output ŷ⋆ for a test input point x⋆.

2.3 Learning the model from training data


To use the linear regression model, we first need to learn the unknown parameters β0 , β1 , . . . , βp from
a training dataset T . The training data consists of n samples of the output variable y, we call them yi
(i = 1, . . . , n), and the corresponding n samples xi (i = 1, . . . , n) (each a column vector). We write the
dataset in the matrix form

X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
\text{where each } x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}.    (2.4)

Note that X is an n × (p + 1) matrix, and y an n-dimensional vector. The first column of X, with only ones,
corresponds to the intercept term β0 in the linear regression model (2.2). If we also stack the unknown
parameters β0 , β1 , . . . , βp into a (p + 1) vector

β = \begin{bmatrix} β_0 \\ β_1 \\ \vdots \\ β_p \end{bmatrix},    (2.5)
βp

we can express the linear regression model as a matrix multiplication

y = Xβ + ε,    (2.6)

where ε is a vector of errors/noise.


Learning the unknown parameters β amounts to finding values such that the model fits the data well.
There are multiple ways to define what ‘well’ actually means. We will take a statistical perspective and
choose the value of β which makes the observed training data y as likely as possible under the model—the
so-called maximum likelihood solution.


Example 2.2: Car stopping distances

We will continue Example 2.1, and form the matrices X and y. Since we only have one input and one output,
both xi and yi are scalar. We get,

X = \begin{bmatrix} 1 & 4 \\ 1 & 5 \\ 1 & 5 \\ 1 & 5 \\ 1 & 5 \\ 1 & 7 \\ 1 & 7 \\ 1 & 8 \\ \vdots & \vdots \\ 1 & 39 \\ 1 & 39 \\ 1 & 40 \end{bmatrix}, \quad
β = \begin{bmatrix} β_0 \\ β_1 \end{bmatrix}, \quad
y = \begin{bmatrix} 4 \\ 2 \\ 4 \\ 8 \\ 8 \\ 7 \\ 7 \\ 8 \\ \vdots \\ 138 \\ 110 \\ 134 \end{bmatrix}.    (2.7)

2.3.1 Maximum likelihood


Our strategy to learn the unknown parameters β from the training data T will be the maximum likelihood
method. The word ‘likelihood’ refers to the statistical concept of the likelihood function, and maximizing
the likelihood function amounts to finding the value of β that makes observing y as likely as possible.
That is, we want to solve

maximize_{β}  p(y | X, β),    (2.8)

where p(y | X, β) is the probability density of the data y given a certain value of the parameters β. We
denote the solution to this problem—the learned parameters—with β̂ = [β̂0 β̂1 · · · β̂p]^T. More compactly,
we write this as

β̂ = arg max_{β} p(y | X, β).    (2.9)

In order to have a notion of what ‘likely’ means, and thereby specify p(y | X, β) mathematically, we
need to make assumptions about the noise term ε. A common assumption is that ε follows a Gaussian
distribution with zero mean and variance σε2 ,

ε ∼ N(0, σε²).    (2.10)

This implies that the conditional probability density function of the output y for a given value of the input
x is given by

p(y | x, β) = N(y | β0 + β1 x1 + · · · + βp xp, σε²).    (2.11)

Furthermore, the n observed training data points are assumed to be independent realizations from this
statistical model. This implies that the likelihood of the training data factorizes as

p(y | X, β) = ∏_{i=1}^{n} p(yi | xi, β).    (2.12)

Putting (2.11) and (2.12) together we get

p(y | X, β) = (1 / (2πσε²)^{n/2}) exp(−(1/(2σε²)) ∑_{i=1}^{n} (β0 + β1 xi1 + · · · + βp xip − yi)²).    (2.13)


Recall from (2.8) that we want to maximize the likelihood w.r.t. β. However, since (2.13) only depends
on β via the sum in the exponent, and since the exponential is a monotonically increasing function,
maximizing (2.13) is equivalent to minimizing

∑_{i=1}^{n} (β0 + β1 xi1 + · · · + βp xip − yi)².    (2.14)

This is the sum of the squares of differences between each output data yi and the model’s prediction of
that output, ŷi = β0 + β1 xi1 + · · · + βp xip. For this reason, minimizing (2.14) is usually referred to as
least squares.
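In more detail, taking the logarithm of (2.13) gives

log p(y | X, β) = −(n/2) log(2πσε²) − (1/(2σε²)) ∑_{i=1}^{n} (β0 + β1 xi1 + · · · + βp xip − yi)²,

where the first term does not depend on β and the factor 1/(2σε²) is positive; maximizing the log-likelihood over β is therefore exactly the same as minimizing the sum (2.14).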
We will come back to how the values β̂0, β̂1, . . . , β̂p can be computed. Let us just first mention that
it is also possible—and sometimes a very good idea—to assume that the distribution of ε is something
other than a Gaussian distribution. One can, for instance, assume that ε instead has a Laplace distribution,
which would yield the cost function

∑_{i=1}^{n} |β0 + β1 xi1 + · · · + βp xip − yi|.    (2.15)

It contains the sum of the absolute values of all differences (rather than their squares). The major benefit
of the Gaussian assumption (2.10) is that there is a closed-form solution available for β̂0, β̂1, . . . , β̂p,
whereas other assumptions on ε usually require computationally more expensive methods.

Remark 2.2 With the terminology we will introduce in the next chapter, we could refer to (2.13) as the
likelihood function, which we will denote by ℓ(β).

Remark 2.3 It is not uncommon in the literature to skip the maximum likelihood motivation, and just
state (2.14) as a (somewhat arbitrary) cost function for optimization.

2.3.2 Least squares and the normal equations


By assuming that the noise/error ε has a Gaussian distribution as stated in (2.10), the maximum likelihood
parameters β̂ are the solution to the optimization problem (2.14). We illustrate this by Figure 2.2, and
write the least squares problem using the compact matrix and vector notation (2.6) as

minimize_{β0, β1, ..., βp}  ‖Xβ − y‖₂²,    (2.16)

where ‖ · ‖₂ denotes the usual Euclidean vector norm, and ‖ · ‖₂² its square. From a linear algebra point
of view, this can be seen as the problem of finding the closest (in a Euclidean sense) vector to y in the
subspace of Rⁿ spanned by the columns of X. The solution to this problem is the orthogonal projection of
y onto this subspace, and the corresponding β̂ can be shown (Section 2.A) to fulfill

X^T X β̂ = X^T y.    (2.17)

Equation (2.17) is often referred to as the normal equations, and gives the solution to the least squares
problem (2.14, 2.16). If X^T X is invertible, which often is the case, β̂ has the closed form

β̂ = (X^T X)^{-1} X^T y.    (2.18)

The fact that this closed-form solution exists is important, and is perhaps the reason why least squares has
become very popular and is widely used. As discussed, other assumptions on ε than Gaussianity lead to
other problems than least squares, such as (2.15) (where no closed-form solution exists).
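To make (2.17)–(2.18) concrete, here is a minimal Python/NumPy sketch; the synthetic data and the variable names are our own illustration, not part of the car stopping distance example:

import numpy as np

# Synthetic training data: n = 20 points from y = 1 + 2*x + noise (made-up example).
rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

# Build the n x (p+1) matrix X in (2.4): a column of ones plus the input.
X = np.column_stack([np.ones(n), x])

# Solve the normal equations (2.17); solving the linear system is numerically
# preferable to forming the inverse in (2.18) explicitly, but gives the same beta_hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Predict at a new test input, as in (2.3).
x_star = np.array([1.0, 5.0])        # [1, x_star]
y_star_hat = x_star @ beta_hat
print(beta_hat, y_star_hat)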

Time to reflect 2.1: What does it mean in practice that XT X is not invertible?


[Figure 2.2 here: data points, the fitted model, the errors ε, and the squares of the errors.]

Figure 2.2: A graphical explanation of the least squares criterion: the goal is to choose the model (blue line) such
that the sum of the square (orange) of each error ε (green) is minimized. That is, the blue line is to be chosen so that
the amount of orange color is minimized. This motivates the name least squares.

Time to reflect 2.2: If the columns of X are linearly independent and p = n − 1, X spans the
entire Rⁿ. That means a unique solution exists such that y = Xβ exactly, i.e., the model fits the
training data perfectly. If that is the case, (2.17) reduces to β = X⁻¹y. Why is that not desired?

Example 2.3: Car stopping distances

By inserting the matrices (2.7) from Example 2.2 into the normal equations (2.17), we obtain β̂0 = −20.1
and β̂1 = 3.1. If we plot the resulting model, it looks like this:

[Plot: the learned model (line), the data and the two predictions, with Speed (mph) on the horizontal axis and Distance (feet) on the vertical axis.]

With this model, the predicted stopping distance for x⋆ = 33 mph is ŷ⋆ = 84 feet, and for x⋆ = 45 mph it is ŷ⋆ = 121 feet.

2.4 Nonlinear transformations of the inputs – creating more features

The reason for the word ‘linear’ in the name ‘linear regression’ is that the output is modelled as a linear
combination of the inputs.2 We have, however, not made a clear definition of what an input is: if the speed
is an input, then why could not also the kinetic energy—its square—be considered as another input? The
answer is yes, it can. We can in fact make use of arbitrary nonlinear transformations of the “original”
input variables as inputs in the linear regression model. If we, for example, only have a one-dimensional

2. And also the constant 1, corresponding to the offset β0. For this reason, affine would perhaps be a better term than linear.


[Figure 2.3 here: two panels, each showing output y against input x with the data and the learned model.]

(a) The maximum likelihood solution with a 2nd order polynomial in the linear regression model. As discussed, the line is no longer straight (cf. Figure 2.1). This is, however, merely an artefact of the plot: in a three-dimensional plot with each feature (here, x and x2) on a separate axis, it would still be an affine set.

(b) The maximum likelihood solution with a 4th order polynomial in the linear regression model. Note that a 4th order polynomial contains 5 unknown coefficients, which roughly means that we can expect the learned model to fit 5 data points exactly (cf. Remark 2.2, p = n − 1).

Figure 2.3: A linear regression model with 2nd and 4th order polynomials in the input x, as shown in (2.20).

input x, the vanilla linear regression model is

y = β0 + β1 x + ε. (2.19)

However, we can also extend the model with, for instance, x2 , x3 , . . . , xp as inputs, and thus obtain a
linear regression model which is a polynomial in x,

y = β0 + β1 x + β2 x2 + · · · + βp xp + ε. (2.20)

Note that this is still a linear regression model since the unknown parameters appear in a linear fashion
with x, x2 , . . . , xp as new inputs. The parameters β̂ are still learned the same way, but the matrix X
is different for models (2.19) and (2.20). We will refer to the transformed inputs as features. In more
complicated settings the distinction between the original input and the transformed features might not be
as clear, and the terms feature and input can sometimes be used interchangeably.
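As a small illustration of how the transformed inputs enter the computations, the following sketch (with made-up data and our own function name) builds the matrix X for the polynomial model (2.20) and learns β̂ with least squares:

import numpy as np

def polynomial_features(x, degree):
    # Build the n x (degree+1) matrix with columns 1, x, x^2, ..., x^degree.
    return np.column_stack([x ** d for d in range(degree + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])
X = polynomial_features(x, degree=2)                # features 1, x, x^2, cf. (2.20)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares, cf. (2.16)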

Time to reflect 2.3: Figure 2.3 shows an example of two linear regression models with transformed
(polynomial) inputs. When studying the figure one may ask how a linear regression model can
result in a curved line? Are linear regression models not restricted to linear (or affine) straight
lines? The answer is that it depends on the plot: Figure 2.3(a) shows a two-dimensional plot with
x, y (the ‘original’ input), but a three-dimensional plot with x, x2 , y (each feature on a separate
axis) would still be affine. The same holds true also for Figure 2.3(b) but in that case we would
need a 5-dimensional plot.

Even though the model in Figure 2.3(b) is able to fit all data points exactly, it also suggests that higher
order polynomials might not always be very useful: the behavior of the model in-between and outside the
data points is rather peculiar, and not very well motivated by the data. Higher-order polynomials are for
this reason rarely used in practice in machine learning. An alternative and much more common feature is
the so-called radial basis function (RBF) kernel

Kc(x) = exp(−‖x − c‖₂² / ℓ),    (2.21)

i.e., a Gauss bell centered around c. It can be used, instead of polynomials, in the linear regression model
as
y = β0 + β1 Kc1 (x) + β2 Kc2 (x) + · · · + βp Kcp (x) + ε. (2.22)


This model can be seen as p ‘bumps’ located at c1, c2, . . . , cp, respectively. Note that the locations
c1, c2, . . . , cp as well as the length scale ℓ have to be decided by the user, and only the parameters
β0, β1, . . . , βp are learned from data in linear regression. This is illustrated in Figure 2.4. RBF kernels are
in general preferred over polynomials since they have ‘local’ properties, meaning that a small change in
one parameter mostly affects the model only locally around that kernel, whereas a small change in one
parameter in a polynomial model affects the model everywhere.
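A corresponding sketch for the RBF model (2.22), again with made-up data; the centres and length scale below are user choices, as discussed above:

import numpy as np

def rbf_features(x, centres, length_scale):
    # Build the matrix [1, K_c1(x), ..., K_cp(x)] using the RBF kernel (2.21).
    kernels = [np.exp(-(x - c) ** 2 / length_scale) for c in centres]
    return np.column_stack([np.ones_like(x)] + kernels)

x = np.linspace(-8, 8, 9)                 # n = 9 training inputs
y = np.sin(0.5 * x)                       # toy outputs (assumed example)
centres = np.linspace(-8, 8, 4)           # c1, ..., c4 chosen by the user
X = rbf_features(x, centres, length_scale=4.0)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)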
Example 2.4: Car stopping distances

We continue with Example 2.1, but this time we also add the squared speed as a feature, i.e., the features are
now x and x2 . This gives the new matrices (cf. (2.7))

X = \begin{bmatrix} 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 5 & 25 \\ \vdots & \vdots & \vdots \\ 1 & 39 & 1521 \\ 1 & 40 & 1600 \end{bmatrix}, \quad
β = \begin{bmatrix} β_0 \\ β_1 \\ β_2 \end{bmatrix}, \quad
y = \begin{bmatrix} 4 \\ 2 \\ 4 \\ \vdots \\ 110 \\ 134 \end{bmatrix},    (2.23)

and when we insert them into the normal equations (2.17), the new parameter estimates are β̂0 = 1.58,
β̂1 = 0.42 and β̂2 = 0.07. (Note that β̂0 and β̂1 change, compared to Example 2.3.) This new model looks
like

[Plot: the learned second-order model (curve), the data and the two predictions, with Speed (mph) on the horizontal axis and Distance (feet) on the vertical axis.]

With this model, the predicted stopping distance is now ŷ⋆ = 87 feet for x⋆ = 33 mph, and ŷ⋆ = 153 feet for
x⋆ = 45 mph. This can be compared to Example 2.3, which gives different predictions. Based on the data
alone we cannot say that this is the “true model”, but by visually comparing this model with Example 2.3,
this model with more features seems to follow the data slightly better. A systematic method to select between
different features (other than just visually comparing plots) is cross-validation, see Chapter 5.

[Figure 2.4 here: the four weighted RBF kernels and their sum plotted against the input x, with the kernel centres c1, c2, c3, c4 marked on the axis.]

Figure 2.4: A linear regression model using RBF kernels (2.22) as features. Each kernel (dashed gray lines) is
located at c1, c2, c3 and c4, respectively. When the model is learned from data, the parameters β0, β1, . . . , βp are
chosen such that the sum of all kernels (solid blue line) is fitted to the data in, e.g., a least squares sense.

Polynomials and RBF kernels are just two special cases, but we can of course consider any nonlinear
transformation of the inputs. To distinguish the ‘original’ inputs from the ‘new’ transformed inputs, the
term features is often used for the latter. To decide which features to use one approach is to compare


competing models (with different features) using cross-validation; see Chapter 5.

2.5 Qualitative input variables


The regression problem is characterized by a quantitative output3 y, but the nature of the inputs x is
arbitrary. We have so far only discussed the case of quantitative inputs x, but qualitative inputs are perfectly
possible as well.
Assume that we have a qualitative input variable that only takes two different values (or levels or classes),
which we call type A and type B. We can then create a dummy variable x as
x = \begin{cases} 0 & \text{if type A} \\ 1 & \text{if type B} \end{cases}    (2.24)

and use this variable in the linear regression model. This effectively gives us a linear regression model
which looks like
y = β0 + β1 x + ε = \begin{cases} β_0 + ε & \text{if type A} \\ β_0 + β_1 + ε & \text{if type B} \end{cases}    (2.25)

The choice is somewhat arbitrary, and type A and B can of course be switched. Other choices, such as
x = 1 or x = −1, are also possible. This approach can be generalized to qualitative input variables which
take more than two values, let us say type A, B, C and D. With four different values, we create 3 = 4 − 1
dummy variables as
x_1 = \begin{cases} 1 & \text{if type B} \\ 0 & \text{if not type B} \end{cases}, \quad
x_2 = \begin{cases} 1 & \text{if type C} \\ 0 & \text{if not type C} \end{cases}, \quad
x_3 = \begin{cases} 1 & \text{if type D} \\ 0 & \text{if not type D} \end{cases}    (2.26)

which, altogether, gives the linear regression model

y = β0 + β1 x1 + β2 x2 + β3 x3 + ε = \begin{cases} β_0 + ε & \text{if type A} \\ β_0 + β_1 + ε & \text{if type B} \\ β_0 + β_2 + ε & \text{if type C} \\ β_0 + β_3 + ε & \text{if type D} \end{cases}    (2.27)

Qualitative inputs can be handled similarly in other problems and methods as well, such as logistic
regression, k-NN, deep learning, etc.
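As a minimal sketch of how the dummy variables (2.26) can be constructed in code (the function and the data below are our own illustration), with type A as the reference level that gets all zeros:

import numpy as np

def dummy_encode(values, levels):
    # One dummy column per level except the first (reference) level, cf. (2.26).
    return np.column_stack([(values == level).astype(float) for level in levels[1:]])

values = np.array(["A", "B", "D", "C", "A"])
levels = ["A", "B", "C", "D"]
X_dummy = dummy_encode(values, levels)
# X_dummy has three columns x1, x2, x3; the row for "A" is [0, 0, 0],
# for "B" it is [1, 0, 0], for "C" [0, 1, 0], and for "D" [0, 0, 1].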

2.6 Regularization
Even though the linear regression model at first glance (cf. Figure 2.1) may seem like a fairly rigid and
non-flexible model, it is not necessarily so. If more features are obtained by extending the model with
nonlinear transformations as in Figures 2.3 or 2.4, or if the number of inputs p is large and the number of
data points n is small, one may experience overfitting. If considering data as consisting of ‘signal’ (the
actual information) and ‘noise’ (measurement errors, irrelevant effects, etc.), the term overfitting indicates
that the model is fitted not only to the ‘signal’ but also to the ‘noise’. An example of overfitting is given in
Example 2.5, where a linear regression model with p = 8 RBF kernels is learned from n = 9 data points.
Even though the model follows all data points very well, we can intuitively judge that the model is not
particularly useful: neither the interpolation (between the data points) nor the extrapolation (outside the
data range) appears sensible. Note that using p = n − 1 is an extreme case, but the conceptual problem
3. If the output variable is qualitative, then we have a classification—and not a regression—problem.


with overfitting is often present also in less extreme situations. Overfitting will be thoroughly discussed
later in Chapter 5.
A useful approach to handle overfitting is regularization. Regularization can be motivated by ‘keeping
the parameters β small unless the data really convinces us otherwise’, or alternatively ‘if a model with
small values of the parameters β fits the data almost as well as a model with larger parameter values,
the one with small parameter values should be preferred’. There are several ways to implement this
mathematically, which leads to slightly different solutions. We will focus on ridge regression and
LASSO.
For linear regression, another motivation to use regularization is also when X^T X is not invertible,
meaning (2.16) has no unique solution β̂. In such cases, regularization can be introduced in order to make
X^T X invertible and give (2.16) a unique solution. However, the concept of regularization extends well
beyond linear regression and can be used also when working with other types of problems and models.
For example, regularization-like methods are key to obtaining good performance in deep learning, as we will
discuss in Section 7.4.

2.6.1 Ridge regression

In ridge regression (also known as Tikhonov regularization, L2 regularization, or weight decay) the least
squares criterion (2.16) is replaced with the modified minimization problem

minimize_{β0, β1, ..., βp}  ‖Xβ − y‖₂² + γ‖β‖₂².    (2.28)

The value γ ≥ 0 is referred to as the regularization parameter and has to be chosen by the user. For γ = 0 we
recover the original least squares problem (2.16), whereas if we let γ → ∞ we will force all parameters
βj to approach 0. A good choice of γ is in most cases somewhere in between, and depends on the actual
problem. It can either be found by manual tuning, or in a more systematic fashion using cross-validation.
It is actually possible to derive a version of the normal equations (2.17) for (2.28), namely

(X^T X + γI_{p+1}) β̂ = X^T y,    (2.29)

where I_{p+1} is the identity matrix of size (p + 1) × (p + 1). If γ > 0, the matrix X^T X + γI_{p+1} is always
invertible, and we have the closed form solution

β̂ = (X^T X + γI_{p+1})^{-1} X^T y.    (2.30)
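Translated directly into code, (2.30) becomes a one-line solve. The sketch below is a minimal illustration and ignores the common practice of leaving the intercept β0 unregularized:

import numpy as np

def ridge_regression(X, y, gamma):
    # Closed-form ridge solution (2.30): beta_hat = (X^T X + gamma*I)^{-1} X^T y.
    p_plus_1 = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(p_plus_1), X.T @ y)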

2.6.2 LASSO

With LASSO (an abbreviation for Least Absolute Shrinkage and Selection Operator), or equivalently L1
regularization, the least squares criterion (2.16) is replaced with

minimize_{β0, β1, ..., βp}  ‖Xβ − y‖₂² + γ‖β‖₁,    (2.31)

where ‖ · ‖₁ is the Manhattan norm. Contrary to ridge regression, there is no closed-form solution available
for (2.31). It is, however, a convex problem which can be solved efficiently by numerical optimization.
As for ridge regression, the regularization parameter γ has to be chosen by the user also in LASSO:
γ = 0 gives the least squares problem and γ → ∞ gives β = 0. Between these extremes, however, LASSO
and ridge regression will result in different solutions: whereas ridge regression pushes all parameters
β0 , β1 , . . . , βp towards small values, LASSO tends to favor so-called sparse solutions where only a few
of the parameters are non-zero, and the rest are exactly zero. Thus, the LASSO solution can effectively
‘switch some of the inputs off’ by setting the corresponding parameters to zero and it can therefore be used
as an input (or feature) selection method.
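As one possible illustration of such a numerical solver (not the only, or necessarily the best, choice), the following sketch minimizes (2.31) with proximal gradient descent; each iteration takes a gradient step on the squared-error term followed by soft-thresholding, which is what produces the exact zeros:

import numpy as np

def lasso_proximal_gradient(X, y, gamma, n_iter=5000):
    # Minimize ||X beta - y||_2^2 + gamma * ||beta||_1, cf. (2.31).
    beta = np.zeros(X.shape[1])
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # step size from the Lipschitz constant
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y)            # gradient of the data-fit term
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * gamma, 0.0)  # soft-thresholding
    return beta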


Example 2.5: Regularization in a linear regression RBF model

We consider the problem of learning a linear regression model (blue line) with p = 8 radial basis function (RBF) kernels as features from n = 9 data points (black dots). Since we have p = n − 1, we can expect the model to fit the data perfectly. However, as we see in (a) below, the model overfits, meaning that the model adapts too much to the data and has a ‘strange’ behavior between the data points. As a remedy to this, we can use ridge regression (b) or LASSO (c). Even though the final models with ridge regression and LASSO look rather similar, their parameters β̂ are different: the LASSO solution effectively only makes use of 5 (out of 8) radial basis functions. This is referred to as a sparse solution. Which approach should be preferred depends, of course, on the specific problem.

[Three panels (a) to (c), each showing the n = 9 data points and the learned model for inputs x from −8 to 8.]

(a) The model learned with least squares (2.16). Even though the model follows the data exactly, we should typically not be satisfied with this model: neither the behavior between the data points nor outside the range is plausible, but is only an effect of overfitting, in that the model is adapted ‘too well’ to the data. The parameter values β̂ are around 30 and −30.

(b) The same model, this time learned with ridge regression (2.28) with a certain value of γ. Despite not being perfectly adapted to the training data, this model appears to give a more sensible trade-off between fitting the data and avoiding overfitting than (a), and is probably more useful in most situations. The parameter values β̂ are now roughly evenly distributed in the range from −0.5 to 0.5.

(c) The same model again, this time learned with LASSO (2.31) with a certain value of γ. Again, this model is not perfectly adapted to the training data, but appears to have a more sensible trade-off between fitting the data and avoiding overfitting than (a), and is probably also more useful than (a) in most situations. In contrast to (b), however, 3 (out of 9) parameters in this model are exactly 0, and the rest are in the range from −1 to 1.

2.6.3 General cost function regularization

Ridge regression and LASSO are two popular special cases of regularization for linear regression. They
both have in common that they modify the cost function, or optimization objective, of (2.16). They can be
seen as two instances of a more general regularization scheme

minimize_β  V(β, X, y) + γ R(β),    (2.32)

where V(β, X, y) measures the data fit and R(β) is a penalty on model flexibility.

Note that (2.32) contains three important elements: (i) one term which describes how well the model fits
to data, (ii) one term which penalizes model complexity (large parameter values), and (iii) a trade-off
parameter γ between them.


2.7 Further reading

Linear regression has now been used for well over 200 years. It was first introduced independently by
Adrien-Marie Legendre in 1805 and Carl Friedrich Gauss in 1809 when they discovered the method of least
squares. The topic of linear regression is, due to its importance, described in many textbooks in statistics
and machine learning, such as Bishop (2006), Gelman et al. (2013), Hastie, Tibshirani, and Friedman
(2009), and Murphy (2012). While the basic least squares technique has been around for a long time, its
regularized versions are much younger. Ridge regression was introduced independently in statistics by
Hoerl and Kennard (1970) and in numerical analysis under the name of Tikhonov regularization. The
LASSO was first introduced by Tibshirani (1996). The recent monograph by Hastie, Tibshirani, and
Wainwright (2015) covers the development related to the use of sparse models and the LASSO.

2.A Derivation of the normal equations

The normal equations (2.17)

X^T X β̂ = X^T y

can be derived from (2.16)

β̂ = argmin_β ‖Xβ − y‖₂²,

in different ways. We will present one based on (matrix) calculus and one based on geometry and linear
algebra.
No matter how (2.17) is derived, if X^T X is invertible, it (uniquely) gives

β̂ = (X^T X)^{-1} X^T y.

If X^T X is not invertible, then (2.17) has infinitely many solutions β̂, which all are equally good solutions
to the problem (2.16).

2.A.1 A calculus approach

Let

V(β) = ‖Xβ − y‖₂² = (Xβ − y)^T (Xβ − y) = y^T y − 2y^T Xβ + β^T X^T Xβ,    (2.33)

and differentiate V(β) with respect to the vector β,

∂/∂β V(β) = −2X^T y + 2X^T Xβ.    (2.34)

Since V(β) is a positive quadratic form, its minimum must be attained at ∂/∂β V(β) = 0, which characterizes
the solution β̂ as

∂/∂β V(β̂) = 0  ⇔  −2X^T y + 2X^T Xβ̂ = 0  ⇔  X^T Xβ̂ = X^T y,    (2.35)

i.e., the normal equations.


2.A.2 A linear algebra approach


Denote the p + 1 columns of X as c_j, j = 1, . . . , p + 1. We first show that ‖Xβ − y‖₂² is minimized if β
is chosen such that Xβ is the orthogonal projection of y onto the (sub)space spanned by the columns c_j
of X, and then show that the orthogonal projection is found by the normal equations.
Let us decompose y as y⊥ + y∥, where y⊥ is orthogonal to the (sub)space spanned by all columns c_i,
and y∥ is in the (sub)space spanned by all columns c_i. Since y⊥ is orthogonal to both y∥ and Xβ, it
follows that

‖Xβ − y‖₂² = ‖Xβ − (y⊥ + y∥)‖₂² = ‖(Xβ − y∥) − y⊥‖₂² ≥ ‖y⊥‖₂²,    (2.36)

and the triangle inequality also gives us

‖Xβ − y‖₂² = ‖Xβ − y⊥ − y∥‖₂² ≤ ‖y⊥‖₂² + ‖Xβ − y∥‖₂².    (2.37)

This implies that if we choose β such that Xβ = y∥, the criterion ‖Xβ − y‖₂² must have reached its
minimum. Thus, our solution β̂ must be such that Xβ̂ − y is orthogonal to the (sub)space spanned by all
columns c_i, i.e.,

(y − Xβ̂)^T c_j = 0,  j = 1, . . . , p + 1    (2.38)

(remember that two vectors u, v are, by definition, orthogonal if their scalar product, u^T v, is 0). Since the
columns c_j together form the matrix X, we can write this compactly as

(y − Xβ̂)^T X = 0,    (2.39)

where the right hand side is the (p + 1)-dimensional zero vector. This can equivalently be written as

X^T Xβ̂ = X^T y,

i.e., the normal equations.

3 The classification problem and three parametric classifiers
We will now study the classification problem. Whereas the regression problem has quantitative outputs,
classification is the situation with qualitative outputs. A method that performs classification is referred to
as a classifier. Our first classifier will be logistic regression, and we will in this chapter also introduce
the linear and quadratic discriminant analysis classifiers (LDA and QDA, respectively). More advanced
classifiers, such as classification trees, boosting and deep learning, will be introduced in the later chapters.

3.1 The classification problem


Classification is about predicting a qualitative output from p inputs of arbitrary types (quantitative and/or
qualitative). Since the output is qualitative, it can only take values from a finite set. We use K to denote
the number of elements in the set of possible output values. The set of possible output values can, for
instance, be {false, true} (K = 2) or {Sweden, Norway, Finland, Denmark} (K = 4). We will refer to
these elements as classes or labels. The number of classes K is assumed to be known throughout this
text. To prepare for a concise mathematical notation, we generically use integers 1, 2, . . . , K to denote the
output classes. The integer labeling of the classes is arbitrary, and we use it only for notational convenience.
The use of integers does not mean there is any inherent ordering of the classes.
When there are only K = 2 classes, we have the important special case of binary classification. In
binary classification, we often use the labels1 0 and 1 (instead of 1 and 2). Occasionally, we will also use
the terms positive (class k = 1) and negative (class k = 0) as well. The reason for using different choices
for the two labels in binary classification is purely for mathematical convenience.
Classification amounts to predicting the output from the input. In our statistical approach, we understand
classification as the problem of predicting class probabilities

p(y | x), (3.1)

where y is the output (1, 2, . . . , or K) and x is the input. Note that we use p(y | x) to denote probability
masses (y qualitative) as well as probability densities (y quantitative). In words, p(y | x) describes the
probability for the output y (a class label) given that we know the input x. This probability will be a
cornerstone from now on, so we will first spend some effort to understand it well. Talking about p(y | x)
implies that we think about the class label y as a random variable. Why? Because we choose to model
the real world, from where the data originates, as involving a certain amount of randomness (cf. ε in
regression). Let us illustrate with an example:

1. In Chapter 6 we will use k = 1 and k = −1 instead.


Example 3.1: Modeling voting behavior—randomness in the class label y

If we are to describe voting preferences (= y, the qualitative output) among different population
groups (= x, the input), we have to face that all people in a certain population group will not
vote for the same political party. To describe this mathematically, we can think of y as a random
variable which follows a certain probability distribution. If we knew that the vote count in the
group of 45 year old women (= x) is 13% for the cerise party, 39% for the turquoise party and
48% for the purple party, we could describe it as

p(y = cerise party | 45 year old women) = 0.13,
p(y = turquoise party | 45 year old women) = 0.39,
p(y = purple party | 45 year old women) = 0.48.

In this way, we use probabilities p(y | x) to describe the non-trivial fact that

(a) not all 45 year old women vote for the same party, but

(b) the choice of party does not appear to be completely random among 45 year old women
either; the purple party is the most popular, and the cerise party is the least popular.

The number of output classes in this example is K = 3.

3.2 Logistic regression


For binary classification (y is either 0 or 1), learning a classifier amounts to learning a model of p(y = 1 | x)
and p(y = 0 | x). Since, by the laws of probability, we have that p(y = 1 | x) + p(y = 0 | x) = 1, it is
sufficient to learn a model for only one of the class probabilities, say p(y = 1 | x), from which p(y = 0 | x)
will follow.
From the previous chapter, we have the linear regression model. If we slightly change the notation from
Chapter 2 and let x = [1 x1 x2 . . . xp ]T (i.e., include 1 in the first position), we can write
z = β0 + β1 x1 + β2 x2 + · · · + βp xp = β T x. (3.2)
The input to this function is x, and the output, here denoted by z, takes values on the entire real line. Note
that we have skipped the noise ε. In classification we are interested in p(y = 1 | x), which however is a
function of x which takes values only within the interval [0, 1] (since p, in this context, is a probability
mass function). The key idea underlying logistic regression is thus to ‘squeeze’ the output from linear
regression z into the interval [0, 1] by using the logistic function (Figure 3.1)
h(z) = e^z / (1 + e^z). (3.3)
Since the logistic function is limited to take values between 0 and 1, we obtain altogether a function from
x to [0, 1], which we can use as a model for p(y = 1 | x),
p(y = 1 | x) = e^{β^T x} / (1 + e^{β^T x}). (3.4a)

Note that this implicitly also gives us that


p(y = 0 | x) = 1 − e^{β^T x} / (1 + e^{β^T x}) = 1 / (1 + e^{β^T x}). (3.4b)

We now have a model for p(y = 1 | x) and p(y = 0 | x), which contains unknown parameters β that can
be learned from training data. That is, we have constructed a binary classifier, which is called logistic
regression.
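To make the model (3.4) concrete, here is a small numerical sketch in Python. The values of β and x below are hypothetical, chosen only for illustration; the first element of x is the constant 1, so β[0] plays the role of the intercept β0.

import numpy as np

# Hypothetical parameters and input (not from the text), only to illustrate (3.4)
beta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, -0.7])

p1 = np.exp(beta @ x) / (1 + np.exp(beta @ x))   # p(y = 1 | x), cf. (3.4a)
p0 = 1 - p1                                       # p(y = 0 | x), cf. (3.4b)
print(p1, p0)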


Figure 3.1: The logistic function h(z) = e^z / (1 + e^z).

Remark 3.1 Despite its name, logistic regression is a method for classification, not regression! The
(confusing) name is due only to historical reasons.

3.2.1 Learning the logistic regression model from training data


By using the logistic function, we have transformed linear regression, a regression method, into logistic
regression, a classification method. The price to pay is that we will not be able to use the handy normal
equations for learning β in logistic regression (as we could for linear regression). Just as in linear
regression, we want to learn β from training data T = {(xi , yi )}ni=1 using the maximum likelihood
approach, that is, solving
β̂ = arg max_β ℓ(β), (3.5)

where ℓ(β) is the likelihood function

ℓ(β) ≜ p(y | X; β). (3.6)

Let us now work out a detailed expression for the likelihood function²,

ℓ(β) = p(y | X; β) = ∏_{i=1}^n p(y_i | x_i; β) = ∏_{i: y_i=1} p(y = 1 | x_i; β) ∏_{i: y_i=0} p(y = 0 | x_i; β)
     = ∏_{i: y_i=1} e^{β^T x_i} / (1 + e^{β^T x_i}) ∏_{i: y_i=0} 1 / (1 + e^{β^T x_i}). (3.7)

This is the function which we would like to optimize with respect to β, cf. (3.5). For numerical reasons,
it is often better to optimize the logarithm of ` (β) (since the logarithm is a monotone function, the
maximizing argument is the same),
log ℓ(β) = ∑_{i: y_i=1} ( β^T x_i − log(1 + e^{β^T x_i}) ) − ∑_{i: y_i=0} log(1 + e^{β^T x_i})
         = ∑_{i=1}^n ( y_i β^T x_i − log(1 + e^{β^T x_i}) ). (3.8)

The simplification in the second equality relies on the chosen labeling, that yi = 0 or yi = 1, which is indeed the reason why this labeling is convenient.
A necessary condition for the maximum of log ℓ(β) is that its gradient is zero,

∇_β log ℓ(β) = ∑_{i=1}^n x_i ( y_i − e^{β^T x_i} / (1 + e^{β^T x_i}) ) = 0. (3.9)

² We now add β to the expression p(y | x), to explicitly show its dependence also on β.


Note that this equation is vector-valued, i.e., we have a system of p + 1 equations to solve (with p + 1
unknown elements of the vector β). Contrary to the linear regression model (with Gaussian noise) in
Section 2.3.1, this maximum likelihood problem results in a nonlinear system of equations, lacking a
general closed form solution. Instead, we are forced to use a numerical solver, as discussed in Appendix B.
The standard choice is to use the Newton–Raphson algorithm (equivalent to the so-called iteratively
reweighted least squares algorithm), see e.g. Hastie, Tibshirani, and Friedman 2009, Chapter 4.4.

Algorithm 1: Logistic regression for binary classification


Data: Training data {x_i, y_i}_{i=1}^n (with output classes y = 0, 1) and test input x⋆
Result: Predicted test output ŷ⋆
Learn
1 Compute β̂ by solving (3.9) numerically.
Predict
2 Compute p(y = 1 | x⋆) (3.4a) and p(y = 0 | x⋆) (3.4b).
3 If p(y = 1 | x⋆) > p(y = 0 | x⋆) set ŷ⋆ ← 1, otherwise set ŷ⋆ ← 0.
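As a rough illustration of Algorithm 1, the sketch below learns β by simple gradient ascent on the log-likelihood (3.8), instead of the Newton–Raphson method mentioned above. The function names, step size and number of iterations are our own illustrative choices, not part of the text.

import numpy as np

def fit_logistic_regression(X, y, n_iter=5000, step=1.0):
    """Gradient ascent on the log-likelihood (3.8). X is (n, p), y is (n,) with labels 0/1."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a 1 so that beta[0] is the intercept
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p1 = 1 / (1 + np.exp(-Xb @ beta))           # p(y = 1 | x_i) for all i, cf. (3.4a)
        beta += step * Xb.T @ (y - p1) / len(y)     # gradient of (3.8), cf. (3.9)
    return beta

def predict_logistic(beta, X_star):
    Xb = np.hstack([np.ones((X_star.shape[0], 1)), X_star])
    p1 = 1 / (1 + np.exp(-Xb @ beta))
    return (p1 > 0.5).astype(int), p1               # class predictions and p(y = 1 | x_star)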

3.2.2 Decision boundaries for logistic regression


So far, we have presented logistic regression as a method for modeling the class probabilities p(y = 0 | x)
and p(y = 1 | x). However, if we want to use logistic regression for actually making a prediction for the
test input x? , i.e., deciding whether we believe y? = 0 or y? = 1, we have to add a final step. First, we
learn β from training data, thereafter compute p(y = 0 | x? ) and p(y = 1 | x? ), and finally we let the
prediction yb? be the most probable class (we will give a motivation for this later in Section 3.4.1),

ŷ⋆ = arg max_{k=0,1} p(y = k | x⋆). (3.10)

This is illustrated in Figure 3.2 for a one-dimensional input x.


A classifier associates all points in the space of possible test inputs x⋆ with a prediction ŷ⋆. Most often,
the classifier forms certain regions which belong to the same prediction. The boundary between those
regions, that is, the curve which separates different class predictions from each other, is called the decision
boundary. A decision boundary for logistic regression is illustrated in Figure 3.2 for a one-dimensional input case,
and in Figure 3.3 for two two-dimensional input cases.
We can find the decision boundary by solving the equation

p(y = 1 | x) = p(y = 0 | x) (3.11)

which with logistic regression gives

e^{β^T x} / (1 + e^{β^T x}) = 1 / (1 + e^{β^T x})  ⇔  e^{β^T x} = 1  ⇔  β^T x = 0. (3.12)

The equation β T x = 0 parameterizes a (linear) hyperplane. Hence, the decision boundaries in logistic
regression always have the shape of a (linear) hyperplane.
We distinguish between different types of classifiers by the shape of their decision boundary: Since
logistic regression only has linear decision boundaries, it is consequently called a linear classifier.


Figure 3.2: Consider binary classification (y = 0 or 1) when the input x is scalar (horizontal axis). Once β is learned from training data (not shown), logistic regression gives us a model for p(y = 1 | x⋆) (blue) and p(y = 0 | x⋆) (red) for any test input x⋆. To turn these modeled probabilities into actual class predictions (ŷ⋆ is either 0 or 1), the class which is modeled to have the highest probability is taken as the prediction. The point(s) where the prediction changes from one class to another is called the decision boundary (dashed vertical line).
(a) Logistic regression for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from different classes, and the intersection between the red and green fields is the decision boundary obtained for the logistic regression classifier learned from the training data.
(b) Logistic regression for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of two classes is still linear.

Figure 3.3: Examples of decision boundaries for logistic regression.


3.2.3 Logistic regression for more than two classes


Logistic regression can be used also if there are more than two classes, K > 2. There are several ways of
generalizing logistic regression to the multi-class problem, and we will follow one path which also will be
useful later in deep learning (Chapter 7). This generalization requires two steps: first, we will introduce
the one-hot encoding, and second, we will replace the logistic function with a softmax function.
Let us start with introducing the one-hot encoding. Instead of letting y be an integer value in the range
{1, . . . , K} (“vanilla encoding”), we replace the output yi with a K-dimensional vector yi . If the original
output is k, then the k:th entry of yi is 1, and the remaining entries zero. For K = 3, it would look like
Vanilla encoding      One-hot encoding
yi = 1                yi = [1 0 0]^T
yi = 2                yi = [0 1 0]^T
yi = 3                yi = [0 0 1]^T
Since we now have a vector-valued output y, we also need a vector-valued alternative to the logistic
function. To this end, we introduce the vector-valued softmax function

softmax(z) ≜ ( 1 / ∑_{j=1}^K e^{z_j} ) [e^{z_1}, e^{z_2}, . . . , e^{z_K}]^T, (3.13)

where z is a K-dimensional input vector z = [z_1 z_2 . . . z_K]^T. The softmax function has the following properties: the sum of its output vector is always 1, and each element is always in the interval [0, 1].
Similarly to how we combined linear regression and the logistic function for the binary classification
problem (3.4), we now combine linear regression and the softmax function to model the class probabilities,

[p(1 | x_i), p(2 | x_i), . . . , p(K | x_i)]^T = softmax(z), where z = [β_1^T x_i, β_2^T x_i, . . . , β_K^T x_i]^T, (3.14)

or equivalently

p(k | x_i) = e^{β_k^T x_i} / ∑_{l=1}^K e^{β_l^T x_i}. (3.15)
This is our multi-class logistic regression. Note that we in this construction use K vectors β 1 , . . . , β K
(one for each class), meaning that the number of parameters to learn grows with K. As for binary logistic
regression, we can learn those parameters using the maximum likelihood idea. We use θ to denote all
parameters in β_1, . . . , β_K. With the one-hot encoding, the likelihood function takes the particular form

log ℓ(θ) = log p(y | X; θ) = ∑_{i=1}^n log p(y_i | x_i; θ)
         = ∑_{i: y_i=1} log p(1 | x_i; θ) + ∑_{i: y_i=2} log p(2 | x_i; θ) + · · · + ∑_{i: y_i=K} log p(K | x_i; θ)
         = ∑_{i=1}^n ∑_{k=1}^K y_{ik} log p(k | x_i; θ). (3.16)

where yik are the elements of the one-hot encoding vectors. We will not pursue any more details here, but
similarly to the binary case this likelihood function can also be used as objective function in numerical
optimization. The particular form of (3.16) will appear every time we use the one-hot encoding, and is
sometimes referred to as cross-entropy.
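As a sketch of how these quantities can be computed in practice, assuming the K parameter vectors are stacked as rows of a matrix B (all names below are illustrative, not from the text):

import numpy as np

def softmax(Z):
    # Row-wise softmax (3.13); subtracting the row maximum improves numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def class_probabilities(B, X):
    # B: (K, p+1) with beta_1, ..., beta_K as rows; X: (n, p) inputs
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return softmax(Xb @ B.T)                        # (n, K) matrix of p(k | x_i), cf. (3.15)

def log_likelihood(B, X, Y_onehot):
    # Cross-entropy form (3.16); Y_onehot is (n, K) with one-hot encoded outputs
    return np.sum(Y_onehot * np.log(class_probabilities(B, X)))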


Time to reflect 3.1: The softmax solution is actually slightly over-parameterized, compared to
binary logistic regression (3.4). That is not a problem in practice, but consider the case K = 2
and see if you can spot it!

Remark 3.2 The one-hot encoding will later be useful in deep learning. We will, however, not use it
for all multiclass methods. LDA, QDA and k-NN, for example, all use the vanilla encoding also for the
multi-class problem.

3.3 Linear and quadratic discriminant analysis (LDA & QDA)


We will now introduce two other classifiers, namely linear and quadratic discriminant analysis (LDA3 and
QDA, respectively). In logistic regression, we used linear regression and the logistic function to model
p(y | x). In LDA and QDA, we instead make the assumption that p(x | y) is a Gaussian distribution. As
it turns out, this will give us a classifier which is easy to learn (requires no numerical optimization, in
contrast to logistic regression), and is useful in practice also when the Gaussian assumption about p(x | y)
is not met.

3.3.1 Using Gaussian approximations in Bayes’ theorem


From probability theory, Bayes’ theorem might be familiar, which says that

p(y | x) = p(x | y) p(y) / ∑_{k=1}^K p(x | k) p(k). (3.17)

The left hand side, p(y | x), is our core interest in classification. In a practical machine learning problem,
neither the left nor the right hand side of (3.17) is known to us; we only have training data, and no one
provides us with any equations. In logistic regression, we went straight to the left hand side and modeled
that as (3.4). In LDA and QDA, instead, we focus on the right hand side, by assuming that p(x | y) has
a Gaussian distribution (no matter what the data actually looks like). Since this is now a distribution
over the input4 x, and x usually has more than one dimension, p(x | y) has to be a multivariate Gaussian
distribution with a mean vector and a covariance matrix.
Of course, we want a classifier which learns something from the training data. That is done by learning
the parameters of the Gaussian distribution, the mean vector µ and the covariance matrix Σ, from the
training data. In LDA the mean vector µ is assumed to be different for each class, but the covariance
matrix Σ is assumed to be the same for all classes. In QDA, on the other hand, both the mean vector μ and the covariance matrix Σ are assumed to be different for each class. Since the mean vectors and the covariance matrices will be learned from data, we will append them with a hat symbol, μ̂ and Σ̂.
The right hand side of Bayes’ theorem (3.17) also includes the factor p(y), which usually is unknown as well. The meaning of this term is the probability that a completely random data point has label y (without knowing its inputs x). As an approximation of p(k), the relative occurrence of class k in the training data, denoted π̂_k, is used: p(k) ≈ π̂_k. For example, if 22% of the training data has label 1, we approximate p(1) as π̂_1 = 0.22, etc.
Thus, in LDA p(y | x) is modeled as

p(y = k | x⋆) = π̂_k N(x⋆ | μ̂_k, Σ̂) / ∑_{j=1}^K π̂_j N(x⋆ | μ̂_j, Σ̂), (3.18)

for k = 1, 2, . . . , K. This is what is shown in Figure 3.4.


³ Not to be confused with Latent Dirichlet Allocation, which is a completely different machine learning method.
⁴ This implicitly assumes that x is quantitative; how qualitative inputs are handled in LDA/QDA is not discussed here.


(a) In LDA, it is assumed that the input x, for a given output y, is distributed as a Gaussian distribution. The mean is different for each class y, but the covariance is the same for all classes. This plot shows what we assume the training data looks like when we derive LDA.
(b) Also in QDA, it is assumed that the input x, for a given output y, is distributed as a Gaussian distribution, but for QDA both the mean and the covariance can differ between the classes. This plot shows what we assume the training data looks like when we derive QDA.

Figure 3.4: LDA and QDA are derived by assuming that p(x | y) has a Gaussian distribution. This means that we think about the input variables as random and assume that they have a certain distribution. In LDA (left panel, a), we assume that the covariance of the input distribution (shape of the level curves) is the same for all classes, and they only differ in locations. QDA (right panel, b) assumes that the covariance can be different for different classes. In fact, when using LDA and QDA in practice, these assumptions on how the inputs x are distributed are rarely satisfied, but this is nevertheless the way we motivate the methods.

In full analogy, for QDA we have (the only difference is the class-dependent covariance matrix Σ̂_k)

p(y = k | x⋆) = π̂_k N(x⋆ | μ̂_k, Σ̂_k) / ∑_{j=1}^K π̂_j N(x⋆ | μ̂_j, Σ̂_j). (3.19)

Note that we have not made any restriction on K; we can use LDA and QDA for binary as well as
multi-class classification. We will now discuss some aspects in more detail.

3.3.2 Using LDA and QDA in practice

We have derived LDA and QDA by studying Bayes’ theorem (3.17). We are ultimately interested in the
left hand side of (3.17), and we went there by making an assumption about the right hand side, namely
that p(x | y) has a Gaussian distribution. In most practical cases that assumption does not hold in reality
(or, at least, it is hard for us to verify whether it holds or not), but both LDA and QDA turn out to be
useful classifiers even when that assumption does not hold.
How do we go about it in practice, if we want to learn an LDA or QDA classifier from training data
{x_i, y_i}_{i=1}^n (without knowing anything about the real distribution p(x | y)) and use it to make a prediction?

Learning the parameters

First, the parameters π̂_k, μ̂_k and Σ̂ (for LDA) or Σ̂_k (for QDA) have, for each k = 1, . . . , K, to be learned, or estimated, from the training data. The perhaps most straightforward parameter to learn is π̂_k, the relative occurrence of class k in the training data,

π̂_k = n_k / n, (3.20a)


where n_k is the number of training data samples in class k. Consequently, all n_k must sum to n, and thereby ∑_k π̂_k = 1. Further, the mean vector μ_k of each class is learned as

μ̂_k = (1/n_k) ∑_{i: y_i=k} x_i, (3.20b)

the empirical mean among all training samples of class k. For LDA, the common covariance matrix Σ for all classes is usually learned as

Σ̂ = 1/(n − K) ∑_{k=1}^K ∑_{i: y_i=k} (x_i − μ̂_k)(x_i − μ̂_k)^T, (3.20c)

which can be shown to be an unbiased estimate of the covariance matrix⁵. For QDA, one covariance matrix Σ_k has to be learned for each class k = 1, . . . , K, usually as

Σ̂_k = 1/(n_k − 1) ∑_{i: y_i=k} (x_i − μ̂_k)(x_i − μ̂_k)^T, (3.20d)

which similarly also can be shown to be an unbiased estimate.
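As an illustration of how (3.20) and (3.18) can be turned into code, here is a minimal sketch of learning and predicting with LDA, assuming the classes are labeled 0, . . . , K−1 (the function names are our own; QDA would be analogous with one covariance matrix per class):

import numpy as np
from scipy.stats import multivariate_normal

def fit_lda(X, y, K):
    n, p = X.shape
    pi_hat, mu_hat, Sigma_hat = np.zeros(K), np.zeros((K, p)), np.zeros((p, p))
    for k in range(K):
        Xk = X[y == k]
        pi_hat[k] = len(Xk) / n                       # (3.20a)
        mu_hat[k] = Xk.mean(axis=0)                   # (3.20b)
        Sigma_hat += (Xk - mu_hat[k]).T @ (Xk - mu_hat[k])
    return pi_hat, mu_hat, Sigma_hat / (n - K)        # (3.20c)

def predict_lda(pi_hat, mu_hat, Sigma_hat, x_star):
    # Unnormalized pi_k * N(x_star | mu_k, Sigma), then normalize, cf. (3.18)
    w = np.array([pi_hat[k] * multivariate_normal.pdf(x_star, mu_hat[k], Sigma_hat)
                  for k in range(len(pi_hat))])
    p = w / w.sum()
    return p.argmax(), p                              # most probable class (3.21) and p(y = k | x_star)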

Remark 3.3 To derive the learning of LDA and QDA, we did not make use of the maximum likelihood
idea, in contrast to linear and logistic regression. Furthermore, learning LDA and QDA amounts to
inserting the training data into the closed-form expressions (3.20), similar to linear regression (the normal
equations), but different from logistic regression (which requires numerical optimization).

Making predictions

Once we have learned the parameters π̂_k, μ̂_k and Σ̂ or Σ̂_k for all classes k = 1, . . . , K, we have a model for p(y | x) ((3.18) and (3.19)) which we can use for making predictions for a test input x⋆. As for logistic regression, we turn p(y | x⋆) into actual predictions ŷ⋆ by taking the most probable class as the prediction,

ŷ⋆ = arg max_k p(y = k | x⋆). (3.21)

We summarize this in Algorithms 2 and 3, and illustrate it in Figures 3.5 and 3.6.

Algorithm 2: Linear Discriminant Analysis, LDA


Data: Training data {x_i, y_i}_{i=1}^n (with output classes k = 1, . . . , K) and test input x⋆
Result: Predicted test output ŷ⋆
Learn
1 for k = 1, . . . , K do
2   Compute π̂_k (3.20a) and μ̂_k (3.20b)
3 end
4 Compute Σ̂ (3.20c)
Predict
5 for k = 1, . . . , K do
6   Compute p(y = k | x⋆) (3.18)
7 end
8 Find the largest p(y = k | x⋆) and set ŷ⋆ to that k

⁵ This means that if we estimate Σ̂ like this for new training data over and over again, the average would be the true covariance matrix of p(x).


Algorithm 3: Quadratic Discriminant Analysis, QDA


Data: Training data {x_i, y_i}_{i=1}^n (with output classes k = 1, . . . , K) and test input x⋆
Result: Predicted test output ŷ⋆
Learn
1 for k = 1, . . . , K do
2   Compute π̂_k (3.20a), μ̂_k (3.20b) and Σ̂_k (3.20d)
3 end
Predict
4 for k = 1, . . . , K do
5   Compute p(y = k | x⋆) (3.19)
6 end
7 Find the largest p(y = k | x⋆) and set ŷ⋆ to that k

Figure 3.5: An illustration of LDA for K = 3 classes, with dimension p = 1 of the input x. The upper left panel shows the Gaussian model of p(x | k), parameterized by μ̂_k and Σ̂. The parameters μ̂_k and Σ̂, as well as π̂_k, are learned from training data, not shown in the figure. (Since p = 1, we only have a scalar variance σ², instead of a covariance matrix Σ.) The upper right panel shows π̂_k, an approximation of p(k). These are used in Bayes’ theorem to compute p(k | x), shown in the bottom panel. We take the final prediction as the class which is modeled to have the highest probability, which means the topmost solid colored line in the bottom plot (e.g., the prediction for x⋆ = 0.7 would be ŷ = 2 (green)). The decision boundaries (vertical dotted lines in the bottom plot) are hence found where the solid colored lines intersect.


Figure 3.6: An illustration of QDA for K = 3 classes, in the same fashion as Figure 3.5. However, in contrast to LDA in Figure 3.5, the learned variance Σ̂_k of p(x | k) is different for different k (upper left panel). For this reason, the resulting decision boundaries (bottom panel) can be more complicated than for LDA; note for instance the small slice of ŷ = 3 (blue) in between ŷ = 1 (red) and ŷ = 2 (green) around −0.5.

Decision boundaries for LDA and QDA

Once we have learned the parameters from training data, we can compute yb? for a test input x? by inserting
everything into (3.18) for each class k, and take the prediction as the class which is predicted to have the
highest probability p(y | x). As it turns out, the equations (3.18) and (3.19) are simple enough so that we
can, by only using pen and paper, say something about the decision boundary, i.e., the boundary (in the
input space) where the predictions shift between different classes.


If we note that neither the logarithm nor terms independent of k change the location of the maximizing argument (arg max_k), we can for LDA write

ŷ_LDA = arg max_k p(y = k | x)
      = arg max_k log p(y = k | x)
      = arg max_k [ log π̂_k + log N(x | μ̂_k, Σ̂) − log ( ∑_{j=1}^K π̂_j N(x | μ̂_j, Σ̂) ) ]
      = arg max_k [ log π̂_k + log N(x | μ̂_k, Σ̂) ]
      = arg max_k [ log π̂_k − ½ log det(2π Σ̂) − ½ (x − μ̂_k)^T Σ̂^{-1} (x − μ̂_k) ]
      = arg max_k [ log π̂_k − ½ μ̂_k^T Σ̂^{-1} μ̂_k + x^T Σ̂^{-1} μ̂_k ] ≜ arg max_k δ_k^LDA(x), (3.22)

where in the last equality the terms −½ log det(2π Σ̂) and −½ x^T Σ̂^{-1} x have been dropped since they do not depend on k.

The function δ_k^LDA(x) on the last row is sometimes referred to as the discriminant function. The points x on the boundary between two class predictions, say k = 0 and k = 1, are characterized by δ_0^LDA(x) = δ_1^LDA(x), i.e., the decision boundary between the two classes 0 and 1 can be written as the set of points x which fulfill

δ_0^LDA(x) = δ_1^LDA(x) ⇔ (3.23)

log π̂_0 − ½ μ̂_0^T Σ̂^{-1} μ̂_0 + x^T Σ̂^{-1} μ̂_0 = log π̂_1 − ½ μ̂_1^T Σ̂^{-1} μ̂_1 + x^T Σ̂^{-1} μ̂_1 ⇔

x^T Σ̂^{-1} (μ̂_0 − μ̂_1) = log π̂_1 − log π̂_0 − ½ ( μ̂_1^T Σ̂^{-1} μ̂_1 − μ̂_0^T Σ̂^{-1} μ̂_0 ), (3.24)

where the right hand side is a constant (independent of x).

From linear algebra, we know that {x : x^T a = c}, for some vector a and some constant c, defines a hyperplane in the x-space. Thus, the decision boundary for LDA is always linear, and hence its name, linear discriminant analysis.
For QDA we can do a similar derivation,

ŷ_QDA = arg max_k [ log π̂_k − ½ log det Σ̂_k − ½ μ̂_k^T Σ̂_k^{-1} μ̂_k + x^T Σ̂_k^{-1} μ̂_k − ½ x^T Σ̂_k^{-1} x ] ≜ arg max_k δ_k^QDA(x), (3.25)

and set δ_0^QDA(x) = δ_1^QDA(x) to find the decision boundary as the set of points x for which

log π̂_0 − ½ log det Σ̂_0 − ½ μ̂_0^T Σ̂_0^{-1} μ̂_0 + x^T Σ̂_0^{-1} μ̂_0 − ½ x^T Σ̂_0^{-1} x
= log π̂_1 − ½ log det Σ̂_1 − ½ μ̂_1^T Σ̂_1^{-1} μ̂_1 + x^T Σ̂_1^{-1} μ̂_1 − ½ x^T Σ̂_1^{-1} x

⇔ x^T ( Σ̂_0^{-1} μ̂_0 − Σ̂_1^{-1} μ̂_1 ) − ½ x^T ( Σ̂_0^{-1} − Σ̂_1^{-1} ) x
= log π̂_1 − log π̂_0 − ½ log det Σ̂_1 + ½ log det Σ̂_0 − ½ ( μ̂_1^T Σ̂_1^{-1} μ̂_1 − μ̂_0^T Σ̂_0^{-1} μ̂_0 ), (3.26)

where the right hand side is a constant (independent of x). This is now of the form {x : x^T a + x^T B x = c}, a quadratic form, and the decision boundary for QDA is thus always quadratic (and thereby also nonlinear!), which is the reason for its name quadratic discriminant analysis.


(a) LDA for K = 2 classes always gives a linear decision boundary. The red dots and green circles are training data from different classes, and the intersection between the red and green fields is the decision boundary obtained for an LDA classifier learned from the training data.
(b) LDA for K = 3 classes. We have now introduced training data from a third class, marked with blue crosses. The decision boundary between any pair of two classes is still linear.
(c) QDA has quadratic (i.e., nonlinear) decision boundaries, as in this example where a QDA classifier is learned from the shown training data.
(d) With K = 3 classes, the decision boundaries for QDA are possibly more complex than with LDA, as in this case (cf. (b)).

Figure 3.7: Examples of decision boundaries for LDA and QDA, respectively. This can be compared to Figure 3.3, where the decision boundary for logistic regression (with the same training data) is shown. LDA and logistic regression both have linear decision boundaries, but they are not identical.


(a) With K = 2 classes, Bayes’ classifier tells us to take the class which has probability > 0.5 as the prediction ŷ. Here, the prediction would therefore be ŷ = 1.
(b) For K = 4 classes, Bayes’ classifier tells us to take the prediction ŷ as the highest bar, which means ŷ = 4 here. (In contrast to K = 2 classes in (a), it can happen that no class has probability > 0.5.)

Figure 3.8: Bayes’ classifier: The probabilities p(y | x) are shown as the height of the bars. Bayes’ classifier says that if we want to make as few misclassifications as possible, on average, we should predict ŷ as the class which has the highest probability.

3.4 Bayes’ classifier — a theoretical justification for turning p(y | x) into ŷ


We have introduced logistic regression, LDA and QDA as three different classifiers. You have probably
noted that they are all constructed as different ways of modeling p(y | x), and we will now reason about
an optimal (but only hypothetical!) classifier, referred to as the Bayes’ classifier. In most cases, Bayes’
classifier can not be implemented (we usually do not have enough information), but it tells us how we
should convert p(y | x) into actual predictions, and gives a coherent framework for the classifiers.

3.4.1 Bayes’ classifier

Assume that we want to design a classifier which, on the average, makes as few misclassification errors as
possible. That means that the predicted output label yb should equal the true output label y for as many test
data points as possible. If we knew the probabilities p(y | x) exactly (in logistic regression, LDA, QDA,
and all other classifiers, we only have a model—a guess—for p(y | x), we never know it exactly), then the
optimal classifier is given by
ŷ = arg max_k p(y = k | x⋆). (3.27)

Or, in words, the optimal classifier predicts yb as the label which has the highest probability given the input
x. The optimal classifier, equation (3.27), is the Bayes’ classifier. This is illustrated by Figure 3.8. Let us
first show why this is optimal, and then discuss how it connects to the other classifiers.

3.4.2 Optimality of Bayes’ classifier

Making, on the average, as many correct predictions as possible means that we want yb (which is not
random, but we can choose it ourselves) to be as likely as possible to equal y (which is random). How
likely yb is to equal y can be expressed mathematically using the expected value over the distribution for the
random variable y, which we write as Ey∼p(y | x) . Using the indicator function I{} (one when its argument
is true, otherwise zero) we can write

E_{y∼p(y | x)} [ I{ŷ = y} ] = ∑_{k=1}^K I{ŷ = k} p(y = k | x) = p(ŷ | x), (3.28)

where we used the definition of expected value in the first step, and ignored all terms equal to zero in the second step. In order to make this quantity as large as possible, we can see that we should select ŷ such that p(ŷ | x) is as large as possible, which also is what we claimed in (3.27).


3.4.3 Bayes’ classifier in practice: useless, but a source of inspiration


Bayes’ classifier (3.27) makes use of the probability distribution p(y | x), and thereby also assumes that we
know it. If we happen to have a problem where we actually know p(y | x), we can of course use it, and we
should look no further for other methods, since Bayes’ classifier is optimal. In most machine learning
problems, however, we do not know p(y | x). In fact, the entire point of machine learning is that we know
very little about how y depends on x, other than what the training data tells us!
This does, however, not mean that Bayes’ classifier is a useless concept. In fact, most of the classifiers
in these notes can be understood as various approximations of Bayes’ classifier, or put differently, that
we have methods to learn (or rather estimate) p(y | x) from the training data. Even though we have not
introduced all methods yet, we give a brief overview of how some classifiers relate to (3.27):

• In binary logistic regression p(y | x) is modeled as

  p(y = 1 | x) = exp(β^T x) / (1 + exp(β^T x)),  p(y = 0 | x) = 1 / (1 + exp(β^T x)). (3.29)

• In linear and quadratic discriminant analysis (LDA and QDA), p(y | x) is computed by Bayes’
theorem (3.17), in which p(x | y) is assumed to be a Gaussian distribution (with mean and variance
learned from the training data) and p(y) is taken as the empirical distribution of the training data.

• In k-nearest neighbor (k-NN), p(y | x) is modeled as the empirical distribution in the k-nearest
samples in the training data.

• In tree-based methods, p(y | x) is modeled as the empirical distribution among the training data
samples in the same leaf node.

• In deep learning, p(y | x) is modeled using a deep neural network and a softmax function.

All these classifiers use different ways to model/learn/approximate p(y | x), and the default choice6 is
thereafter to predict yb according to Bayes’ classifier (3.27), that is, pick the prediction as the class y which
is modeled to have the highest probability p(y | x). With only two classes this means that the prediction is
taken as the class which is modeled to have probability > 0.5.

3.4.4 Is it always good to predict according to Bayes’ classifier?


Even though Bayes’ classifier usually is not available to us in practice (we only have the data, and it does
not tell us what p(y | x) looks like), it can still be used as an argument when turning a modeled/learned
approximation of p(y | x) into a prediction yb. In fact, we have already done so throughout this chapter
without questioning it further. Does Bayes’ classifier mean that we should always choose the prediction yb
as the class which our model assigns the highest probability? No. Even though (3.27) is a sensible default
option to explore, we should be aware of the following facts

• Bayes’ classifier is optimal only if the goal is to make as few misclassifications as possible. It might
sound like an obvious goal, but that is not always the case! If we are to predict the health status of a
patient, falsely predicting yb =‘well’ might be much more severe than falsely predicting yb =‘bad’
(or, perhaps, vice versa?). Sometimes the goal is therefore asymmetric, and Bayes’ classifier (3.27)
is not optimal for such situations.

• Bayes’ classifier is guaranteed to be optimal only if we know p(y | x) exactly. However, when
we only have an approximation of p(y | x), it is not guaranteed that (3.27) is the best thing to do
anymore.
6
Sometimes this is not very explicit in the method, but if you look carefully, you will find it.


3.5 More on classification and classifiers


3.5.1 Linear and nonlinear classifiers
A regression model which is linear in its parameters is called linear regression (Chapter 2). For the
classification problem, the term “linear” is used differently; a linear classifier is a classifier whose decision
boundary (for the problem with K = 2 classes) is linear, and a nonlinear classifier is a classifier which can
have a nonlinear decision boundary. Among the classifiers introduced in this chapter, logistic regression
and LDA are linear classifiers, whereas QDA is a nonlinear classifier, cf. Figure 3.3 and 3.7. Note that
even though logistic regression and LDA both are linear classifiers, their decision boundaries are not
identical. All classifiers that will follow in the subsequent chapters, except for decision stumps (Chapter
6), will be nonlinear.
As for linear regression (Section 2.4), it is possible to include nonlinear transformations of the inputs to create more features. With such transformations, the (seemingly inflexible?) linear classifier can obtain rather complicated decision boundaries. It requires, however, the manual crafting and selection of nonlinear transformations. Instead, a more often used (and, importantly, more automatic) approach to build a complicated classifier from a simple one is boosting, which is introduced in Chapter 6.

3.5.2 Regularization
As with linear regression (Section 2.6), overfitting might be a problem if n (the number of training data samples) is not much bigger than p (the number of inputs). We will define and discuss overfitting in more detail in Chapter 5. However, regularization can be useful also in classification to avoid overfitting. A common regularization approach for logistic regression is a ridge regression-like penalty on β, cf. (2.28). For LDA and QDA, it can be useful to regularize the covariance matrix estimates ((3.20c) and (3.20d)).

3.5.3 Evaluating binary classifiers


An important use of binary classification, i.e. K = 2, is to detect the presence of something, such as a
disease, an object on the radar, etc. The convention is to let y = 1 (”positive”) denote presence, and y = 0
(”negative”) denote absence. Such applications have the important characteristics that

(i) Most of the data is usually y = 0, meaning that a classifier which always predicts ŷ = 0 might
score well if we only care about the number of correct classifications (accuracy). Indeed, a medical
support system which always predicts ”healthy” is probably correct most of the time, but nevertheless
useless.

(ii) A missed detection (predicting ŷ = 0, when in fact y = 1) might have much more severe consequences
than a false detection (predicting ŷ = 1, when in fact y = 0).

For such classification problems, there is a set of analysis tools and terminology which we will introduce
now.


Ratio Name
FP/N False positive rate, Fall-out, Probability of false alarm
TN/N True negative rate, Specificity, Selectivity
TP/P True positive rate, Sensitivity, Power, Recall, Probability of detection
FN/P False negative rate, Miss rate
TP/P* Positive predictive value, Precision
FP/P* False discovery rate
TN/N* Negative predictive value
FN/N* False omission rate
P/n Prevalence
(TN+TP)/n Accuracy

Table 3.1: Common terminology related to the quantities (TN, FN, FP, TP) in the confusion matrix.

Confusion matrix

If one learns a binary classifier and evaluates it on a test dataset, a simple yet useful way to visualize the
result is a confusion matrix. By separating the test data in four groups depending on y (the actual output)
and yb (the output predicted by the classifier), we can make the following table

            y = 0              y = 1              total
ŷ = 0       True neg (TN)      False neg (FN)     N*
ŷ = 1       False pos (FP)     True pos (TP)      P*
total       N                  P                  n
Of course, TN, FN, FP, TP (and also N*, P*, N, P and n) should be replaced by the actual numbers, as will
be seen in the next example. There is also a wide body of terminology related to the confusion matrix,
which is summarized in Table 3.1.
The confusion matrix provides a quick and informative overview of the characteristics of a classifier.
Depending on the application, it might be important to distinguish between false positive (FP, also called
type I error) and false negative (FN, also called type II error). Ideally they both should be 0, but that is
rarely the case in practice.
With the Bayes’ classifier as a motivation, our default choice has been to convert p(y = 1 | x) into predictions as

  ŷ = 1 if p(y = 1 | x) ≥ t,   ŷ = 0 if p(y = 1 | x) < t, (3.30)

with t = 0.5 as a threshold. If we, however, are interested in decreasing the false positive rate (at the expense of an increased false negative rate), we may consider raising the threshold t, and vice versa.
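A minimal sketch of how the four numbers in the confusion matrix can be computed for a given threshold t, assuming we have the true labels y and the modeled probabilities p(y = 1 | x) for a test set (the names below are illustrative):

import numpy as np

def confusion_matrix(y, p1, t=0.5):
    # y: true labels in {0, 1}; p1: modeled p(y = 1 | x); t: threshold as in (3.30)
    y_hat = (p1 >= t).astype(int)
    TN = int(np.sum((y_hat == 0) & (y == 0)))
    FN = int(np.sum((y_hat == 0) & (y == 1)))
    FP = int(np.sum((y_hat == 1) & (y == 0)))
    TP = int(np.sum((y_hat == 1) & (y == 1)))
    return TN, FN, FP, TP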

ROC curve

As suggested by the example above, the tuning of the threshold t in (3.30) can be crucial for the performance
in binary classification. If we want to compare different classifiers (say, logistic regression and QDA) for a
certain problem beyond the specific choice of t, the ROC curve can be useful. The abbreviation ROC
means ”receiver operating characteristics”, and is due to its history from communications theory.
To plot an ROC curve, the true positive rate (TP/P) is drawn against the false positive rate (FP/N) for
all values of t ∈ [0, 1]. The curve typically looks as shown in Figure 3.9. An ROC curve for a perfect
classifier (always predicting the correct value with full certainty) touches the upper left corner, whereas a
classifier which only assigns random guesses gives a straight diagonal line.
A compact summary of the ROC curve is the area under the ROC curve, AUC. From Figure 3.9, we
conclude that a perfect classifier has AUC 1, whereas a classifier which only assigns random guesses has
AUC 0.5.
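A sketch of how the ROC curve can be traced out by sweeping the threshold t and recording the true and false positive rates; the grid of thresholds below is an arbitrary choice.

import numpy as np

def roc_points(y, p1, thresholds=np.linspace(0, 1, 101)):
    P, N = np.sum(y == 1), np.sum(y == 0)
    fpr, tpr = [], []
    for t in thresholds:
        y_hat = (p1 >= t).astype(int)
        tpr.append(np.sum((y_hat == 1) & (y == 1)) / P)   # TP/P, true positive rate
        fpr.append(np.sum((y_hat == 1) & (y == 0)) / N)   # FP/N, false positive rate
    return np.array(fpr), np.array(tpr)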


Example 3.2: Confusion matrix in thyroid disease detection

The thyroid is an endocrine gland in the human body. The hormones it produces influences the
metabolic rate and the protein synthesis, and thyroid disorders may have serious implications. We
consider the problem of detecting thyroid diseases, using the dataset provided by UCI Machine
Learning Repository (Dheeru and Karra Taniskidou 2017). The dataset contains 7200 data points,
each with 21 medical indicators as inputs (both qualitative and quantitative). It also contains the
qualitative diagnosis {normal, hyperthyroid, hypothyroid}, which we convert into the binary
problem with only {normal, not normal} as outputs. The dataset is split into a training and test
part, with 3772 and 3428 samples respectively. We train a logistic regression classifier on the
training dataset, and use it for predicting the test dataset (using the default t = 0.5), and obtain the
following confusion matrix:
y = normal y = not normal
yb = normal 3177 237
yb = not normal 1 13
Most test data points are correctly predicted as normal, but a large part of the not normal data is
also falsely predicted as normal. This might indeed be undesired in the application.
To change the picture, we change the threshold to t = 0.15, and obtain new predictions with the
following confusion matrix instead:
y = normal y = not normal
yb = normal 3067 165
yb = not normal 111 85
This change gives a significantly better true positive rate (85 instead of 13 patients are correctly
predicted as not normal), but this happens at the expense of a worse false positive rate (111,
instead of 1, patients are now falsely predicted as not normal). Whether it is a good trade-off
depends, of course, on the specifics of the application: which type of error has the most severe
consequences?
For this problem, only considering the total accuracy (misclassification rate) would not be very
informative. In fact, the useless predictor of always predicting normal would give an accuracy of
almost 93%, whereas the second confusion matrix above corresponds to an accuracy of 92%, even
though it would probably be much more useful in practice.

Figure 3.9: ROC curve. The true positive rate (vertical axis) is plotted against the false positive rate (horizontal axis), traced out by varying the threshold t, for a typical example, a perfect classifier, and a random guess.

4 Non-parametric methods for regression and
classification: k-NN and trees
The methods (linear regression, logistic regression, LDA and QDA) we have encountered so far all have a
fixed set of parameters. The parameters are learned from the training data, and once the parameters are
learned and stored, the training data is not used anymore and could be discarded. Furthermore, all those
methods have had a fixed structure; if the amount of training data increases, the parameters can be estimated
more accurately, with smaller variance, but the flexibility or expressiveness of the model does not increase;
logistic regression can only describe linear decision boundaries, no matter how much training data is
available.
There exists another class of methods, not relying on a fixed structure and set of parameters, but which
adapts more to the training data. Two methods in this class, which we will encounter now, are k-nearest
neighbors (k-NN) and tree-methods. They can both be used for classification as well as regression, but we
will focus our presentation on the classification problem.

4.1 k-NN
The name k-nearest neighbors (k-NN) is almost self-explanatory. To approximate p(y | x⋆), the proportion
of the different classes among the k training data points closest to x⋆ is used. The value k is
a user-chosen integer ≥ 1 which controls the properties of the classifier, as discussed below. Formally,
we define the set R⋆ = {i : x_i is one of the k training data points closest to x⋆} and can then express the
classifier as
p(y = j | x⋆) = (1/k) ∑_{i∈R⋆} I{y_i = j} (4.1)

for j = 1, 2, . . . , K. Following Bayes’ classifier and always predicting the class which has the largest
probability, k-NN simply amounts to a majority vote among the k nearest training data points. We explain
this in the example on the next page.
Note that in contrast to the previous methods that we have discussed there is no parametric model to be
trained, with which we then compute the prediction yb. Therefore, we talk about k-NN as a non-parametric
method. Instead, yb depends on the training data in a more direct fashion. The k-NN method can be
summarized in the following algorithm.

Algorithm 4: k-nearest neighbor, k-NN


Data: Training data {x_i, y_i}_{i=1}^n (with output classes 1, . . . , K) and test input x⋆
Result: Predicted test output ŷ⋆
1 Find the k training data points x_i which have the shortest Euclidean distance ‖x_i − x⋆‖ to x⋆
2 Decide ŷ⋆ with a majority vote among those k nearest neighbors
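A minimal sketch of Algorithm 4 in code, assuming the classes are labeled 0, . . . , K−1 and the inputs are numerical (ties are broken towards the lowest class label; the function name is our own):

import numpy as np

def knn_classify(X_train, y_train, x_star, k, K):
    dist = np.linalg.norm(X_train - x_star, axis=1)        # Euclidean distances to x_star
    nearest = np.argsort(dist)[:k]                          # indices of the k nearest neighbors
    p = np.array([np.mean(y_train[nearest] == j) for j in range(K)])   # (4.1)
    return p.argmax(), p                                    # majority vote and p(y = j | x_star)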


Example 4.1: Predicting colors with k-NN

We are given a training data set with n = 6 observations of p = 2 input variables x1 , x2 and one
(qualitative) output y, the color Red or Blue,
i x1 x2 y
1 −1 3 Red
2 2 1 Blue
3 −2 2 Red
4 −1 2 Blue
5 −1 0 Blue
6 1 1 Red

and we are interested in predicting the output for x? = [1 2]T . For this purpose, we will explore
two different k-NN classifiers, one using k = 1 and one using k = 3.
First, we compute the Euclidean distance ‖x_i − x⋆‖ between each training data point x_i and the
test data point x⋆, and then sort the data points in ascending order of distance.

i   ‖x_i − x⋆‖   y_i
6   √1            Red
2   √2            Blue
4   √4            Blue
1   √5            Red
5   √8            Blue
3   √9            Red

Since the closest training data point to x⋆ is the data point i = 6 (Red), it means that for k-NN
with k = 1, we get the model p(Red | x⋆) = 1 and p(Blue | x⋆) = 0. This gives the prediction
ŷ⋆ = Red.
Further, for k = 3, the 3 nearest neighbors are i = 6 (Red), i = 2 (Blue), and i = 4 (Blue),
which gives the model p(Red | x⋆) = 1/3 and p(Blue | x⋆) = 2/3. The prediction, which also can be
seen as a majority vote among those 3 training data points, thus becomes ŷ⋆ = Blue.
This is also illustrated by the figure above where the training data points xi are represented with
red squares and the blue triangles depending on which class they belong to. The test data point x?
is represented with a black filled circle. For k = 1 the closest training data point is identified by
the inner circle and for k = 3 the three closest points are identified by the outer circle.

4.1.1 Decision boundaries for k-NN

In Example 4.1 we only computed a prediction for one single test data point x⋆. If we shifted that
test point one step to the left, to x⋆^alt = [0 2]^T, the three closest training data points would still include
i = 6 and i = 4, but now i = 2 is exchanged for i = 1. For k = 3 this would give the approximation
p(Red | x⋆^alt) = 2/3 and we would predict ŷ = Red. In between these two test data points x⋆ and x⋆^alt, at
[0.5 2]^T, it is equally far to i = 1 as to i = 2, and this point would consequently be located on the decision
boundary between the two classes. Continuing this way of reasoning we can sketch the full decision
boundaries in Example 4.1, which are displayed in Figure 4.1. Obviously, k-NN is not restricted to linear
decision boundaries and is therefore a nonlinear classification method.


(a) k = 1   (b) k = 3

Figure 4.1: Decision boundaries for the problem in Example 4.1 for the two choices of the parameter k.

4.1.2 Choosing k
The user has to decide on which k to use in k-NN and this decision has a big impact on the final classifier.
In Figure 4.2 another scenario is illustrated with p = 2 input variables, K = 3 classes and significantly
more training data samples. In the two subfigures the decision boundaries for a k-NN classifier with k = 1
and k = 11 are illustrated.
By definition, with k = 1 all training data points are classified correctly and the boundaries are more
adapted to what the training data exactly looks like (including its ’noise’ and other random effects). With
the averaging procedure that takes place for the k-NN classifier with k = 11, some training data points
end up in the wrong region and the decision boundaries are less adapted to this specific realization of
the training data. Even though k-NN with k = 1 fits all training data points perfectly, the one with
k = 11 might be preferred, since it is less prone to overfit to the training data meaning there are good
reasons to believe this model would perform better on test data. A systematic way of choosing k is to use
cross-validation, and we will discuss these aspects more in Chapter 5.
(a) Decision boundary for k-NN with k = 1 for a 3-class problem. A complex, but possibly also overfitted, decision boundary.
(b) Decision boundary for k-NN with k = 11. A more rigid and less flexible decision boundary.

Figure 4.2: Decision boundaries for k-NN.

4.1.3 Normalization
Finally, one practical aspect crucial for k-NN that is worth mentioning is the importance of normalization
of the input data. Since k-NN is based on the Euclidean distances between points, it is important that
these distances are a valid measure of the closeness between two data points. Imagine a training data


set with two input variables x_i = [x_i1, x_i2]^T where all values of x_i1 are in the range [0, 100] and the values for x_i2 in the much smaller range [0, 1]. This could for example be the case if x_i1 and x_i2 represent different physical quantities (where the values can be quite different, depending on which unit is used). In this case the Euclidean distance between a test point x⋆ and a training data point, ‖x_i − x⋆‖ = √( (x_i1 − x⋆1)² + (x_i2 − x⋆2)² ), would almost only depend on the first term (x_i1 − x⋆1)², and the values of the second component x_i2 would have a small impact.
One way to overcome this problem is to divide the first component by 100 and create x_i1^new = x_i1/100, such that both components are in the range [0, 1]. More generally, this normalization procedure for the input data can be written as

x_ij^new = ( x_ij − min(x_ij) ) / ( max(x_ij) − min(x_ij) ), ∀j = 1, . . . , p, i = 1, . . . , n. (4.2)

Another popular way of normalizing is by using the mean and standard deviation in the training data:

x_ij^new = ( x_ij − x̄_j ) / σ_j, ∀j = 1, . . . , p, i = 1, . . . , n, (4.3)

where x̄_j and σ_j are the mean and standard deviation of each input variable, respectively.
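Both normalization schemes are straightforward to implement; a small sketch is given below (the statistics should be computed on the training data and then reused for the test data).

import numpy as np

def minmax_normalize(X):
    # (4.2): scale every input variable to the range [0, 1]
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)

def standardize(X):
    # (4.3): subtract the mean and divide by the standard deviation of every input variable
    return (X - X.mean(axis=0)) / X.std(axis=0)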

4.2 Trees
Tree-based methods divide the input space into different regions. Within each region, p(y | x) is modeled
as the empirical distribution among the training data samples in the same region. The rules that divide the
input space can be summarized in a tree, and hence these methods are known as decision trees. Trees can
be used for both regression and classification problems. Here, our description will focus on classification
trees.

4.2.1 Basics
In a classification tree the function p(y | x) is modeled with a series of rules on the input variables
x1 , . . . , xp . These rules can be represented by a binary tree. This tree effectively divides the input space
into multiple regions and in each region a constant value for the predicted class probability p(y | x) is
assigned. We illustrate this with an example.


Example 4.2: Predicting colors with a classification tree

Consider a problem with two input variables x1 and x2 and one qualitative output y, the color
Red or Blue. A classification tree for this problem can look like the one below. To use this tree
to classify a new point x⋆ = [x⋆1, x⋆2]^T we start at the top and work our way down until
we reach the end of a branch. Each such final branch corresponds to a constant predicted class
probability p(Red | x⋆).
The tree: the root node splits at x2 < 3.0; its left branch is the leaf R1 with p(Red | x⋆) = 0, and its right branch splits again at x1 < 5.0 into the leaves R2 with p(Red | x⋆) = 1/3 and R3 with p(Red | x⋆) = 1.

Left figure: A classification tree. At each internal node a rule of the form xj < sk indicates the left branch coming from that split, and the right branch then consequently corresponds to xj ≥ sk. This tree has two internal nodes and three leaf nodes.
Right figure: A region partition of the classification tree. Each region corresponds to a leaf node in the classification tree to the left, and each border between regions corresponds to a split in the tree. Each region is marked with the color representing the highest predicted class probability.
Pseudocode for classifying a test input with the tree above would look like

if x_2 < 3.0 then
    return p(Red|x) = 0
else
    if x_1 < 5.0 then
        return p(Red|x) = 1/3
    else
        return p(Red|x) = 1
    end
end

As an example, if we have x⋆ = [2.5, 3.5]^T, in the first split we would take the right branch since
x⋆2 = 3.5 ≥ 3.0, and in the second split we would take the left branch since x⋆1 = 2.5 < 5.0.
Consequently, for this test point we would get p(Red | x⋆) = 1/3 and hence p(Blue | x⋆) = 2/3.
This classification tree can also be represented as splitting the input space into multiple rectangle-shaped
regions, as illustrated in the region partition figure above.

To set the terminology, the endpoint of each branch R1 , R2 and R3 in the example are called leaf nodes
and the internal splits, x2 < 3.0 and x1 < 5.0 are known as internal nodes. The lines that connect the
nodes are referred to as branches. Note that in an example with more than two input variables, the region
partition (right figure in the example) is difficult to draw, but the tree will work out in the same way with
exactly two branches coming out of each internal node.
This example illustrates how a classification tree can be used to make predictions, but how do we learn
the tree from training data? This will be explained in the next section.


4.2.2 Training a classification tree


More mathematically, the classification tree models the class probability p(k|x) as a constant cmk in each
region Rm and for each class k = 1, 2, . . . , K:
p(y = k | x) = ∑_{m=1}^M c_mk I{x ∈ R_m}, (4.4)

where M is the total number of regions (leaf nodes) in the tree and where I{x ∈ R_m} = 1 if x ∈ R_m and 0 otherwise. Since the probabilities should sum to 1 in each region we also have the constraint ∑_{k=1}^K c_mk = 1.
The overall goal in constructing a classification tree based on training data {xi , yi }ni=1 is to find a tree
that makes the observed training data as likely as possible. This approach is known as the maximum
likelihood method, which we also used to derive a solution to the logistic regression problem previously.
Maximizing the likelihood is equivalent to minimizing the negative logarithm of the likelihood.
Therefore, we want to find a tree T which minimizes the following expression:

− log ℓ(T) = − log p(y | X, T) = − ∑_{i=1}^n log p(y_i | x_i) = − ∑_{i=1}^n ∑_{k=1}^K I{y_i = k} log p(k | x_i). (4.5)

By inserting the model stated in (4.4) into (4.5), we get


− log ℓ(T) = − ∑_{i=1}^n ∑_{k=1}^K ∑_{m=1}^M log(c_mk) I{y_i = k} I{x_i ∈ R_m}
           = − ∑_{m=1}^M n_m ∑_{k=1}^K ( (1/n_m) ∑_{i: x_i ∈ R_m} I{y_i = k} ) log c_mk
           = − ∑_{m=1}^M n_m ∑_{k=1}^K π̂_mk log c_mk. (4.6)

Here π̂_mk = (1/n_m) ∑_{i: x_i ∈ R_m} I{y_i = k} is the proportion of training data points in region R_m that are from class k, with n_m being the total number of training data points in region m. We can show¹ that
− ∑_{k=1}^K π̂_mk log c_mk = ∑_{k=1}^K π̂_mk log( π̂_mk / c_mk ) − ∑_{k=1}^K π̂_mk log π̂_mk ≥ − ∑_{k=1}^K π̂_mk log π̂_mk, (4.7)

since the first term on the right hand side is ≥ 0; the inequality is fulfilled with equality if c_mk = π̂_mk. Hence, minimizing (4.6) with respect to c_mk gives c_mk = π̂_mk. It remains to find the regions R_m, and based on the discussion above we want to select the regions in order to minimize

min ∑_{m=1}^M n_m Q_m(T), where Q_m(T) = − ∑_{k=1}^K π̂_mk log π̂_mk (4.8)

¹ We use the so-called log sum inequality and the two constraints ∑_{k=1}^K c_mk = 1 and ∑_{k=1}^K π̂_mk = 1 for all m = 1, . . . , M.


is known as the entropy for region m.2


Finding the best tree T that minimizes (4.8) is, unfortunately, a combinatorial problem and hence
computationally infeasible. Instead, we choose a greedy algorithm known as recursive binary splitting
which does the minimization for each node split separately (instead of optimizing the entire tree at the
same time). This approach starts in the top of the tree and successively splits the input where each split
divides one branch into two new branches, and builds a tree similar to what we saw in the example above.
This approach is greedy since it builds the tree by introducing only one split at a time, without having the
full tree ’in mind’.
Consider the setting when we are about to do our first split into a pair of half-planes

R1 (j, s) = {x|xj ≤ s} and R2 (j, s) = {x|xj > s}.

The split depends on the index j of the input variable at which the split is performed and the cutpoint s.
The corresponding proportions π̂_mk will also depend on j and s:

π̂_1k(j, s) = (1/n_1) ∑_{i: x_i ∈ R_1(j,s)} I{y_i = k},   π̂_2k(j, s) = (1/n_2) ∑_{i: x_i ∈ R_2(j,s)} I{y_i = k}.

We seek the splitting variable j and cutpoint s that solve

min_{j,s} [ n_1 ( − ∑_{k=1}^K π̂_1k(j, s) log π̂_1k(j, s) ) + n_2 ( − ∑_{k=1}^K π̂_2k(j, s) log π̂_2k(j, s) ) ]. (4.9)

For each input variable we can scan through the finite number of possible splits and pick the pair (j, s)
which minimizes (4.9). After that, we repeat the process to create new splits by finding the best values
(j, s) for each of the new branches. We continue the process until some stopping criterion is reached, for
example until no region contains more than five training data points.
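A sketch of one step of recursive binary splitting: scan all input variables and candidate cutpoints, and return the pair (j, s) minimizing (4.9). The choice of candidate cutpoints (midpoints between observed values) is one common convention, not prescribed by the text; the entropy uses the natural logarithm, as in the example below.

import numpy as np

def entropy(y):
    # Q_m(T) in (4.8): entropy of the class proportions among the labels in y
    _, counts = np.unique(y, return_counts=True)
    pi = counts / counts.sum()
    return -np.sum(pi * np.log(pi))

def best_split(X, y):
    best_j, best_s, best_cost = None, None, np.inf
    n, p = X.shape
    for j in range(p):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:            # candidate cutpoints
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            cost = len(left) * entropy(left) + len(right) * entropy(right)   # (4.9)
            if cost < best_cost:
                best_j, best_s, best_cost = j, s, cost
    return best_j, best_s, best_cost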
The tree in Example 4.2 has been constructed based on the methodology outlined above, which we will
illustrate in the example below.

Example 4.3: Learning a classification tree (continuation of Example 4.2)

We consider the same setup as in Example 4.2 with the following dataset
x1    x2    y
9.0   2.0   Blue
1.0   4.0   Blue
4.0   6.0   Blue
4.0   1.0   Blue
1.0   2.0   Blue
1.0   8.0   Red
6.0   4.0   Red
7.0   9.0   Red
9.0   8.0   Red
9.0   6.0   Red

We want to learn a classification tree, by using the entropy criterion in (4.8) and growing the tree
until there are no regions with more than five data points left.
First split: There are infinitely many possible splits we can make, but all splits which give the
same partition of the data points are equivalent. Hence, in practice we only have nine different

² If any π̂_mk is equal to 0, the term 0 log 0 is taken to be zero, which is also consistent with the limit lim_{r→0+} r log r = 0.


splits to consider in this dataset. The data and these splits (dashed lines) are visualized in the figure
above.
We consider all nine splits in turn. We start with the split at x1 = 2.5, which splits the input
space into the two regions R1 = {x1 < 2.5} and R2 = {x1 ≥ 2.5}. In region R1 we have two blue
data points and one red, in total n1 = 3 data points. The proportions of the two classes in region R1
will therefore be π̂1B = 2/3 and π̂1R = 1/3. The entropy is calculated as
\[
  Q_1(T) = -\widehat{\pi}_{1B} \log(\widehat{\pi}_{1B}) - \widehat{\pi}_{1R} \log(\widehat{\pi}_{1R}) = -\tfrac{2}{3}\log\left(\tfrac{2}{3}\right) - \tfrac{1}{3}\log\left(\tfrac{1}{3}\right) = 0.64. \qquad (4.10)
\]
In region R2 we have n2 = 7 data points with the proportions π̂2B = 3/7 and π̂2R = 4/7. The
entropy for this region will be
\[
  Q_2(T) = -\widehat{\pi}_{2B} \log(\widehat{\pi}_{2B}) - \widehat{\pi}_{2R} \log(\widehat{\pi}_{2R}) = -\tfrac{3}{7}\log\left(\tfrac{3}{7}\right) - \tfrac{4}{7}\log\left(\tfrac{4}{7}\right) = 0.68 \qquad (4.11)
\]
and the total weighted entropy for this split becomes

n1 Q1 (T ) + n2 Q2 (T ) = 3 · 0.64 + 7 · 0.68 = 6.69. (4.12)

We compute the cost for all other splits in the same manner, and summarize it in the table below.

Split (R1)   n1   π̂1B   π̂1R   Q1(T)   n2   π̂2B   π̂2R   Q2(T)   n1 Q1(T) + n2 Q2(T)
x1 < 2.5     3    2/3   1/3   0.64    7    3/7   4/7   0.68    6.69
x1 < 5.0     5    4/5   1/5   0.50    5    1/5   4/5   0.50    5.00
x1 < 6.5     6    4/6   2/6   0.64    4    1/4   3/4   0.56    6.07
x1 < 8.0     7    4/7   3/7   0.68    3    1/3   2/3   0.64    6.69
x2 < 1.5     1    1/1   0/1   0.00    9    4/9   5/9   0.69    6.18
x2 < 3.0     3    3/3   0/3   0.00    7    2/7   5/7   0.60    4.18
x2 < 5.0     5    4/5   1/5   0.50    5    1/5   4/5   0.50    5.00
x2 < 7.0     7    5/7   2/7   0.60    3    0/3   3/3   0.00    4.18
x2 < 8.5     9    5/9   4/9   0.69    1    0/1   1/1   0.00    6.18

From the table we can read that the two splits at x2 < 3.0 and x2 < 7.0 are both equally good. We
choose to continue with x2 < 3.0.

[Figure: left, the partition after the first split (region R1 below x2 = 3.0), with the candidate splits for the second step shown as dashed lines; right, the partition after the second split into the regions R1, R2 and R3.]

Second split: We notice that only R2 has more than five data points. Also, there is no point in
splitting region R1 further, since it only contains data points from the same class. In the next
step we therefore split the second region into two new regions R2 and R3 . All possible splits are
displayed above to the left (dashed lines) and we compute their cost in the same manner as before.


Split (R2)   n2   π̂2B   π̂2R   Q2(T)   n3   π̂3B   π̂3R   Q3(T)   n2 Q2(T) + n3 Q3(T)
x1 < 2.5     2    1/2   1/2   0.69    5    1/5   4/5   0.50    3.89
x1 < 5.0     3    2/3   1/3   0.64    4    0/4   4/4   0.00    1.91
x1 < 6.5     4    2/4   2/4   0.69    3    0/3   3/3   0.00    2.77
x1 < 8.0     5    2/5   3/5   0.67    2    0/2   2/2   0.00    3.37
x2 < 5.0     2    1/2   1/2   0.69    5    1/5   4/5   0.50    3.89
x2 < 7.0     4    2/4   2/4   0.69    3    0/3   3/3   0.00    2.77
x2 < 8.5     6    2/6   4/6   0.64    1    0/1   1/1   0.00    3.82

The best split is the one at x1 < 5.0 visualized above to the right. The final tree and partition were
displayed in Example 4.2. None of the three regions has more than five data points. Therefore, we
terminate the training.
If we want to use the tree for prediction, we get p(Red | x⋆) = π̂1R = 0 if x⋆ falls into region R1,
p(Red | x⋆) = π̂2R = 1/3 if it falls into region R2, or p(Red | x⋆) = π̂3R = 1 if it falls into region R3, in
the same manner as displayed in Example 4.2.

4.2.3 Other splitting criteria

There are other splitting criteria that can be considered instead of the entropy (4.8). One simple alternative
is the misclassification rate

\[
  Q_m(T) = 1 - \max_{k} \widehat{\pi}_{mk}, \qquad (4.13)
\]

which is simply the proportion of data points in region Rm which do not belong to the most common class.
This sounds like a reasonable choice, since it is often the misclassification rate that we use to evaluate the
final classifier. However, one drawback with it is that it does not favor pure nodes to the same extent as the
entropy criterion does. By pure nodes we mean nodes where most data points belong to a certain class. It
is usually an advantage to favor pure nodes in the greedy procedure that we use to grow the tree, since it
can lead to fewer splits in total.
For example, consider the first split in Example 4.3. If we were to use the misclassification rate as
splitting criterion, the splits x2 < 5.0 and x2 < 3.0 would both give a total misclassification rate
of 0.2. However, the split at x2 < 3.0, which the entropy criterion favored, provides a pure node R1. If
we were to go with the split x2 < 5.0, the misclassification after the second split would still be 0.2. If we
were to continue to grow the tree until no data points are misclassified, we would need three splits if we
used the entropy criterion, whereas we would need five splits if we used the misclassification criterion
and started with the split at x2 < 5.0.
Another common splitting criterion is the Gini index

\[
  Q_m(T) = \sum_{k=1}^{K} \widehat{\pi}_{mk} (1 - \widehat{\pi}_{mk}). \qquad (4.14)
\]

Similar to the entropy criterion, the Gini index favors node purity more than the misclassification rate does.
If we consider two classes where r is the proportion in the second class, the three criteria are

Misclassification rate: Qm (T ) = 1 − max(r, 1 − r)


Gini index: Qm (T ) = 2r(1 − r)
Entropy/deviance: Qm (T ) = −r log r − (1 − r) log(1 − r)

These functions are shown in Figure 4.3. We can see that the entropy and Gini index are quite similar.



Figure 4.3: Three splitting criteria for classification trees as a function of the proportion in class 2. The entropy
criterion has been scaled such that it passes through (0.5, 0.5).
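
For two classes, the three criteria can be written as small Python functions of the class-2 proportion r; this is only a sketch for reproducing the curves in Figure 4.3 (the 0 log 0 convention is handled explicitly, and the entropy is scaled as in the figure).

    import numpy as np

    def misclassification_rate(r):
        return 1 - np.maximum(r, 1 - r)

    def gini_index(r):
        return 2 * r * (1 - r)

    def entropy(r):
        # 0 log 0 is taken to be zero at the endpoints r = 0 and r = 1
        r = np.asarray(r, dtype=float)
        q = np.zeros_like(r)
        mask = (r > 0) & (r < 1)
        rm = r[mask]
        q[mask] = -rm * np.log(rm) - (1 - rm) * np.log(1 - rm)
        return q

    r = np.linspace(0, 1, 101)
    scaled_entropy = entropy(r) * 0.5 / np.log(2)   # scaled to pass through (0.5, 0.5)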

4.2.4 Regression trees


Trees can also be used for regression. Instead of the aforementioned splitting criteria we typically use the
mean squared error as splitting criterion,
\[
  Q_m(T) = \frac{1}{n_m} \sum_{i \in R_m} (y_i - \widehat{y}_m)^2, \quad \text{where } \widehat{y}_m = \frac{1}{n_m} \sum_{i \in R_m} y_i. \qquad (4.15)
\]

This splitting criterion can also be motivated from a maximum likelihood point of view, as we did for the
classification trees. At prediction time, the mean ŷm in each region is used for prediction. In all other
respects, the procedure to train a regression tree is the same as that for training a classification tree, as explained
above.
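
The corresponding split search for a regression tree only replaces the entropy by the squared-error criterion (4.15); a minimal sketch (with our own helper names):

    import numpy as np

    def region_cost(y):
        # n_m * Q_m(T) for one region: sum of squared deviations from the region mean
        return np.sum((y - np.mean(y)) ** 2)

    def split_cost(x_j, y, s):
        # total cost n_1 Q_1(T) + n_2 Q_2(T) of splitting on x_j <= s versus x_j > s
        # (assumes both regions are non-empty, as when s lies between observed values)
        left = x_j <= s
        return region_cost(y[left]) + region_cost(y[~left])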

5 How well does a method perform?
So far we have studied different methods of how to learn models by adapting them to training data. We
hope that the models thereby will give us good predictions also when faced with new, previously unseen,
data. But can we really expect that to work? This may first sound like a trivial question, but on second
thought it is perhaps not (so) obvious anymore, and we will give it the attention it deserves in this chapter.
By doing so, we will find some interesting concepts, which will give us practical tools for evaluating and
understanding supervised machine learning methods better.

5.1 Expected new data error Enew : performance in production


We start by introducing some concepts and notation. First, we define an error function E(ŷ, y) encoding
the purpose of classification or regression. The error function compares a prediction ŷ to a measured
data point y, and returns a small value (possibly zero) if ŷ is a good prediction of y, and a larger value
otherwise: The worse the prediction, the larger the value. There are many different error functions that
could be considered, but our choices in this chapter will be:
\[
  E(\widehat{y}, y) \triangleq \begin{cases} 0 & \text{if } \widehat{y} = y \\ 1 & \text{if } \widehat{y} \neq y \end{cases} \qquad \text{(classification)} \qquad (5.1a)
\]
\[
  E(\widehat{y}, y) \triangleq (\widehat{y} - y)^2 \qquad \text{(regression)} \qquad (5.1b)
\]

The error function E(b y , y) has similarities to a loss function. However, they are used differently: A loss
function is used to train (or learn) a model, whereas the error function is used to analyze performance of
an already trained model.
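
Written as code, the two error functions in (5.1) are simply (a sketch; the function names are ours):

    def misclassification_error(y_hat, y):
        # E(y_hat, y) for classification, (5.1a)
        return 0.0 if y_hat == y else 1.0

    def squared_error(y_hat, y):
        # E(y_hat, y) for regression, (5.1b)
        return (y_hat - y) ** 2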
In the end, machine learning is mostly about how a method performs when faced with an
endless stream of new, unseen data. Imagine for example all real-time recordings of street views that
have to be processed by a vision system in a self-driving car once it is sold to a customer, or all new
patients that have to be classified by a medical diagnosis system. The performance on fresh unseen data
can in mathematical terms be understood as the average of the error function—how often the classifier is
right, or how well the regression method predicts. To be able to mathematically describe the endless
stream of new data, we introduce a distribution over data p(x, y). In the previous chapters, we have
mostly considered the output y as a random variable whereas the inputs x have been thought of as fixed.
Now, we have to think of also the input x as a random variable with a certain probability distribution. In
any real-world machine learning scenario p(x, y) can be extremely complicated and really hard or even
impossible to model. That is, however, not a problem since we will only use p(x, y) to reason about
machine learning methods, and the bare notion of p(x, y) will be helpful for that.


Remark 5.1 In Section 3.4 about Bayes’ classifier, we made a hypothetical argument about the optimal
classifier, if we had access to p(y | x) (which we usually do not have). The arguments in this section are
made from an even more hypothetical point of view, assuming that we do know not only p(y | x), but also
p(x), since p(x, y) = p(y | x)p(x). While this is often an unrealistic assumption, the reasoning will lead
us to useful insights about when and how we can expect machine learning to work.

Irrespective of which classification or regression method we are considering, once the model has been
trained on training data T = {xi , yi }ni=1 , it will provide us with predictions yb? for any new input x? we
give to it. We will in this chapter write yb(x; T ) as a function of x and T , like yb? , yb(x? ; T ), to indicate
that the prediction (via the model) depends both on the value of the test input x? and on the training data
T used to train the model.
In the previous chapters, we have mostly discussed how a model predicts one, or a few, test inputs x? .
Let us now take that to the next level, by integrating the error function (5.1) over all possible test data
points (rather than considering only one or a few) with respect to the distribution p(x, y). We refer to this
as the expected new data error

\[
  E_\text{new} \triangleq \mathbb{E}_\star \left[ E(\widehat{y}(x_\star; T), y_\star) \right], \qquad (5.2)
\]

where the expectation E? is the expectation over all possible test data points with respect to the distribution
(x? , y? ) ∼ p(x, y), that is,
\[
  \mathbb{E}_\star \left[ E(\widehat{y}(x_\star; T), y_\star) \right] = \int E(\widehat{y}(x_\star; T), y_\star)\, p(x_\star, y_\star)\, dx_\star\, dy_\star. \qquad (5.3)
\]

We emphasize that the model (no matter whether it is linear regression, a decision tree, a neural network
or something else) is trained on a given training data set T and represented by yb(·; T ). What is happening
in equation (5.2) is an averaging over possible test data points (x? , y? ). Thus, Enew describes how well the
model generalizes from the training data T to new situations.
The expected new data error Enew tells us how well a method performs when we put it into production;
what proportion of predictions a classifier will get right, and how well a regression method will predict in
terms of average squared error. Or, in a more applied setting, what rate of false and missed detections of
pedestrians we can expect a vision system in a self-driving car to make, or how big a proportion of all
future patients a medical diagnosis system will get wrong.

The overall goal in supervised machine learning is to achieve as small Enew as possible.

Unfortunately, in practical cases we can never compute Enew to assess how well we are doing. The reason
is that p(x, y)—which we do not know in practice—is part of the definition of Enew . It seems, however, to
be a too important construction to be abandoned, just because we cannot compute it. We will instead
spend the remaining parts of this chapter trying to estimate Enew (essentially by replacing the integral with
a sum), and also understanding how Enew behaves, to better understand how we can decrease it.

Remark 5.2 Note that Enew is a property of a trained model and a specific machine learning problem.
Thus, we cannot talk about “Enew for QDA” in general, but instead we have to make more specific
statements, like “Enew for QDA on handwritten digit recognition, when QDA is trained with the MNIST
data1 ”.

1
http://yann.lecun.com/exdb/mnist/


5.2 Estimating Enew


There are multiple reasons for a machine learning engineer to be interested in Enew , such as:

• judging if the performance is satisfying (whether Enew is small enough), or if more work should be
put into the solution and/or more training data should be collected
• choosing between different methods
• choosing hyperparameters (such as k in k-NN, the regularization parameter in ridge regression or
the number of hidden layers in deep learning)
• reporting the expected performance to the customer

As discussed above, we can unfortunately not compute Enew in any practical situation. We will therefore
explore some possibilities to estimate Enew , which eventually will lead us to a very useful concept known
as cross-validation.

5.2.1 Etrain ≉ Enew : We cannot estimate Enew from training data


Let us start with defining the average training error,

\[
  E_\text{train} \triangleq \frac{1}{n} \sum_{i=1}^{n} E(\widehat{y}(x_i; T), y_i), \qquad (5.4)
\]

where {xi , yi }ni=1 is the training data T . Etrain simply describes how well a method performs on the
training data on which it was trained. In contrast to Enew , we can always compute Etrain .
We usually assume that T consists of samples from p(x, y). This assumption means that the training
data is collected under similar circumstances as the ones the learned model will be used under, which
seems reasonable. (If it were not, we would have very little reason to believe the training data would tell
us anything useful.) When an integral is hard to compute, it can be numerically approximated with a sum
(see details in Appendix A.2). Now, the question is if the integral in Enew can be well approximated by the
sum in Etrain , like
\[
  E_\text{new} = \int E(\widehat{y}(x; T), y)\, p(x, y)\, dx\, dy \overset{??}{\approx} \frac{1}{n} \sum_{i=1}^{n} E(\widehat{y}(x_i; T), y_i) = E_\text{train}. \qquad (5.5)
\]

Or, put differently: Can we expect a method to perform equally well (or badly) when faced with new,
previously unseen, data, as it did on the training data?
The answer is, unfortunately, no.

Time to reflect 5.1: Why can we not expect the performance on training data (Etrain ) to be a good
approximation for how a method will perform on new, previously unseen data (Enew )?

Equation (5.5) does not hold, and the reason is that the training data are not just any data points: ŷ
depends on them, since they were used to train the model. We can therefore not expect (5.5) to hold.
(Technically, the conditions in Appendix A.2 are not fulfilled, since ŷ depends on T.)
As we will discuss more thoroughly later in Section 5.3.1, the average behavior of Etrain and Enew is, in
fact, typically Etrain < Enew . That means that a method usually performs worse on new, unseen data, than
on training data. The performance on training data is therefore not a good measure of Enew .

5.2.2 Etest ≈ Enew : We can estimate Enew from test data


We could not use the “replace-the-integral-with-a-finite-sum” trick to estimate Enew by Etrain , due to the
fact that it effectively meant using the training data twice: first, to train the model (ŷ in (5.4)) and second,



Figure 5.1: The test data set approach: If we split the available data in two sets and train the model on the training
set, we can compute Etest using the test set. The more data that are in the test data set, the less variance (better
estimate) in Etest , but the less data left for training the model. The split here is only pictorial, in practice one should
always split the data randomly.

to evaluate the error function (the sum in (5.4)). A remedy is to set aside some test data {xj , yj }mj=1 ,
which are not used for training, and then use the test data only for estimating the model performance
\[
  E_\text{test} \triangleq \frac{1}{m} \sum_{j=1}^{m} E(\widehat{y}(x_j; T), y_j). \qquad (5.6)
\]

In this way, not all data will be used for training, but some data points (the test data) will be saved and
used only for computing Etest . This is illustrated by Figure 5.1.
Be aware! If you are splitting your data into a training and test set, always do it randomly! Someone
might—intentionally or unintentionally—have sorted the data set for you. If you do not split randomly,
you might end up having only one class in your training data, and another class in your test data . . .
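
A minimal sketch of the test data set approach in Figure 5.1: permute the data randomly, hold out a fraction as test data, and compute Etest according to (5.6). The generic train, predict and error functions are placeholders for whichever method and error function (5.1) are used; they are not defined in the text.

    import numpy as np

    def holdout_split(X, y, test_fraction=0.3, seed=0):
        # random permutation before splitting, as stressed above
        rng = np.random.default_rng(seed)
        perm = rng.permutation(len(y))
        n_test = int(round(test_fraction * len(y)))
        test, train_idx = perm[:n_test], perm[n_test:]
        return X[train_idx], y[train_idx], X[test], y[test]

    # X_tr, y_tr, X_te, y_te = holdout_split(X, y)
    # model = train(X_tr, y_tr)                            # placeholder training routine
    # E_test = np.mean(error(predict(model, X_te), y_te))  # (5.6)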
As long as m ≥ 1, it can be shown that Etest is an unbiased estimate of Enew (meaning that if the entire
procedure is repeated multiple times, the average value of Etest would be Enew ). That is reassuring, but
it does not tell us how close Etest will be to Enew in a single experiment. However, the variance of Etest
decreases when the size of test data m increases; a small variance of Etest means that we can expect it to
be close to Enew. Thus, if we make the test data set big enough, Etest will be close to Enew. On the other
hand, the amount of data available is usually limited in real machine learning problems, meaning that the more
data points we put into the test set, the fewer data points are left for training. Typically the more training
data, the smaller Enew (which we will discuss later in Section 5.3). Achieving a small Enew is our ultimate
goal. We are therefore faced with the following dilemma: the better we want to know Enew (more test data
gives less variance in Etest ), the worse we have to make it (less training data increases Enew ). That is not
very satisfying.
One could suggest the following two-step procedure, to circumvent the situation:
(i) Split the available data in one training and one test set, train the model on the training data and
compute Etest using test data (as in Figure 5.1).
(ii) Train the model again, this time using the entire data set.
By such a procedure, we get both a value of Etest and a model trained using the entire data set. That is not
bad, but not perfect either. Why? To achieve small variance in Etest , we have to put lots of data in the test
data set. That means the model trained in step (i) will quite possibly be very different from the model
trained in step (ii). And Etest from step (i) is an estimate of Enew for the model from step (i), not the model
in step (ii). Consequently, if we use Etest from step (i) to, e.g., select a hyperparameter in step (i), we
cannot be sure it was a wise choice for the model trained in step (ii). This procedure is, however, not very far
from cross-validation that we will present next.

5.2.3 Cross-validation: Eval ≈ Enew without setting aside test data


We would like to use all available data to train a model, and at the same time have a good estimate of Enew
for that model. After reading the previous section, that might appear as an impossible request. There is,
however, a clever solution called cross-validation.


[Figure: schematic of c-fold cross-validation. The data is split into batches 1, . . . , c; in iteration ℓ, batch ℓ is held out as validation data and the remaining batches are used as training data, giving the errors E_val^(1), . . . , E_val^(c), whose average is E_val.]

Figure 5.2: Illustration of c-fold cross-validation. The data is split in c batches of similar sizes. When looping over
ℓ = 1, 2, . . . , c, batch ℓ is held out as validation data, and the model is trained on the remaining c − 1 data batches.
Each time, the trained model is used to compute the average error E_val^(ℓ) for the validation data. The final model is
trained using all available data, and the estimate of Enew for that model is E_val, the average of all E_val^(ℓ).

The idea of cross-validation is simply to repeat the test data set approach (using a small test data set)
multiple times with a different test data set each time, in the following way:

(i) split the data set in c batches of similar size (see Figure 5.2), and let ℓ = 1

(ii) take batch ℓ as validation data, and the remaining batches as training data

(iii) train the model on the training data, and compute E_val^(ℓ) as the average error on the validation data
(analogously to (5.6))

(iv) if ℓ < c, set ℓ ← ℓ + 1 and return to (ii). If ℓ = c, compute
\[
  E_\text{val} \triangleq \frac{1}{c} \sum_{\ell=1}^{c} E_\text{val}^{(\ell)} \qquad (5.7)
\]

(v) train the model again, this time using all available data points

More precisely, this procedure is known as c-fold cross-validation, and illustrated in Figure 5.2.
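
Steps (i)–(v) can be written compactly as follows; this is only a sketch, where train, predict and error are placeholders for the chosen method and error function (5.1).

    import numpy as np

    def cross_validation(X, y, c=10, seed=0):
        rng = np.random.default_rng(seed)
        batches = np.array_split(rng.permutation(len(y)), c)   # (i) random split into c batches
        E_val = 0.0
        for ell in range(c):                                   # (ii)-(iv)
            val = batches[ell]
            tr = np.concatenate([batches[m] for m in range(c) if m != ell])
            model = train(X[tr], y[tr])                        # (iii) train on the other c-1 batches
            E_val += np.mean(error(predict(model, X[val]), y[val])) / c   # builds up (5.7)
        final_model = train(X, y)                              # (v) train on all available data
        return final_model, E_val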
With c-fold cross-validation, we get a model which is trained on all data, as well as an approximation of
Enew for that model, namely Eval. Whereas Etest (Section 5.2.2) was an unbiased estimate of Enew (at
the cost of setting aside test data), Eval is only approximately unbiased. However, with c large enough, it
turns out to often be a sufficiently good approximation, and is commonly used in practice. Let us try to
understand how c-fold cross-validation works.
First, we have to distinguish between the final model, which is trained on all data in step (v), and the
intermediate models which are trained on all except a 1/c fraction of the data in step (iii). The key in


c-fold cross-validation is that if c is large enough, the intermediate models are quite similar to the final
model (since they are trained on almost the same data set, only a fraction 1/c of the data is missing).
Furthermore, each intermediate E_val^(ℓ) is by construction an unbiased estimate of Enew for its corresponding
intermediate model. Since the intermediate and final models are similar, E_val^(ℓ) is approximately also
an unbiased estimate of Enew for the final model. Since the validation sets are small (only 1/c of all
available data), the variance of E_val^(ℓ) is high, but when averaging over multiple high-variance estimates
E_val^(1), E_val^(2), . . . , E_val^(c), the final estimate E_val (5.7) does not suffer from high variance.
We usually talk about training (or learning) as something that is done once. However, in c-fold
cross-validation the training is repeated c (or even c + 1) times. A common value for c is 10, but you
may of course try different values. For methods such as linear regression, the actual training (solving
the normal equations) is usually done within milliseconds on modern computers, and doing it an extra c
times is usually not really a problem. If one is working with computationally heavy methods, such as
certain deep neural networks, it is perhaps less appealing to increase the computational load by a factor of
c + 1. However, if one wants to get a good estimate of Enew , some version of cross-validation is most
often needed.
We now have a method for estimating Enew for a model trained on all available training data. A typical
use of cross-validation is to select different types of hyperparameters, such as k in k-NN or a regularization
parameter.

Be aware! For the same reason as with the test data approach, it is important to always split the data
randomly for cross-validation to work! A simple solution is to first randomly permute the entire data set,
and thereafter split it into batches.


5.3 Understanding Enew


Cross-validation is an important and powerful tool for estimating Enew . Designing a method with small
Enew is the goal in supervised machine learning, and it is therefore useful to also have a tool to estimate
how we are doing. However, more can be said to also understand what affects Enew . To be able to reason
about Enew , we have to introduce another abstraction level, namely the training-data averaged versions of
Enew and Etrain ,

\[
  \bar{E}_\text{new} \triangleq \mathbb{E}_T [E_\text{new}], \qquad (5.8a)
\]
\[
  \bar{E}_\text{train} \triangleq \mathbb{E}_T [E_\text{train}]. \qquad (5.8b)
\]

Here, ET denotes the expected value when the training data set T = {xi , yi }ni=1 (of a fixed size n) is
drawn from p(x, y). Thus Ēnew is the average Enew if we were to train the model multiple times on different
training data sets, and similarly for Ētrain . The point of introducing these, as it turns out, is that we can
say more about the average behavior of Enew and Etrain , than we can say about Enew and Etrain when the
model is trained on one specific training data set T . Even though we in practical problems very seldom
encounter Ēnew (the training data is usually fixed), the insights we gain from studying Ēnew are still useful.

5.3.1 Enew = Etrain + generalization error


We have already discussed the fact that Etrain cannot be used in estimating Enew . In fact, it usually holds
that

Ētrain < Ēnew , (5.9)

Put in words, this means that on average, a method usually performs worse on new, unseen data than on
training data. A method's ability to perform well on unseen data after being trained on training data can be
understood as the method's ability to generalize from training data. The difference between Enew and
Etrain is accordingly called the generalization error2, defined as

\[
  \text{generalization error} \triangleq E_\text{new} - E_\text{train}. \qquad (5.10)
\]

The generalization error thereby gives a connection between the performance on training data and the
performance ‘in production’ on new, previously unseen data. It can therefore be interesting to understand
how big (or small) the generalization error is.

Generalization error and model complexity

The size of the generalization error depends on the method and the problem. Concerning the method, one
can typically say that the more the model has adapted to the training data, the larger the generalization
error. A theoretical study of how much a model adapts to training data can be done using the so-called
VC dimension, eventually leading to probabilistic bounds on the generalization error. Unfortunately those
bounds are usually rather conservative, and we will not pursue that formal approach any further.3 Instead,
we only use the vague term model complexity, by which we mean the ability of a method to adapt to
complicated patterns in the training data, and reason about what we see in practice. A model with high
complexity (such as a neural network) can describe very complicated relationships, whereas a model with
low complexity (such as LDA) is less flexible in what functions it can describe. For parametric methods,
the model complexity is related to the number of parameters that are trained. Flexible non-parametric
methods (such as trees with many leaf nodes or k-NN with small k) have higher model complexity than
parametric methods with few parameters, etc. Techniques such as regularization, early stopping and
dropout (for neural networks) effectively decrease the model complexity.
2
Sometimes Enew is called generalization error; not in this text. In our terminology we do not distinguish between the
generalization error for a model trained on a certain training data set, and its training-data averaged counterpart.
3
If you are interested, a good book is Abu-Mostafa, Magdon-Ismail, and Lin 2012.


[Figure: sketch of Ētrain and Ēnew as functions of model complexity, with the gap between the curves being the generalization error and the underfit and overfit regions marked.]

Figure 5.3: Behavior of Ētrain and Ēnew for many supervised machine learning methods, as a function of model
complexity. We have not made a formal definition of complexity, but a rough proxy is the number of parameters that
are learned from the data. The difference between the two curves is the generalization error. In general, one can
expect Ētrain to decrease as the model complexity increases, whereas Ēnew typically has a U-shape. If the model
is so complex that Ēnew is larger than it had been with a less complex model, the term overfit is commonly used.
Somewhat less common is the term underfit used for the opposite situation. The level of model complexity which
gives the minimum Ēnew (at the dotted line) would in a consistent terminology perhaps be called a balanced fit. A
method with a balanced fit is usually desirable, but often hard to find, since we know neither Ēnew nor Enew in
practice.

Typically, higher model complexity implies larger generalization error. Furthermore, Ētrain usually
decreases as the model complexity increases, whereas Ēnew attains a minimum for some intermediate
model complexity value: too small and too high model complexity both raise Ēnew. This is illustrated
in Figure 5.3. The region where Ēnew is larger than its minimum due to too high model complexity is
commonly referred to as overfit. The other region (where Ēnew is larger than its minimum due to too small
model complexity) is sometimes referred to as underfit. In a consistent terminology, the point where Ēnew
attains its minimum could be referred to as a balanced fit. Since the goal is to minimize Ēnew, we are
interested in finding this point. We also illustrate this by Example 5.1.

Remark 5.3 This and the next section discuss the usual behavior of Ēnew , Ētrain and the generalization
error. We use the term ‘usually’ because there are so many supervised machine learning methods and
problems that it is almost impossible to make any claim that is always true for all possible situations.
Pathological counter-examples may exist. One should also keep in mind that claims about Ētrain and Ēnew
are about the average behavior, which hopefully is clear in Example 5.1.


Example 5.1: Etest and Enew in a simulated example

We consider a simulated binary classification example with two-dimensional inputs x. In
contrast to all real-world machine learning problems, in a simulated problem like this we can
actually compute Enew , since we do know p(x, y) (otherwise we could not make the simulation).
In this example, p(x) is a uniform distribution on the square [−1, 1]2 , and p(y | x) is defined as
follows: all points above the dotted curve in the figure below are green with probability 0.9, and
points below the curve are red with probability 0.9. (The dotted line is also the decision boundary
for Bayes’ classifier. Why? And what would Enew be for Bayes’ classifier?)
[Figure: the square [−1, 1]² in the (x1, x2) plane, with the dotted curve separating the two classes.]

We generate n = 100 samples as training data, and learn three classifiers: a logistic regression
classifier, a QDA classifier and a k-NN classifier with k = 2. If we are to rank these methods
in model complexity order, logistic regression is simpler than QDA (logistic regression is a
linear classifier, whereas QDA is more general), and QDA is simpler than k-NN (since k-NN is
non-parametric and can have rather complicated decision boundaries). We plot their decision
boundaries, together with the training data:
[Figure: decision boundaries of the three classifiers (logistic regression, QDA and k-NN with k = 2), plotted together with the training data.]

For each of these three classifiers, we can compute Etrain by simply counting the fraction of
training data points that are on the wrong side of the decision boundary. From left to right, we
get Etrain = 0.17, 0.16, 0.11. Since we are in a simulated example, we can also access Enew
(or rather estimate it numerically by simulating a lot of test data), and from left to right we get
Enew = 0.22, 0.15, 0.24. This pattern resembles Figure 5.3, except for the fact that Enew is smaller
than Etrain for QDA. Is this unexpected? Not really: what we have discussed in the main text is the
average Ēnew and Ētrain , not the situation with Enew and Etrain for one particular set of training
data. We therefore repeat this experiment 100 times, and compute the average Ēnew and Ētrain over
those 100 experiments:
Logistic regression QDA k-NN with k = 2
Ētrain 0.17 0.14 0.10
Ēnew 0.18 0.15 0.19
This follows Figure 5.3 well: The generalization error (difference between Ēnew and Ētrain ) is
positive and increases with model complexity, Ētrain decreases with model complexity, and Ēnew
has its minimum for QDA. This suggests that k-NN with k = 2 suffers from overfitting for this
problem, whereas logistic regression is a case of underfitting.


[Figure: Ētrain and Ēnew as functions of the training data size n, for (a) a simple model and (b) a complex model.]

Figure 5.4: Typical relationship between Ēnew , Ētrain and the number of data points n in the training data set. The
generalization error (difference between Ēnew and Ētrain ) decreases, at the same time as Ētrain increases. Typically, a
more complex model (right panel) will for large enough n attain a smaller Ēnew than a simpler model (left panel)
would on the same problem (the axes of the figures are comparable). However, the generalization error is typically
larger for a more complex model, in particular when there is little training data n.

Generalization error and size n of training data

The previous section and Figure 5.3 are concerned with the relationship between Ēnew, Ētrain, the
generalization error (their difference) and the model complexity. Yet another important aspect is the
size of the training data set, n. Intuitively, one may expect that the more training data, the better the
possibilities to learn how to generalize. Yet again, we do not make a formal derivation, but we can in
general expect that the more training data, the smaller the generalization error. On the other hand, Ētrain
typically increases as n increases, since most models are not able to fit all training data perfectly if there
are too many of them. A typical behavior of Ētrain and Ēnew is sketched in Figure 5.4.

5.3.2 Enew = bias2 + variance + irreducible error

We will now introduce another decomposition of Enew into the terms known as bias and variance (which
we can affect by our choice of method) as well as an unavoidable component of irreducible noise. This
decomposition is most natural in the regression setting, but the intuition carries over also to classification.
We first make the assumption that the true relationship between input and output can be described as
some (possibly very complicated) function f (x) plus independent noise ε,

y = f (x) + ε, with E [ε] = 0 and var(ε) = σ 2 . (5.11)

Since we have made no restriction on f, this is not a very restrictive assumption, but we can expect it to
describe reality well.
In our notation, yb(x; T ) represents the model when it is trained on training data T . We now also
introduce the average trained model

g(x) , ET [b
y (x; T )] . (5.12)

As before, ET denotes the expected value over training data drawn from p(x, y). Thus, g(x) is the
(hypothetical) average model we would achieve, if we could marginalize all random effects associated
with the training data.


[Figure: the bias², variance and irreducible error contributions to Ēnew as functions of model complexity, with the underfit and overfit regions marked.]

Figure 5.5: The bias-variance decomposition of Ēnew (cf. Figure 5.3). The bias typically decreases with model
complexity; the more complicated the model is, the less systematic errors in the predictions. The variance, on the
other hand, typically increases as the model complexity grows; the more complex the model is, the more it will adapt
to peculiarities that by chance happened to occur in the particular training data set that was used. The irreducible
error is always constant. In order to achieve a small Enew, one has to trade between bias and variance (for example
by using another model, or using regularization as in Example 5.2) in order to avoid over- and underfitting.

We are now ready to rewrite Ēnew , the average expected new data error, as
\begin{align*}
  \bar{E}_\text{new} &= \mathbb{E}_T\!\left[\mathbb{E}_\star\!\left[(\widehat{y}(x_\star; T) - y_\star)^2\right]\right] = \mathbb{E}_\star\!\left[\mathbb{E}_T\!\left[(\widehat{y}(x_\star; T) - f(x_\star) - \varepsilon)^2\right]\right] \\
  &= \mathbb{E}_\star\!\left[\mathbb{E}_T\!\left[(\widehat{y}(x_\star; T))^2\right] - 2\,\mathbb{E}_T\!\left[\widehat{y}(x_\star; T)\right] f(x_\star) + f(x_\star)^2 + \sigma^2\right] \\
  &= \mathbb{E}_\star\Big[\underbrace{\mathbb{E}_T[(\widehat{y}(x_\star; T))^2] - g(x_\star)^2}_{\mathbb{E}_T[(\widehat{y}(x_\star; T) - g(x_\star))^2]} + \underbrace{g(x_\star)^2 - 2 g(x_\star) f(x_\star) + f(x_\star)^2}_{(g(x_\star) - f(x_\star))^2} + \sigma^2\Big]. \qquad (5.13)
\end{align*}

Here, we used the fact that ε is independent of x, as well as ET [g(x)] = g(x).


The term (g(x⋆) − f(x⋆))² describes in a sense how much the model—if it could be ‘perfectly trained’
with an infinite amount of training data—differs from the true f(x⋆). Hence, we will refer to this term
as the bias² (read "squared bias").
The other term, E_T[(ŷ(x⋆; T) − g(x⋆))²], captures how much the model ŷ(x; T) varies each time it is
trained on a new training data set. If this term is small, the trained model is not very sensitive to exactly
which data points happened to be in the training data, and vice versa. We return to (5.13),
\[
  \bar{E}_\text{new} = \underbrace{\mathbb{E}_\star\!\left[(g(x_\star) - f(x_\star))^2\right]}_{\text{Bias}^2} + \underbrace{\mathbb{E}_\star\!\left[\mathbb{E}_T\!\left[(\widehat{y}(x_\star; T) - g(x_\star))^2\right]\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\substack{\text{Irreducible} \\ \text{error}}}. \qquad (5.14)
\]

The irreducible error is simply an effect of the assumed intrinsic stochasticity of the problem – it is not
possible to predict ε since it is truly random. We will hence leave the irreducible error as it is, and focus
on the bias and variance terms to further understand how Ēnew is affected by our choice of method; there
are interesting situations where one can decrease Ēnew by trading bias for variance or vice versa.
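
In a simulated setting, where we can draw as many training data sets as we like, the bias and variance terms in (5.14) can be estimated numerically. A sketch, assuming a known true function f and placeholder functions simulate_training_data, train and predict for the problem at hand:

    import numpy as np

    def estimate_bias_variance(f, x_test, n_repeats=100):
        preds = np.empty((n_repeats, len(x_test)))
        for r in range(n_repeats):
            X_tr, y_tr = simulate_training_data()          # a fresh training set T each repetition
            preds[r] = predict(train(X_tr, y_tr), x_test)
        g = preds.mean(axis=0)                              # estimate of the average model g(x), (5.12)
        bias2 = np.mean((g - f(x_test)) ** 2)               # E_star[(g(x) - f(x))^2]
        variance = np.mean(preds.var(axis=0))               # E_star[E_T[(yhat(x;T) - g(x))^2]]
        return bias2, variance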


The bias-variance trade-off and its relation to model complexity

We continue with the (never properly defined) notion of model complexity. High model complexity means
that the model is able to express more complicated functions, implying that the bias term is small. On
the other hand, the more complex a model is, the more it will adapt to training data T —not only to the
interesting patterns, but also to the actual data points and noise that happened to be in that realization of
the training data. Had there been another realization of T , the trained model could have looked (very)
differently. Exactly such ‘sensitivity’ to training data is described by the variance term. In summary, if
everything else is fixed and only the model complexity increases, the variance also increases, but the bias
decreases. The optimal model complexity (smallest Enew ) is therefore usually ‘somewhere in the middle’,
where the model has a good trade-off between bias and variance. This is illustrated by Figure 5.5.
One should remember that the model complexity is not simply the number of parameters in the model,
but rather a measure of how much the model adapts to complicated patterns in the training data. We
introduced regularization in Section 2.6 as a method to counteract overfit, or effectively decreasing the
model complexity, without changing the number of parameters in the model. Regularization therefore
gives a tool for changing the model complexity in a continuous fashion, which opens up for fine-tuning of
this bias-variance trade-off. This is further explored in Example 5.2.
Example 5.2: Regularization—trading bias for variance

Let us consider a simulated regression example. We let p(x) and p(y | x) be defined as x ∼ U[0, 1]
and
y = 5 − 2x + x3 + ε, ε ∼ N (0, 1) .
We let the training data consist of only n = 10 samples. We now try to model the data using linear
regression with a 4th order polynomial, as

y = β0 + β1 x + β2 x2 + β3 x3 + β4 x4 + ε,

where we assume ε to have a Gaussian distribution (which happens to be true in this example),
so that we will end up with the normal equations. Since the model contains the true model and
least squares would not introduce any systematic errors, the bias term (5.14) would be exactly
zero. However, learning 5 parameters from only 10 data points will almost inevitably lead to very
high variance and overfit, so we decide to train the model with a regularized method, namely
Ridge Regression. Using regularization means that we will trade the unbiasedness (regularization
introduces systematic bias in how the model is trained) for smaller variance. Two examples of
what it could look like, for different regularization parameters, are shown:

[Figure: two panels, γ = 10 (left) and γ = 0.001 (right), each showing the data points, the trained model and the true model.]

The dots are the n = 10 data points, the solid line is the trained model and the dashed line is
the true model. In the case with γ = 0.001, the plot suggests overfit, whereas γ = 10 seems to
be a case of underfit. It is clear how regularization affects the model complexity: with a small
regularization (in this case γ = 0.001), the model is prone to adapt to the noise in the training data.


The effect would be even more severe with no regularization (γ = 0). Heavier regularization (in
this case γ = 10) effectively prevents the model from adapting well to the training data (it pushes
the parameters, including β0 , towards 0).
Let us understand this in terms of bias and variance. In the low-regularized case, the trained
model (solid line) will look very different each time, depending on what x-values and noise happen
to be in the training data: a high variance. However, if one would repeat the experiment many
times with different training data, the average model will probably be relatively close to the true
model: a low bias. The completely opposite situation is found in the highly regularized case: the
variance is low (the model will be quite similar each time, no matter what realization of the training
data it is trained on), and the bias is high (the predictions from the model will systematically be
closer to zero than the true model).
Since we are in a simulated environment, we can repeat the experiment multiple times, and
thereby compute the bias and variance terms (or rather numerically estimate them, since we can
simulate as much training and test data as we want). We plot them in the very same style as
Figures 5.3 and 5.5 (note the reversed x-axis: a smaller regularization parameter corresponds to a
higher model complexity). For this problem, the optimal value of γ would have been about 0.7
(since Ēnew attains its minimum there).

[Figure: Ēnew, Ētrain, bias², variance and the irreducible error as functions of the regularization parameter γ, from γ = 10¹ down to γ = 10⁻³.]

If this had been a real problem with a fixed data set, we could of course not have made this plot.
Instead, one would have to rely on cross-validation for estimating Enew for that particular data set
(and not its average Ēnew ).


The bias-variance trade-off and its relation to the size n of training data

In the first place, the bias term is a property of the model rather than of the training data set, and we may
think4 of the bias term as independent of the number of data points n in the training data. The variance
term, on the other hand, varies highly with n. As we know, Ēnew typically decreases as n increases, and
essentially the entire decline in Ēnew is because of the decline in the variance. Intuitively, the more data,
the more information about the parameters, meaning less variance. This is summarized by Figure 5.6.

[Figure: bias², variance and irreducible error as functions of the training data size n, for (a) a simple model and (b) a complex model.]

Figure 5.6: The typical relationship between bias, variance and the size n of the training data set (cf. Figure 5.4).
The bias is (approximately) constant, whereas the variance decreases as the size of the training data set increases.

4
Indeed, the average model g might be different if we are averaging over an infinite number of models each trained with n = 2
or n = 100 000 data points. That effect is, however, conceptually not very interesting here, and we will not treat it further.

6 Ensemble methods

In the previous chapters we have introduced some fundamental methods for machine learning. In this
chapter we will introduce techniques of a slightly different flavor, referred to as ensemble methods. These
methods are based on the idea of combining the predictions from many so called base models. They can
therefore be seen as a type of meta-algorithms, in the sense that they are methods composed of other
methods.
We start in Section 6.1 by introducing a general technique referred to as bootstrap aggregating, or
bagging for short. The idea behind bagging is to train multiple models of the same type in parallel, but on
slightly different “versions” of the training data. By averaging the predictions of the resulting ensemble of
models it is possible to reduce the variance compared to using only a single model. This idea is extended
in Section 6.2, resulting in a powerful off-the-shelf method called random forests. Random forests make
use of classification or regression trees as base models. Each tree is randomly perturbed in a certain way
which opens up for additional variance reduction. Finally, in Section 6.3 we derive an alternative ensemble
method known as boosting. Boosting is different from bagging and random forests, since its base models
are learned sequentially, one after the other, so that each model tries to correct for the mistakes made by
the previous ones. By taking a weighted average of the predictions made by the base models, it is possible
to turn the ensemble of “weak” models into one “strong” model.

6.1 Bagging
As discussed in the previous chapter, a central concept in machine learning is the bias–variance trade-off.
Roughly speaking, the more flexible a model is, the lower its bias will be. That is, a flexible model
is capable of representing complicated input–output relationships. This is of course beneficial if the
actual relationship between inputs and outputs is complicated, as is often the case in machine learning
applications. We have seen examples of highly flexible models, not least in Chapter 4 when we discussed
non-parametric models. For instance, k-NN with a small value of k, or a classification tree that is grown
deep, are capable of expressing very complicated decision boundaries. However, as previously discussed
there is a price to pay when using a very flexible model: fitting it based on a finite number of training data
points can result in over-fitting1 , or equivalently, high model variance.
Motivated by this issue, we can ask the following question:

Given access to a low-bias, high-variance method, is there a way to reduce the variance while
keeping the bias low?

In this section we will introduce a general technique for accomplishing this, referred to as bootstrap
aggregating, or bagging.

6.1.1 Variance reduction by averaging


To answer the question posed above our starting point will be a basic property of random variables, namely
that averaging reduces variance. To formalize this, let z1 , . . . , zB be a collection of identically distributed
(but possibly dependent) random variables with mean value E [zb ] = µ and variance Var[zb ] = σ 2 for
b = 1, . . . , B. Furthermore, assume that the average correlation2 between any pair of variables is ρ. Then,
1
Both a k-NN model with k = 1 and a classification tree with a single data point per leaf node will result in zero training error, typical for severe over-fitting.
2
That is, (1/(B(B−1))) Σ_{b≠c} E[(z_b − µ)(z_c − µ)] = ρσ².


computing the mean and the variance of the average of these variables we get
\[
  \mathbb{E}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right] = \mu, \qquad (6.1a)
\]
\[
  \operatorname{Var}\left[\frac{1}{B}\sum_{b=1}^{B} z_b\right] = \frac{1-\rho}{B}\,\sigma^2 + \rho\sigma^2. \qquad (6.1b)
\]

The first equation tells us that the mean is unaltered by averaging a number of identically distributed
random variables. Furthermore, the second equation tells us that the variance is reduced by averaging, as
long as the correlation ρ < 1. Indeed, the first term in the variance expression can be made arbitrarily
small by increasing B. We also note that the smaller ρ is, the larger the possible reduction in variance.
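
The variance formula (6.1b) is easy to verify numerically by drawing correlated, identically distributed variables; a sketch with made-up values of B, σ² and ρ:

    import numpy as np

    B, sigma2, rho = 10, 1.0, 0.3
    # covariance matrix with variance sigma2 on the diagonal and correlation rho off the diagonal
    Sigma = sigma2 * ((1 - rho) * np.eye(B) + rho * np.ones((B, B)))

    rng = np.random.default_rng(0)
    z = rng.multivariate_normal(np.zeros(B), Sigma, size=100_000)   # each row is one draw of (z_1, ..., z_B)
    print(z.mean(axis=1).var())                      # empirical variance of the average
    print((1 - rho) / B * sigma2 + rho * sigma2)     # the value predicted by (6.1b)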
To see how this relates to the question posed above, we note that the bias of a prediction model is
tightly connected to its mean value. Consequently, by averaging the predictions from several identically
distributed models, each with a low bias, the bias remains low (cf. (6.1a)) and the variance is reduced (cf.
(6.1b)). How then can we construct such a collection—or ensemble—of prediction models? To answer
this question we will start with an impractical assumption, which will later be relaxed, namely that we have
access to B independent training data sets3 , each of size n. Let these data sets be denoted by T 1 , . . . , T B .
We can then train a separate low-bias model, such as a deep classification or regression tree, on each
separate data set.
For concreteness we consider the regression setting. Each of the B regression models can be used
to predict the output for some test input x⋆. Since (by our impractical assumption) the B training data
sets are independent, this will result in B independent and identically distributed predictions ŷ⋆^1, . . . , ŷ⋆^B.
Taking the average of these predictions we get,
\[
  \widehat{y}_\star^{\text{avg}} = \frac{1}{B} \sum_{b=1}^{B} \widehat{y}_\star^{b}. \qquad (6.2)
\]

In the classification setting we would instead use a majority vote among the B ensemble members.
The average prediction in (6.2) has the same bias as each of the individual yb?b ’s, but its variance is
reduced by a factor B. Indeed, in this case we have ρ = 0 in the expression (6.1b) since the individual
data sets were assumed to be independent. However, as mentioned above this is not a realistic assumption,
so in order to use this variance reduction technique in practice we need to do something different. To
this end we will make use of a trick known as data bootstrapping.

6.1.2 The bootstrap


The bootstrap is a powerful and widely applicable statistical technique. Its main usage is in fact for
quantifying the uncertainties in statistical estimators, but here we will use it for another purpose, namely
to mimic the variance reduction approach outlined above. To apply this technique we need to obtain
multiple training data sets, but we only have access to one data set T = {xi , yi }ni=1 . We assume that
simply collecting more data is not an option (if it were, we would be better off simply increasing the size
of T instead). Instead, we assume that T provides a good representation of the real-world data generating
process, in the sense that even if we were to collect more training data, these data points would likely be
similar to the training data points already contained in T . Therefore, we can simulate “new” training data
sets by sampling data points from the original training data. In statistical terms, instead of sampling from
the population (collecting more data), we sample from the available training data which is assumed to
provide a good representation of the population.
3
We could of course simply chop the actual training data up into B independent chunks. However, this would mean that the size
of each such data set is only n/B, which would counteract the gain offered by variance reduction. In some situations this can
still be a good idea though: if n is very large, training a single model on all data can be computationally prohibitive. Instead,
we can divide the available data into smaller chunks, train independent models on these chunks in parallel, and average their
predictions.


The bootstrapping method is stated in Algorithm 5 and illustrated in Example 6.1 below. Note that
the sampling is done with replacement, meaning that the resulting bootstrapped data set may contain
multiple copies of some of the original training data points, whereas other data points are not included in
the bootstrap sample.

Algorithm 5: The bootstrap.

Data: Training data set T = {xi, yi}, i = 1, . . . , n
Result: Bootstrapped data T̃ = {x̃i, ỹi}, i = 1, . . . , n
1 for i = 1, . . . , n do
2     Sample ℓ uniformly on the set of integers {1, . . . , n}
3     Set x̃i = xℓ and ỹi = yℓ
4 end
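
Algorithm 5 amounts to a few lines of Python (a sketch, with the data set represented as NumPy arrays):

    import numpy as np

    def bootstrap(X, y, rng=None):
        # draw a bootstrapped data set of the same size as (X, y), sampling with replacement
        rng = np.random.default_rng() if rng is None else rng
        idx = rng.integers(0, len(y), size=len(y))
        return X[idx], y[idx]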

Example 6.1: The bootstrap

Consider the same training data set as used in Example 4.2, consisting of n = 10 data points with
a two-dimensional input x = (x1 , x2 ) and a binary output y ∈ {Blue, Red}. The data set is shown
below.

Original training data, T = {xi, yi}, i = 1, . . . , 10

Index  x1   x2   y
1      9.0  2.0  Blue
2      1.0  4.0  Blue
3      4.0  6.0  Blue
4      4.0  1.0  Blue
5      1.0  2.0  Blue
6      1.0  8.0  Red
7      6.0  4.0  Red
8      7.0  9.0  Red
9      9.0  8.0  Red
10     9.0  6.0  Red

[Figure: scatter plot of the ten training data points in the (x1, x2) plane.]

To generate a bootstrapped data set T̃ = {x̃i, ỹi}, i = 1, . . . , 10, we simulate 10 times with replacement
from the index set {1, . . . , 10}, resulting in the indices

{2, 10, 10, 5, 9, 2, 5, 10, 8, 10}

Thus, (x̃1, ỹ1) = (x2, y2), (x̃2, ỹ2) = (x10, y10), etc. We end up with the following data set, where
the numbers in parentheses in the right panel indicate that there are multiple copies of some of the
original data points in the bootstrapped data.

Bootstrapped data, T̃ = {x̃i, ỹi}, i = 1, . . . , 10


Index  x̃1   x̃2   ỹ
2      1.0  4.0  Blue
10     9.0  6.0  Red
10     9.0  6.0  Red
5      1.0  2.0  Blue
9      9.0  8.0  Red
2      1.0  4.0  Blue
5      1.0  2.0  Blue
10     9.0  6.0  Red
8      7.0  9.0  Red
10     9.0  6.0  Red

[Figure: scatter plot of the bootstrapped data points; numbers in parentheses mark data points that occur multiple times.]

By running Algorithm 5 repeatedly B times we obtain B identically distributed bootstrapped data sets
T̃^1, . . . , T̃^B. We can then use the bootstrapped data sets to train B low-bias regression (or classification)
models and average their predictions, ỹ⋆^1, . . . , ỹ⋆^B, analogously to (6.2):
\[
  \widehat{y}_\star^{\text{bag}} = \frac{1}{B} \sum_{b=1}^{B} \widetilde{y}_\star^{b}. \qquad (6.3)
\]

The averaged predictions are in this case not independent since the bootstrapped data sets all come from
the same original data T. However, the hope is that the predictions are sufficiently uncorrelated (ρ is small
enough) to result in a meaningful variance reduction. Experience has shown that this is indeed often the
case. We illustrate the bagging method in Example 6.2.

Example 6.2: Illustration of bagging

Consider a toy regression problem, where the 5-dimensional input is multivariate Gaussian
x ∼ N(0, Σ) with
\[
  \Sigma = \begin{pmatrix}
    1 & 0.98 & 0.98 & 0.98 & 0.98 \\
    0.98 & 1 & 0.98 & 0.98 & 0.98 \\
    0.98 & 0.98 & 1 & 0.98 & 0.98 \\
    0.98 & 0.98 & 0.98 & 1 & 0.98 \\
    0.98 & 0.98 & 0.98 & 0.98 & 1
  \end{pmatrix}
\]
and the output is y = x₁² + ε with ε ∼ N(0, 1). We observe a training data set consisting of
n = 30 inputs and outputs. Due to the strong correlations between the input variablesa , they are all
informative about the value of the output, despite the fact that the output only depends directly on
the first input variable, x1 .
We start by training a single regression tree, branching until there are at most 3 observations
per leaf node, resulting in the model shown below (the values in the leaf nodes are the constant
predictions within each region).

70
6.1 Bagging

[Figure: the resulting regression tree, with split conditions in the internal nodes and constant predictions in the leaf nodes.]

To apply the bagging method, we need to generate bootstrapped data sets. Running algorithm 5
once, we obtain a set of sampled indices, shown by the histogram below. Note that some data points
are sampled multiple times, whereas other data points are not included at all in the bootstrapped
data.

[Figure: histogram of the indices 1–30 sampled in one bootstrapped data set.]

Training another regression tree on the bootstrapped data gives a slightly different model:

[Figure: the regression tree trained on the bootstrapped data.]

This procedure can then be repeated to generate an arbitrary number of bootstrap data sets and
regression models, each one possibly different from the ones before due to the randomness in the
bootstrap procedureb . Below we show the first 12 models that we obtain.


[Figure: the first 12 regression trees in the ensemble, each trained on its own bootstrapped data set.]

To test the performance of the bagging method for this example, we simulate a large test data set
consisting of ntest = 2000 samples from the true data distribution (which is known since we are
working with a toy model). We can then compute the root mean squared error (RMSE) for varying
number of ensemble members B. The results are shown below:

[Figure: test RMSE as a function of the number of ensemble members B, for bagging compared to a single tree.]

Two things are worth noting in the figure above: First, the bagging method does improve the RMSE
over a single tree as a result of variance reduction (here, we get roughly an 18 % reduction in
RMSE, but this is of course problem dependent). Second, the performance of the bagging method
improves as we increase the number of ensemble members, but saturates at some point beyond
which no further improvement (or degradation) is noticeable.
^a Recall that the values on the diagonal of the covariance matrix are the marginal variances of the elements x_i, and the off-diagonal values are covariances. Specifically, since all marginal variances in this example are 1, any pair of variables x_i and x_j for i ≠ j has a correlation of 0.98.
^b With n training data points there are in total n^n/n! unique bootstrap datasets.

At first glance, one might think that a bagging model (6.3) becomes more “complex” as the number of
ensemble members B increases, and that we therefore run a risk of overfitting for large B. However, from
the example above this does not appear to be the case. Indeed, the test RMSE plateaus at some value
and does not start to increase even for large values of B. This is in fact the expected (and intended)
behavior. Using more ensemble members does not make the model more flexible, but on the contrary,


reduces its overfitting (or equivalently, its variance). This can be seen by considering the limiting behavior
as B → ∞, in which case the bagging model becomes

\hat{y}^{\text{bag}}_\star = \frac{1}{B} \sum_{b=1}^{B} \tilde{y}^{b}_\star \;\xrightarrow{B \to \infty}\; \mathbb{E}\left[\tilde{y}^{b}_\star \mid \mathcal{T}\right],    (6.4)

where the expectation is w.r.t. the randomness of the bootstrapping algorithm. Since the right hand side of
this expression does not depend on B it can be thought of as a regression model with “limited flexibility”,
and the flexibility of the bagging model will therefore not increase as B becomes large. With this in mind,
in practice the choice of B is mainly guided by computational constraints. The larger B the better, but
increasing B when there is no further reduction in test error is computationally wasteful.
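To make the procedure concrete, below is a minimal sketch of bagging regression trees in Python. The use of scikit-learn's DecisionTreeRegressor as the base model, and the function names, are our own assumptions for illustration; the notes do not prescribe a particular implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, B=100, seed=0):
    # Train B regression trees, each on a bootstrapped version of (X, y).
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                     # sample n indices with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])   # deep (low-bias) base model
        trees.append(tree)
    return trees

def bagging_predict(trees, X_star):
    # Average the individual tree predictions, cf. the bagging model (6.3).
    return np.mean([tree.predict(X_star) for tree in trees], axis=0)

Since the trees are trained independently, increasing B only adds more terms to the average; it does not make any individual tree, nor the averaged prediction, more flexible.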

6.2 Random forests


As pointed out above, the variance reduction obtained by averaging is limited by the correlation between
the individual ensemble members (cf. (6.1b)). A natural question to ask is therefore if it is possible to
reduce this correlation. One simple trick for accomplishing this is a method known as random forests
(Breiman 2001).
While bagging is a general technique that in principle can be used to reduce the variance of any base
model, random forests assumes that these base models are given by classification or regression trees. The
idea is to inject additional randomness when constructing each tree, in order to reduce the correlation
among them. At first this might seem like a silly idea: randomly perturbing the training of a model should
intuitively degrade its performance. There is a rationale for this perturbation, however, which we will
discuss below, but first we present the details of the algorithm.
Let $\tilde{\mathcal{T}}^b$ be one of the B bootstrapped data sets. To train a classification or regression tree on this data
we proceed as usual (see Section 4.2), but with one difference. Throughout the training, whenever we
are about to split a node we do not consider all possible input variables x1 , . . . , xp as splitting variables.
Instead, we pick a random subset consisting of m ≤ p inputs, and only consider these m variables as
possible splitting variables. At the next splitting point we draw a new random subset of m inputs to use
as possible splitting variables, and so on. Naturally, this random subset selection is done independently
for each of the B ensemble members, so that we (with high probability) end up using different subsets
for the different trees. This will cause the B trees to be less correlated and averaging their predictions
can therefore result in larger variance reduction compared to bagging. It should be noted, however, that
this random perturbation of the training procedure will increase both the bias and the variance of each
individual tree. That being said, experience has shown that the reduction in correlation is the dominant
effect, so that the overall prediction loss is often reduced.
To understand why it can be a good idea to only consider a subset of inputs as splitting variables,
recall that tree-building is based on recursive binary splitting which is a greedy algorithm. This means
that the algorithm can make choices early on that appear to be good, but which nevertheless turn out to
be suboptimal further down the splitting procedure. For instance, consider the case when there is one
dominant input variable. If we construct an ensemble of trees using bagging, it is then very likely that all
of the ensemble members will pick this dominant variable up as the first splitting variable, making all trees
identical (i.e., perfectly correlated) after the first split. If we instead apply random forests, then some of
the ensemble members will not even have access to this dominant variable at the first split,^4 forcing them
to split according to some other variable. While there is of course no reason why this would improve
the performance of the individual tree, it could prove to be useful further down the splitting process, and
since we average over many ensemble members the overall performance can be improved.
The choice of m is a tuning parameter, where for m = p we recover the basic bagging method described previously. As a rule-of-thumb we can set m = √p for classification problems and m = p/3 for regression problems (values rounded down to the closest integer). A more systematic way of selecting m, however, is by some type of cross-validation. We illustrate the random forest regression model in the example below.

^4 That is, it is likely that the dominant variable is not contained in the random subset selected at the first split for some of the ensemble members.
Example 6.3: Random forests

We continue Example 6.2 and apply random forests to the same data. The test root mean squared
errors for different number of ensemble members are shown in the figure below.

[Figure: test error (RMSE) as a function of the number of ensemble members B, for a single tree, bagging, and a random forest.]

In this example, the decorrelation effect of the random input selection used in random forests
results in an additional 13 % drop in RMSE compared to bagging.
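In software, the subset size m typically appears as a tuning parameter of the random forest implementation. A minimal sketch, assuming scikit-learn (where the parameter is called max_features); the library choice is ours, not the notes':

from sklearn.ensemble import RandomForestRegressor

# m = p/3 rule-of-thumb for regression; a float here means a fraction of the p inputs.
# Selecting max_features by cross-validation is the more systematic alternative.
forest = RandomForestRegressor(n_estimators=100, max_features=1/3)
# forest.fit(X_train, y_train); forest.predict(X_test)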

6.3 Boosting

6.3.1 The conceptual idea

Boosting is built on the idea that even a simple (or weak) high-bias regression or classification model
often can capture some of the relationship between the inputs and the output. Thus, by training multiple
weak models, each describing part of the input-output relationship, it might be possible to combine the
predictions of these models into an overall better prediction. Hence, the intention is to turn an ensemble of
weak models into one strong model.
Boosting shares some similarities with bagging, as discussed above. Both are ensemble methods, in
the sense that they are based on combining the predictions from multiple models (an “ensemble”). Both
bagging and boosting can also be viewed as meta-algorithms, in the sense that they can be used to combine
essentially any regression or classification algorithm—they are algorithms built on top of other algorithms.
However, there are also important differences between boosting and bagging which we will discuss below.
The first key difference is in the construction of the ensemble. In bagging we construct B models in
parallel. These models are random (based on randomly bootstrapped datasets) but they are identically
distributed. Consequently, there is no reason to trust one model more than another, and the final prediction
of the bagged model is based on a plain average or majority vote of the individual predictions of the
ensemble members.
Boosting, on the other hand, uses a sequential construction of the ensemble members. Informally, this
is done in such a way that each model tries to correct the mistakes made by the previous one. This is
accomplished by modifying the training data set at each iteration in order to put more emphasis on the
data points for which the model (so far) has performed poorly. In the subsequent sections we will see
exactly how this is done for a specific boosting algorithm known as AdaBoost (Freund and Schapire 1996).
First, however, we consider a simple example to illustrate the idea.


Example 6.4: Boosting illustration

We consider a toy binary classification problem with two input variables, x1 and x2 . The training
data consists of n = 10 data points, 5 from each of the two classes. We use a decision stump, a
classification tree of depth one, as a simple (weak) base classifier. A decision stump means that we
select one of the input variables, x1 or x2 , and split the input space into two half spaces, in order to
minimize the training error. This results in a decision boundary that is perpendicular to one of the
axes. The left panel below shows the training data, illustrated by red crosses and blue dots for the
two classes, respectively. The colored regions show the decision boundary for a decision stump
$\hat{y}_1(x)$ trained on this data.

[Figure 6.1, four panels: $\hat{y}_1(x)$ at iteration b = 1, $\hat{y}_2(x)$ at b = 2, $\hat{y}_3(x)$ at b = 3, and the boosted model $\hat{y}_{\text{boost}}(x)$; axes $x_1$ and $x_2$.]

Figure 6.1: Three iterations of AdaBoost for the toy problem. Training data points for the two classes are shown
by red crosses and blue dots, respectively, scaled according to the weights. The colored regions show the
three base classifiers (decision stumps), $\hat{y}_1$, $\hat{y}_2$, and $\hat{y}_3$.

The model $\hat{y}_1(x)$ misclassifies three data points (red crosses falling in the blue region), which are
encircled in the figure. To improve the performance of the classifier we want to find a model that
can distinguish these three points from the blue class. To this end, we train another decision stump,
$\hat{y}_2(x)$, on the same data. To put emphasis on the three misclassified points, however, we assign
weights $\{w_i^2\}_{i=1}^{n}$ to the data. Points correctly classified by $\hat{y}_1(x)$ are down-weighted, whereas
the three points misclassified by $\hat{y}_1(x)$ are up-weighted. This is illustrated in the second panel
of Figure 6.1, where the marker sizes have been scaled according to the weights. The classifier
$\hat{y}_2(x)$ is then found by minimizing the weighted misclassification error, $\frac{1}{n}\sum_{i=1}^{n} w_i^2 \mathbb{I}\{y_i \neq \hat{y}_2(x_i)\}$,
resulting in the decision boundary shown in the second panel. This procedure is repeated for a
third and final iteration: we update the weights based on the hits and misses of $\hat{y}_2(x)$ and train a
third decision stump $\hat{y}_3(x)$, shown in the third panel. The final classifier $\hat{y}_{\text{boost}}(x)$ is then taken as
a combination of the three decision stumps. Its (nonlinear) decision boundaries are shown in the
right panel.

6.3.2 Binary classification, margins, and exponential loss


Before diving into the details of the AdaBoost algorithm we will lay the groundwork by introducing
some notations and concepts that will be used in its derivation. AdaBoost was originally proposed for
binary classification (K = 2) and we will restrict our attention to this setting. That is, the output y can
take two different values which, in this chapter, we encode as −1 and +1. This encoding turns out to
be mathematically convenient in the derivation of AdaBoost. However, it is important to realize that
the encoding that we choose for the two classes is arbitrary and all the concepts defined below can be
generalized to any binary encoding (e.g. {0, 1} which we have used before).
Most binary classifiers $\hat{y}(x)$ can be constructed by thresholding some real-valued function C(x) at 0.
That is, using our +1/−1 encoding we can write $\hat{y}(x) = \operatorname{sign}\{C(x)\}$. In particular, this is the case for the
AdaBoost algorithm presented below. Note that the decision boundary is given by values of x for which
C(x) = 0. For simplicity of presentation we will assume that no data points fall exactly on the decision
boundary (which always gives rise to an ambiguity), so that we can assume that yb(x) as defined above is
always either −1 or +1.


Figure 6.2: The exponential loss and the misclassification loss, plotted as functions of the margin y · C(x).

Based on the function C(x) we define the margin of the classifier as

y · C(x). (6.5)

It follows that if y and C(x) have the same sign, that is if the classification is correct, then the margin
is positive. Analogously, for an incorrect classification y and C(x) will have different signs and the
margin is negative. More specifically, since y is either −1 or +1, the margin is simply |C(x)| for correct
classifications and −|C(x)| for incorrect classifications. The margin can thus be viewed as a measure of
certainty in a prediction, where values with small margin in some sense (not necessarily Euclidean!) are
close to the decision boundary. The margin plays a similar role for binary classification as the residual
y − f (x) does for regression.
Loss functions for classification can be defined in terms of the margin, by assigning small loss to positive
margins and large loss to negative margins. One such loss function is the exponential loss defined as

L(y, C(x)) = exp(−y · C(x)). (6.6)

Figure 6.2 illustrates the exponential loss and compares it against the misclassification loss, which is
simply I{y · C(x) < 0}.
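As a small numerical illustration of (6.6) (our own sketch, not part of the figure), the code below evaluates both losses for one correctly and one incorrectly classified point with |C(x)| = 1.5:

import numpy as np

def exponential_loss(y, c):
    return np.exp(-y * c)                    # the exponential loss (6.6)

def misclassification_loss(y, c):
    return float(y * c < 0)                  # I{y * C(x) < 0}

print(exponential_loss(+1, 1.5), misclassification_loss(+1, 1.5))   # approx. 0.22 and 0.0
print(exponential_loss(-1, 1.5), misclassification_loss(-1, 1.5))   # approx. 4.48 and 1.0

The incorrect classification is penalized much harder by the exponential loss, and increasingly so the larger the negative margin is.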

Remark 6.1 The misclassification loss is often used to evaluate the performance of a classifier (in
particular if interest only lies in the number of correct and incorrect classifications). However, it is typically
not suitable to use directly during training of the model. The reason is that it is discontinuous, which is
problematic in a numerical minimization of the training loss. The exponential loss function, on the other
hand, is both convex and (infinitely many times) differentiable. These are nice properties to have when
optimizing the training loss. In fact, the exponential loss can in this way be seen as a convenient proxy for
misclassification loss during training. Other loss functions of interest are discussed in Section 6.3.5.

6.3.3 AdaBoost
We are now ready to derive a practical boosting method, the AdaBoost (Adaptive Boosting) algorithm
proposed by Freund and Schapire (1996). AdaBoost was the first successful practical implementation of
the boosting idea and led the way for its popularity. Freund and Schapire were awarded the prestigious
Gödel Prize in 2003 for their algorithm.
Recall from the discussion above that boosting attempts to construct a sequence of B (weak) classifiers
yb1 (x), yb2 (x), . . . , ybB (x). Any classification model can in principle be used to construct these so


called base classifiers—shallow classification trees are common in practice (see Section 6.3.4 for further
discussion). The individual predictions of the B ensemble members are then combined into a final
prediction. However, all ensemble members are not treated equally. Instead, we assign some positive
coefficients {αb }B
b=1 and construct the boosted classifier using a weighted majority vote according to
( B )
X
B
ybboost (x) = sign b b
α yb (x) . (6.7)
b=1

Note that each ensemble member votes either −1 or +1. The output from the boosted classifier is +1 if
the weighted sum of the individual votes is positive and −1 if it is negative.
How, then, do we learn the individual ensemble members and their coefficients? The answer to this
question depends on the specific boosting method that is used. For AdaBoost this is done by greedily
minimizing the exponential loss of the boosted classifier at each iteration. Note that we can write the
boosted classifier after b iterations as $\hat{y}^b_{\text{boost}}(x) = \operatorname{sign}\{C^b(x)\}$ where $C^b(x) = \sum_{j=1}^{b} \alpha^j \hat{y}^j(x)$. Furthermore,
we can express the function $C^b(x)$ sequentially as

C^b(x) = C^{b-1}(x) + \alpha^b \hat{y}^b(x),    (6.8)

initialized with $C^0(x) \equiv 0$. Since we construct the ensemble members sequentially, when we are at iteration b of
the procedure the function $C^{b-1}(x)$ is known and fixed. Thus, what remains to be learned at iteration b is
the coefficient $\alpha^b$ and the ensemble member $\hat{y}^b(x)$. This is done by minimizing the exponential loss of the
training data,

(\alpha^b, \hat{y}^b) = \arg\min_{(\alpha, \hat{y})} \sum_{i=1}^{n} L(y_i, C^b(x_i))    (6.9a)
              = \arg\min_{(\alpha, \hat{y})} \sum_{i=1}^{n} \exp\left(-y_i \left\{ C^{b-1}(x_i) + \alpha \hat{y}(x_i) \right\}\right)    (6.9b)
              = \arg\min_{(\alpha, \hat{y})} \sum_{i=1}^{n} w_i^b \exp\left(-y_i \alpha \hat{y}(x_i)\right)    (6.9c)

where for the first equality we have used the definition of the exponential loss function (6.6) and the
sequential structure of the boosted classifier (6.8). For the second equality we have defined the quantities
 
w_i^b \overset{\text{def}}{=} \exp\left(-y_i C^{b-1}(x_i)\right), \quad i = 1, \dots, n,    (6.10)

which can be interpreted as weights for the individual data points in the training data set. Note that these
weights are independent of α and yb and can thus be viewed as constants when solving the optimization
problem (6.9c) at iteration b of the boosting procedure (the fact that we keep previous coefficients and
ensemble members fixed is what makes the optimization “greedy”).
To solve (6.9) we start by writing the objective function as
\sum_{i=1}^{n} w_i^b \exp\left(-y_i \alpha \hat{y}(x_i)\right) = e^{-\alpha} \underbrace{\sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i = \hat{y}(x_i)\}}_{=W_c} + e^{\alpha} \underbrace{\sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i \neq \hat{y}(x_i)\}}_{=W_e},    (6.11)

where we have used the indicator function to split the sum into two sums: the first ranging over all training
data points correctly classified by $\hat{y}$ and the second ranging over all points erroneously classified by $\hat{y}$.
For notational simplicity we define $W_c$ and $W_e$ as the sums of weights of correctly and erroneously
classified data points, respectively. Furthermore, let $W = W_c + W_e$ be the total weight
sum, $W = \sum_{i=1}^{n} w_i^b$.


Minimizing (6.11) is done in two stages, first w.r.t. yb and then w.r.t. α. This is possible since the
minimizing argument in yb turns out to be independent of the actual value of α > 0. To see this, note that
we can write the objective function (6.11) as

e^{-\alpha} W + \left(e^{\alpha} - e^{-\alpha}\right) W_e.    (6.12)

Since the total weight sum W is independent of yb and since eα − e−α > 0 for any α > 0, minimizing this
expression w.r.t. yb is equivalent to minimizing We w.r.t. yb. That is,
\hat{y}^b = \arg\min_{\hat{y}} \sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i \neq \hat{y}(x_i)\}.    (6.13)

In words, the bth ensemble member should be trained by minimizing the weighted misclassification loss,
where each data point (xi , yi ) is assigned a weight wib . The intuition for these weights is that, at iteration b,
we should focus our attention on the data points previously misclassified in order to “correct the mistakes”
made by the ensemble of the first b − 1 classifiers.
How the problem (6.13) is solved in practice depends on the choice of base classifier that we use, i.e. on
the specific restrictions that we put on the function yb (for example a shallow classification tree). However,
since (6.13) is essentially a standard classification objective it can be solved, at least approximately, by
standard learning algorithms. Incorporating the weights in the objective function is straightforward for
most base classifiers, since it simply boils down to weighting the individual terms of the loss function
used when training the base classifier.
When the bth ensemble member, ybb (x), has been learned by solving (6.13) it remains to compute its
coefficient αb . Recall that this is done by minimizing the objective function (6.12). Differentiating this
expression w.r.t. α and setting the derivative to zero we get the equation

-e^{-\alpha} W + \left(e^{\alpha} + e^{-\alpha}\right) W_e = 0
\iff W = \left(e^{2\alpha} + 1\right) W_e    (6.14)
\iff \alpha = \frac{1}{2} \log\left(\frac{W}{W_e} - 1\right).
Thus, by defining
E_{\text{train}}^{b} \overset{\text{def}}{=} \frac{W_e}{W} = \sum_{i=1}^{n} \frac{w_i^b}{\sum_{j=1}^{n} w_j^b} \mathbb{I}\{y_i \neq \hat{y}^b(x_i)\}    (6.15)

to be the weighted misclassification error for the bth classifier, we can express the optimal value for its
coefficient as
\alpha^b = \frac{1}{2} \log\left(\frac{1 - E_{\text{train}}^{b}}{E_{\text{train}}^{b}}\right).    (6.16)
This completes the derivation of the AdaBoost algorithm, which is summarized in algorithm 6. In the
algorithm we exploit the fact that the weights (6.10) can be computed recursively by using the expression
(6.8); see step 2(b)iii. Furthermore, we have added an explicit weight normalization (step 2(b)iv) which is convenient
in practice and which does not affect the derivation of the method above.

Remark 6.2 One detail worth commenting on is that the derivation of the AdaBoost procedure assumes that
all coefficients $\{\alpha^b\}_{b=1}^{B}$ are positive. To see that this is indeed the case when the coefficients are computed
according to (6.16), note that the function $\log((1-x)/x)$ is positive for any $0 < x < 0.5$. Thus, $\alpha^b$ will
be positive as long as the weighted training error for the bth classifier, $E_{\text{train}}^{b}$, is less than 0.5. That is, the
classifier just has to be slightly better than a coin flip, which is always the case in practice (note that $E_{\text{train}}^{b}$
is the training error). Indeed, if $E_{\text{train}}^{b} > 0.5$, then we could simply flip the sign of all predictions made by
$\hat{y}^b(x)$ to reduce the error!


Algorithm 6: AdaBoost

1. Assign weights $w_i^1 = 1/n$ to all data points.

2. For b = 1 to B
   (a) Train a weak classifier $\hat{y}^b(x)$ on the weighted training data $\{(x_i, y_i, w_i^b)\}_{i=1}^{n}$.
   (b) Update the weights $\{w_i^{b+1}\}_{i=1}^{n}$ from $\{w_i^b\}_{i=1}^{n}$:
       i. Compute $E_{\text{train}}^{b} = \sum_{i=1}^{n} w_i^b \mathbb{I}\{y_i \neq \hat{y}^b(x_i)\}$.
       ii. Compute $\alpha^b = 0.5 \log\left((1 - E_{\text{train}}^{b})/E_{\text{train}}^{b}\right)$.
       iii. Compute $w_i^{b+1} = w_i^b \exp(-\alpha^b y_i \hat{y}^b(x_i))$, $i = 1, \dots, n$.
       iv. Normalize. Set $w_i^{b+1} \leftarrow w_i^{b+1} / \sum_{j=1}^{n} w_j^{b+1}$, for $i = 1, \dots, n$.

3. Output $\hat{y}_{\text{boost}}^{B}(x) = \operatorname{sign}\left\{\sum_{b=1}^{B} \alpha^b \hat{y}^b(x)\right\}$.
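To connect the steps of algorithm 6 to code, here is a minimal sketch in Python. Using scikit-learn decision stumps as the weak classifiers, and the particular function names, are our own choices for illustration; labels are assumed to be encoded as −1 and +1.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, B=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                               # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(B):
        # step 2(a): weak classifier trained on the weighted data
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # step 2(b)i: weighted training error (the weights sum to one)
        E = np.sum(w * (pred != y))
        E = np.clip(E, 1e-12, 1 - 1e-12)                  # guard against division by zero
        alpha = 0.5 * np.log((1 - E) / E)                 # step 2(b)ii
        w = w * np.exp(-alpha * y * pred)                 # step 2(b)iii
        w = w / w.sum()                                   # step 2(b)iv
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # step 3: weighted majority vote
    votes = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(votes)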

In the method discussed above we have assumed that each base classifier outputs a discrete class
prediction, ybb (x) ∈ {−1, 1}. However, many classification models used in practice are in fact based on
estimating the conditional class probabilities p(y | x) as discussed in Section 3.4.1. Hence, it is possible to
instead let each base model output a real number and use these numbers as the basis for the “vote”. This
extension of algorithm 6 is discussed by Friedman, Hastie, and Tibshirani (2000) and is referred to as Real
AdaBoost.

6.3.4 Boosting vs. bagging: base models and ensemble size


AdaBoost, and in fact any boosting algorithm, has two important design choices, (i) which base classifier
to use, and (ii) how many iterations B to run the boosting algorithm for. As previously pointed out, we can
use essentially any classification method as base classifier. However, the most common choice in practice
is to use a shallow classification tree, or even a decision stump (a tree of depth one; see Example 6.4).
This choice is guided by the fact that the boosting algorithm can learn a good model despite using
very weak base classifiers, and shallow trees can be trained quickly. In fact, using deep (high-variance)
classification trees as base classifiers typically deteriorates performance compared to using shallow trees.
More specifically, the depth of the tree should be chosen to obtain a desired degree of interactions between
input variables. A tree with M terminal nodes is able to model functions depending on maximally M − 1
of the input variables; see Hastie, Tibshirani, and Friedman (2009, Chapter 10.11) for a more in-depth
discussion on this matter.
The fact that boosting algorithms often use shallow trees as base classifiers is a clear distinction from the
(somewhat similar) bagging method. Bagging is a pure variance reduction technique based on averaging
and it can not reduce the bias of the individual base models. Hence, for bagging to be successful we
need to use base models with low bias (but possibly high variance)—typically very deep decision trees.
Boosting on the other hand can reduce both the variance and the bias of the base models, making it
possible to use very simple base models as described above.
Another important difference between bagging and boosting is that the former is parallel whereas the
latter is sequential. Each iteration of a boosting algorithm introduces a new base model aiming at reducing
the errors made by the current model. In bagging, on the other hand, all base models are identically
distributed and they all try to solve the same problem, with the final model being a simple average over
the ensemble. An effect of this is that bagging does not overfit as a result of using too many ensemble
members (see also the discussion in Section 6.1). Unfortunately, this is not the case for boosting. Indeed,
a boosting model becomes more and more flexible as the number of iterations B increases, and using too


Figure 6.3: Comparison of common loss functions for classification (exponential loss, hinge loss, binomial deviance, Huber-like loss, and misclassification loss), plotted as functions of the margin y · C(x).

many base models can result in overfitting. It has been observed in practice, however, that this overfitting
often occurs slowly and the performance tends to be rather insensitive to the choice of B. Nevertheless, it
is a good practice to select B in some systematic way, for instance using so called early stopping (this is
common also in the context of training neural networks; see Section 7.4.4).

6.3.5 Robust loss functions and gradient boosting


As pointed out above, the margin y · C(x) can be used as a measure of the error made by the classifier
yb(x) = sign{C(x)}, where negative margins correspond to incorrect classifications and positive margins
correspond to correct classifications. It is therefore natural to use a loss function which is a decreasing
function of the margin: negative margins should be penalized more than positive margins. The exponential
loss function (6.6)—which was used in the derivation of the AdaBoost algorithm—satisfies this requirement,
as can be seen in Figure 6.2. However, this loss function also penalizes negative margins very heavily.
This can be an issue in practical applications, making the classifier sensitive to noisy data and “outliers”,
such as mislabeled or atypical data points.
To address these limitations we can consider using some other, more robust, loss function in place of
the exponential loss. A few examples of commonly used loss functions for classification are shown in
Figure 6.3 (see Section 6.A for the mathematical definitions of these functions). An in-depth discussion of
the rationale and pros and cons of these different loss functions is beyond the scope of these lecture notes
and we refer the interested reader to Hastie, Tibshirani, and Friedman (2009, Chapter 10.6). However, we
note that all the alternative loss functions illustrated in the figure have less “aggressive” penalties for large
negative margins compared to the exponential loss, i.e., their slopes are not as sharp,^5 making them more
robust to noisy data.
Why then have we not used a more robust loss function in the derivation of the AdaBoost algorithm?
The reason for this is mainly computational. Using exponential loss is convenient since it leads to a closed
form solution to the optimization problem in (6.9). If we instead use another loss function this analytical
tractability is unfortunately lost.
However, this difficulty can be dealt with by using techniques from numerical optimization. This
approach is complicated to some extent by the fact that the optimization “variable” in (6.9a) is the base
classifier yb(x) itself. Hence, it is not possible to simply use an off-the-shelf numerical optimization
algorithm to solve this problem. That being said, however, it has been realized that it is possible to
approximately solve (6.9a) for rather general loss functions using a method reminiscent of gradient descent
(Appendix B). The resulting method is referred to as gradient boosting (Friedman 2001; Mason et al. 1999).
^5 Hinge loss, binomial deviance, and the Huber-like loss all increase linearly for large negative margins. Exponential loss, of course, increases exponentially.


We provide pseudo-code for one instance of a gradient boosting method in algorithm 7. As can be seen
from the algorithm, the key step involves fitting a base model to the negative gradient of the loss function.
This can be understood via the intuitive interpretation of boosting, that each base model should try to
correct the mistakes made by the ensemble thus far. The negative gradient of the loss function gives an
indication of in which “direction” the model should be updated in order to reduce the loss.

Algorithm 7: A gradient boosting algorithm

1. Initialize (as a constant), $C^0(x) \equiv \arg\min_c \sum_{i=1}^{n} L(y_i, c)$.

2. For b = 1 to B
   (a) Compute the negative gradient of the loss function,
       $g_i^b = -\left[\frac{\partial L(y_i, c)}{\partial c}\right]_{c = C^{b-1}(x_i)}, \quad i = 1, \dots, n.$
   (b) Train a base regression model $\hat{f}^b(x)$ to fit the gradient values,
       $\hat{f}^b = \arg\min_f \sum_{i=1}^{n} \left( f(x_i) - g_i^b \right)^2.$
   (c) Update the boosted model,
       $C^b(x) = C^{b-1}(x) + \gamma \hat{f}^b(x).$

3. Output $\hat{y}_{\text{boost}}^{B}(x) = \operatorname{sign}\{C^B(x)\}$.

While presented for classification in algorithm 7, gradient boosting can also be used for regression with
minor modifications. In fact, an interesting aspect of the algorithm presented here is that the base models
fbb (x) are found by solving a regression problem despite the fact that the algorithm produces a classifier.
The reason for this is that the negative gradient values {gib }ni=1 are quantitative variables, even if the data
{yi }ni=1 is qualitative. Here we have considered fitting a base model to these negative gradient values by
minimizing a square loss criterion.

The value γ used in the algorithm (line 2(c)) is a tuning parameter which plays a similar role to the
step size in ordinary gradient descent. In practice this is usually found by line search (see Appendix B),
often combined with a type of regularization via shrinkage (Friedman 2001). When using trees as base
models—as is common in practice—optimizing the steps size can be done jointly with finding the terminal
node values, resulting in a more efficient implementation (Friedman 2001).

As mentioned above, gradient boosting requires a certain amount of smoothness in the loss function.
A minimal requirement is that it is almost everywhere differentiable, so that it is possible to compute
the gradient of the loss function. However, some implementations of gradient boosting require stronger
conditions, such as second order differentiability. The binomial deviance (see Figure 6.3) is in this respect
a “safe choice” which is infinitely differentiable and strongly convex, while still enjoying good statistical
properties. As a consequence, binomial deviance is one of the most commonly used loss functions in
practice.
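As a concrete illustration of algorithm 7, below is a minimal sketch in Python for binary classification with the binomial deviance loss L(y, c) = log(1 + exp(−yc)) and small regression trees as base models. The library, the fixed step size γ, and the function names are our own assumptions; the text instead suggests finding γ by line search.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, B=100, gamma=0.1, depth=2):
    # Labels y are assumed to be encoded as -1 and +1.
    p = np.mean(y == 1)
    c0 = np.log(p / (1 - p))                  # step 1: best constant model (the log-odds)
    C = np.full(len(y), c0)                   # current values C^{b-1}(x_i)
    trees = []
    for _ in range(B):
        g = y / (1.0 + np.exp(y * C))         # step 2(a): negative gradient of the deviance
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, g)   # step 2(b): square-loss fit
        C = C + gamma * tree.predict(X)       # step 2(c): update the boosted model
        trees.append(tree)
    return c0, trees

def gradient_boost_predict(c0, trees, X, gamma=0.1):
    C = c0 + gamma * sum(t.predict(X) for t in trees)
    return np.sign(C)                         # step 3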


6.A Classification loss functions


The classification loss functions illustrated in Figure 6.3 are:

Exponential loss: $L(y, c) = \exp(-yc)$.

Hinge loss: $L(y, c) = \begin{cases} 1 - yc & \text{for } yc < 1, \\ 0 & \text{otherwise.} \end{cases}$

Binomial deviance: $L(y, c) = \log(1 + \exp(-yc))$.

Huber-like loss: $L(y, c) = \begin{cases} -yc & \text{for } yc < -1, \\ \tfrac{1}{4}(1 - yc)^2 & \text{for } -1 \le yc \le 1, \\ 0 & \text{otherwise.} \end{cases}$

Misclassification loss: $L(y, c) = \begin{cases} 1 & \text{for } yc < 0, \\ 0 & \text{otherwise.} \end{cases}$

7 Neural networks and deep learning
Neural networks can be used for both regression and classification, and they can be seen as an extension of
linear regression and logistic regression, respectively. Traditionally neural networks with one so-called
hidden layer have been used and analysed, and several success stories came in the 1980s and early 1990s.
In the 2000s it was, however, realized that deep neural networks with several hidden layers, or simply deep
learning, are even more powerful. With the combination of new software, hardware, parallel algorithms
for training and a lot of training data, deep learning has made a major contribution to machine learning.
Deep learning has excelled in many applications, including image classification, speech recognition and
language translation. New applications, analysis, and algorithmic developments to deep learning are
published literally every day.
We will start in Section 7.1 by generalizing linear regression to a two-layer neural network (i.e., a neural
network with one hidden layer), and then generalize it further to a deep neural network. We thereafter
leave regression and look at the classification setting in Section 7.2. In Section 7.3 we present a special
neural network tailored for images and finally we look into some of the details on how to train neural
networks in Section 7.4.

7.1 Neural networks for regression


A neural network is a nonlinear function that describes the output variable y as a nonlinear function of its
input variables

y = f (x1 , . . . , xp ; θ) + , (7.1)

where  is an error term and the function f is parametrized by θ. Such a nonlinear function can be
parametrized in many ways. In a neural network, the strategy is to use several layers of linear regression
models and nonlinear activation functions. We will explain this carefully in turn below. For the model
description it will be convenient to define z as the output without the noise term ,

z = f (x1 , . . . , xp ; θ). (7.2)

7.1.1 Generalized linear regression


We start the description with a graphical illustration of the linear regression model

z = β0 1 + β1 x1 + β2 x2 + · · · + βp xp , (7.3)

which is shown in Figure 7.1a. Each input variable xj is represented with a node and each parameter βj
with a link. Furthermore, the output z is described as the sum of all terms $\beta_j x_j$. Note that we use 1 as the
input variable corresponding to the offset term β0 .
To describe nonlinear relationships between x = [1 x1 x2 . . . xp ]T and z we introduce a nonlinear
scalar function called the activation function σ : R → R. The linear regression model (7.3) is now
modified into a generalized linear regression model where the linear combination of the inputs is squashed
through the (scalar) activation function

z = \sigma\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right).    (7.4)

This extension to the generalized linear regression model is visualized in Figure 7.1b.


Figure 7.1: Graphical illustration of a linear regression model (Figure 7.1a), and a generalized linear regression
model (Figure 7.1b). In Figure 7.1a, the output z is described as the sum of all terms $\beta_0$ and $\{\beta_j x_j\}_{j=1}^{p}$, as in (7.3).
In Figure 7.1b, the circle denotes addition and also transformation through the activation function σ, as in (7.4).

Figure 7.2: Two common activation functions used in neural networks: the logistic (or sigmoid) function $\sigma(x) = \frac{1}{1+e^{-x}}$ (Figure 7.2a), and the rectified linear unit $\sigma(x) = \max(0, x)$ (Figure 7.2b).

Common choices for activation function are the logistic function and the rectified linear unit (ReLU).
These are illustrated in Figure 7.2a and Figure 7.2b, respectively. The logistic (or sigmoid) function has
already been used in the context of logistic regression (Section 3.2). The logistic function is affine close to
x = 0 and saturates at 0 and 1 as x decreases or increases. The ReLU is even simpler. The function is the
identity function for positive inputs and equal to zero for negative inputs.
The logistic function used to be the standard choice of activation function in neural networks for many
years, whereas the ReLU has gained in popularity (despite its simplicity!) during recent years and it is
now the standard choice in most neural network models.
The generalized linear regression model (7.4) is very simple and is itself not capable of describing
very complicated relationships between the input x and the output z. Therefore, we make two further
extensions to increase the generality of the model: We will first make use of several generalized linear
regression models to build a layer (which will lead us to the two-layer neural network) and then stack these
layers in a sequential construction (which will result in a deep neural network, or simply deep learning).

7.1.2 Two-layer neural network

In (7.4), the output is constructed by one scalar regression model. To increase its flexibility and turn it into
a two-layer neural network, we instead let the output be a sum of M such generalized linear regression
models, each of which has its own parameters. The parameters for the ith regression model are $\beta_{0i}, \dots, \beta_{pi}$
and we denote its output by hi ,

hi = σ (β0i + β1i x1 + β2i x2 + · · · + βpi xp ) , i = 1, . . . , M. (7.5)

These intermediate outputs hi are so-called hidden units, since they are not the output of the whole model.
The M different hidden units {hi }M i=1 instead act as input variables to an additional linear regression
model

z = β0 + β1 h1 + β2 h2 + · · · + βM hM . (7.6)

To distinguish the parameters in (7.5) and (7.6) we add the superscripts (1) and (2), respectively. The
equations describing this two-layer neural network (or equivalently, neural network with one layer of


Figure 7.3: A two-layer neural network, or equivalently, a neural network with one intermediate layer of hidden units.

hidden units) are thus


 
h_1 = \sigma\left(\beta_{01}^{(1)} + \beta_{11}^{(1)} x_1 + \beta_{21}^{(1)} x_2 + \dots + \beta_{p1}^{(1)} x_p\right),
h_2 = \sigma\left(\beta_{02}^{(1)} + \beta_{12}^{(1)} x_1 + \beta_{22}^{(1)} x_2 + \dots + \beta_{p2}^{(1)} x_p\right),    (7.7a)
  \vdots
h_M = \sigma\left(\beta_{0M}^{(1)} + \beta_{1M}^{(1)} x_1 + \beta_{2M}^{(1)} x_2 + \dots + \beta_{pM}^{(1)} x_p\right),

z = \beta_0^{(2)} + \beta_1^{(2)} h_1 + \beta_2^{(2)} h_2 + \dots + \beta_M^{(2)} h_M.    (7.7b)

Extending the graphical illustration from Figure 7.1, this model can be depicted as a graph with two layers
of links (illustrated using arrows), see Figure 7.3. As before, each link has a parameter associated with it.
Note that we include an offset term not only in the input layer, but also in the hidden layer.

7.1.3 Matrix notation


The two-layer neural network model in (7.7) can also be written more compactly using matrix notation,
where the parameters in each layer are stacked in a weight matrix W and an offset vector^1 b as

b^{(1)} = \begin{bmatrix} \beta_{01}^{(1)} & \dots & \beta_{0M}^{(1)} \end{bmatrix}, \quad
W^{(1)} = \begin{bmatrix} \beta_{11}^{(1)} & \dots & \beta_{1M}^{(1)} \\ \vdots & \ddots & \vdots \\ \beta_{p1}^{(1)} & \dots & \beta_{pM}^{(1)} \end{bmatrix}, \quad
b^{(2)} = \beta_0^{(2)}, \quad
W^{(2)} = \begin{bmatrix} \beta_1^{(2)} \\ \vdots \\ \beta_M^{(2)} \end{bmatrix}.    (7.8)

The full model can then be written as


 
h = \sigma\left(W^{(1)T} x + b^{(1)T}\right),    (7.9a)
z = W^{(2)T} h + b^{(2)T},    (7.9b)

where we have also stacked the components in x and h as x = [x1 , . . . , xp ]T and h = [h1 , . . . , hM ]T .
The activation function σ acts element-wise. The two weight matrices and the two offset vectors will be
^1 The word “bias” is often used for the offset vector in the neural network literature, but this is really just a model parameter and not a bias in the statistical sense. To avoid confusion we refer to it as an offset instead.


the parameters in the model, which can be written as


\theta = \begin{bmatrix} \operatorname{vec}(W^{(1)})^T & \operatorname{vec}(W^{(2)})^T & b^{(1)} & b^{(2)} \end{bmatrix}^T.    (7.10)

By this we have described a nonlinear regression model of the form y = f(x; θ) + ε where f(x; θ) = z
according to above. Note that the predicted output z in (7.9b) depends on all the parameters in θ even
though it is not explicitly stated in the notation.

7.1.4 Deep neural network

The two-layer neural network is a useful model on its own, and a lot of research and analysis has been
done for it. However, the real descriptive power of a neural network is realized when we stack multiple
such layers of generalized linear regression models, and thereby achieve a deep neural network. Deep
neural networks can model complicated relationships (such as the one between an image and its class),
and are one of the state-of-the-art methods in machine learning as of today.
We enumerate the layers with index l. Each layer is parametrized with a weight matrix W(l) and an
offset vector b(l) , as for the two-layer case. For example, W(1) and b(1) belong to layer l = 1, W(2) and
$b^{(2)}$ belong to layer l = 2 and so forth. We also have multiple layers of hidden units denoted by $h^{(l-1)}$.
Each such layer consists of $M_l$ hidden units $h^{(l)} = [h_1^{(l)}, \dots, h_{M_l}^{(l)}]^T$, where the dimensions $M_1, M_2, \dots$
can be different for different layers.
Each layer maps a hidden layer h(l−1) to the next hidden layer h(l) as

h(l) = σ(W(l)T h(l−1) + b(l)T ). (7.11)

This means that the layers are stacked such that the output of the first layer h(1) (the first layer of hidden
units) is the input to the second layer, the output of the second layer h(2) (the second layer of hidden units)
is the input to the third layer, etc. By stacking multiple layers we have constructed a deep neural network.
A deep neural network of L layers can mathematically be described as (cf. (7.9))

h^{(1)} = \sigma(W^{(1)T} x + b^{(1)T}),
h^{(2)} = \sigma(W^{(2)T} h^{(1)} + b^{(2)T}),
  \vdots    (7.12)
h^{(L-1)} = \sigma(W^{(L-1)T} h^{(L-2)} + b^{(L-1)T}),
z = W^{(L)T} h^{(L-1)} + b^{(L)T}.

A graphical representation of this model is given in Figure 7.4.


The weight matrix W(1) for the first layer l = 1 has the dimension p × M1 and the corresponding offset
vector b(1) the dimension 1 × M1 . In deep learning it is common to consider applications where also the
output is multi-dimensional z = [z1 , . . . , zK ]T . This means that for the last layer the weight matrix W(L)
has the dimension ML−1 × K and the offset vector b(L) the dimension 1 × K. For all intermediate layers
l = 2, . . . , L − 1, W(l) has the dimension Ml−1 × Ml and the corresponding offset vector 1 × Ml .
The number of inputs p and the number of outputs K are given by the problem, but the number of layers
L and the dimensions M1 , M2 , . . . are user design choices that will determine the flexibility of the model.
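As a minimal sketch of how the forward pass (7.12) can be evaluated in code (plain NumPy, our own illustration; later in this chapter Tensorflow is used instead):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, offsets):
    # weights = [W1, ..., WL] with W(l) of shape (M_{l-1}, M_l),
    # offsets = [b1, ..., bL] with b(l) of shape (1, M_l),
    # x is a column vector of shape (p, 1).
    h = x
    for W, b in zip(weights[:-1], offsets[:-1]):
        h = relu(W.T @ h + b.T)               # hidden layers, cf. (7.11)
    W, b = weights[-1], offsets[-1]
    return W.T @ h + b.T                      # last layer without activation (regression)

For instance, with p = 3 inputs, two hidden layers with M1 = M2 = 4 units and a scalar output, weights would hold matrices of shapes (3, 4), (4, 4) and (4, 1), and offsets would hold row vectors of shapes (1, 4), (1, 4) and (1, 1).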

7.1.5 Learning the network from data

Analogously to the parametric models presented earlier (e.g. linear regression and logistic regression) we
need to learn all the parameters in order to use the model. For deep neural networks the parameters are
\theta = \begin{bmatrix} \operatorname{vec}(W^{(1)})^T & \operatorname{vec}(W^{(2)})^T & \cdots & \operatorname{vec}(W^{(L)})^T & b^{(1)} & b^{(2)} & \cdots & b^{(L)} \end{bmatrix}^T.    (7.13)


Figure 7.4: A deep neural network with L layers. Each layer is parameterized with $W^{(l)}$ and $b^{(l)}$.

The wider and deeper the network is, the more parameters there are. Practical deep neural networks can
easily have in the order of millions of parameters and these models are therefore also extremely flexible.
Hence, some mechanism to avoid overfitting is needed. Regularization such as ridge regression is common
(cf. Section 2.6), but there are also other techniques specific to deep learning; see further Section 7.4.4.
Furthermore, the more parameters there are, the more computational power is needed to train the model.
As before, the training data T = {(xi , yi )}ni=1 consists of n samples of the input x and the output y.
For a regression problem we typically start with maximum likelihood and assume Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, and thereby obtain the square error loss function as in Section 2.3.1,

\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta) \quad \text{where} \quad L(x_i, y_i, \theta) = \|y_i - f(x_i; \theta)\|^2 = \|y_i - z_i\|^2.    (7.14)
This problem can be solved with numerical optimization, and more precisely stochastic gradient descent.
This is described in more detail in Section 7.4.
From the model, the parameters θ, and the inputs {xi }ni=1 we can compute the predicted outputs {zi }ni=1
using the model zi = f (xi ; θ). For example, for the two-layer neural network presented in Section 7.1.2
we have
h_i^T = \sigma(x_i^T W^{(1)} + b^{(1)}),    (7.15a)
z_i^T = h_i^T W^{(2)} + b^{(2)}.    (7.15b)
In (7.15) the equations are transposed in comparison to the model in (7.9). This is a small trick such that
we easily can extend (7.15) to include multiple data points i. Similar to (2.4) we stack all data points in
matrices, where each data point represents one row,

Y = \begin{bmatrix} y_1^T \\ \vdots \\ y_n^T \end{bmatrix}, \quad X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix}, \quad Z = \begin{bmatrix} z_1^T \\ \vdots \\ z_n^T \end{bmatrix}, \quad \text{and} \quad H = \begin{bmatrix} h_1^T \\ \vdots \\ h_n^T \end{bmatrix}.    (7.16)
We can then write (7.15) as
H = \sigma(XW^{(1)} + b^{(1)}),    (7.17a)
Z = HW^{(2)} + b^{(2)},    (7.17b)
where we also have stacked the predicted output and the hidden units in matrices. This is also how the
model would be implemented in code. In Tensorflow, which will be used in the laboratory work in the
course, it can be written as


Figure 7.5: A deep neural network with L layers for classification. The only difference to regression (Figure 7.4) is
the softmax transformation after layer L.

H = tf.sigmoid(tf.matmul(X, W1) + b1)   # hidden units, cf. (7.17a)
Z = tf.matmul(H, W2) + b2               # predicted outputs, cf. (7.17b)

Note that in (7.17) the offset vectors b1 and b2 are added and broadcasted to each row. See more details
regarding implementation of a neural network in the instructions for the laboratory work.

7.2 Neural networks for classification

Neural networks can also be used for classification where we have qualitative outputs y ∈ {1, . . . , K}
instead of quantitative. In Section 3.2 we extended linear regression to logistic regression by adding the
logistic function to the output. In the same manner we can extend the neural network presented in the last
section to a neural network for classification. In doing this extension, we will use the multi-class version
of logistic regression presented in Section 3.2.3, and more specifically the softmax parametrization given
in (3.13), repeated here for convenience

\operatorname{softmax}(z) = \frac{1}{\sum_{j=1}^{K} e^{z_j}} \begin{bmatrix} e^{z_1} & \cdots & e^{z_K} \end{bmatrix}^T.    (7.18)

The softmax function acts as an additional activation function on the final layer of the neural network.
In addition to the regression network in (7.12) we add the softmax function at the end of the network as

  \vdots
z = W^{(L)T} h^{(L-1)} + b^{(L)T},    (7.19a)
\begin{bmatrix} p(1 \mid x_i) & p(2 \mid x_i) & \dots & p(K \mid x_i) \end{bmatrix}^T = \operatorname{softmax}(z).    (7.19b)

The softmax function thus maps the output of the last layer z1 , . . . , zK to the modeled class probabilities
p(1 | xi ), . . . , p(K | xi ), see also Figure 7.5 for a graphical illustration. The inputs to the softmax function,
i.e. the variables z1 , . . . , zK , are referred to as logits.
Note that the softmax function does not come as a layer with additional parameters; it only transforms
the previous output $[z_1 \dots z_K]^T \in \mathbb{R}^K$ to the modeled probabilities $[p(1 \mid x_i) \dots p(K \mid x_i)]^T \in [0, 1]^K$.
Also note that by construction of the softmax function, these values will sum to 1 regardless of the values
of $[z_1, \dots, z_K]$ (otherwise they would not be probabilities).
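As a quick illustration of (7.18) (a sketch of our own, not part of the notes):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # shifting by the maximum does not change the result, cf. (7.22)
    return e / e.sum()

p = softmax(np.array([2.0, -1.0, 0.5]))
print(p, p.sum())                  # K = 3 modeled probabilities that sum to 1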


Figure 7.6: Illustration of the cross-entropy between a data point $y_i$ and two different prediction outputs.
  For k = 1, 2, 3:  $y_{ik}$ = 0, 1, 0.
  Left: $p(k \mid x_i; \theta_A)$ = 0.1, 0.8, 0.1, giving cross-entropy $L(x_i, y_i, \theta_A) = -1 \cdot \log 0.8 = 0.22$.
  Right: $p(k \mid x_i; \theta_B)$ = 0.8, 0.1, 0.1, giving cross-entropy $L(x_i, y_i, \theta_B) = -1 \cdot \log 0.1 = 2.30$.

7.2.1 Learning classification networks from data


As before, the training data consists of n samples of inputs and outputs {(xi , yi )}ni=1 . For the classification
problem we use the one-hot encoding for the output yi . This means that for a problem with K different
classes, $y_i$ consists of K elements $y_i = [y_{i1} \dots y_{iK}]^T$. If a data point i belongs to class k then
yik = 1 and yij = 0 for all j 6= k. See more about the one-hot encoding in Section 3.2.3.
For a neural network with the softmax activation function on the final layer we typically use the negative
log-likelihood, which is also commonly referred to as the cross-entropy loss function, to train the model
(cf. (3.16))
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta) \quad \text{where} \quad L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log p(k \mid x_i; \theta).    (7.20)

The cross-entropy is close to its minimum if the predicted probability p(k | xi ; θ) is close to 1 for the
k for which $y_{ik} = 1$. For example, if the ith data point belongs to class k = 2 out of K = 3 classes
in total, we have $y_i = [0\ 1\ 0]^T$. Assume that we have a set of parameters of the network denoted $\theta_A$,
and with these parameters we predict p(1 | xi ; θ A ) = 0.1, p(2 | xi ; θ A ) = 0.8 and p(3 | xi ; θ A ) = 0.1
meaning that we are quite sure that data point i actually belongs to class k = 2. This would generate a
low cross-entropy L(xi , yi , θ A ) = −(0 · log 0.1 + 1 · log 0.8 + 0 · log 0.1) = 0.22. If we instead predict
p(1 | xi ; θ B ) = 0.8, p(2 | xi ; θ B ) = 0.1 and p(3 | xi ; θ B ) = 0.1, the cross-entropy would be much higher
L(xi , yi , θ B ) = −(0 · log 0.8 + 1 · log 0.1 + 0 · log 0.1) = 2.30. For this case, we would indeed prefer
the parameters θ A over θ B . This is summarized in Figure 7.6.
Computing the loss function explicitly via the logarithm could lead to numerical problems when
p(k | xi ; θ) is close to zero since log(x) → −∞ as x → 0. This can be avoided since the logarithm in the
cross-entropy loss function (7.20) can “undo” the exponential in the softmax function (7.18),
L(x_i, y_i, \theta) = -\sum_{k=1}^{K} y_{ik} \log p(k \mid x_i; \theta) = -\sum_{k=1}^{K} y_{ik} \log[\operatorname{softmax}(z_i)]_k
                    = -\sum_{k=1}^{K} y_{ik} \left( z_{ik} - \log \sum_{j=1}^{K} e^{z_{ij}} \right),    (7.21)
                    = -\sum_{k=1}^{K} y_{ik} \left( z_{ik} - \max_j z_{ij} - \log \sum_{j=1}^{K} e^{z_{ij} - \max_j z_{ij}} \right),    (7.22)

where $z_{ik}$ are the logits.
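A minimal sketch of the loss (7.20) computed via the rewriting in (7.22), in plain NumPy (our own illustration):

import numpy as np

def cross_entropy(z, y_onehot):
    # z: (n, K) array of logits, y_onehot: (n, K) one-hot encoded outputs.
    z_shift = z - z.max(axis=1, keepdims=True)                     # subtract max_j z_ij
    log_softmax = z_shift - np.log(np.exp(z_shift).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(y_onehot * log_softmax, axis=1))        # average over the n data points

# The data point in Figure 7.6: passing log-probabilities as logits reproduces the values
print(cross_entropy(np.log(np.array([[0.1, 0.8, 0.1]])), np.array([[0., 1., 0.]])))   # approx. 0.22
print(cross_entropy(np.log(np.array([[0.8, 0.1, 0.1]])), np.array([[0., 1., 0.]])))   # approx. 2.30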

7.3 Convolutional neural networks


Convolutional neural networks (CNNs) are a special kind of neural network tailored for problems where
the input data has a grid-like topology. In this text we will focus on images, which have a 2D-topology
of pixels. Images are also the most common type of input data in applications where CNNs are applied.
However, CNNs can be used for any input data on a grid, also in 1D (e.g. audio waveform data) and 3D
(volumetric data e.g. CT scans or video data). We will focus on grayscale images, but the approach can
easily be extended to color images as well.


Figure 7.7: Data representation of a grayscale image with 6 × 6 pixels. Each pixel is represented with a number
describing the grayscale color. We denote the whole image as X (a matrix), and each pixel value is an input variable
$x_{j,k}$ (element in the matrix X).

Figure 7.8: An illustration of the interactions in a convolutional layer: Each hidden unit (circle) is only dependent
on the pixels in a small region of the image (red boxes), here of size 3 × 3 pixels. The location of the hidden
unit corresponds to the location of the region in the image: if we move to a hidden unit one step to the right, the
corresponding region also moves one step to the right, compare Figure 7.8a and Figure 7.8b. Furthermore, the nine
parameters $\beta_{1,1}^{(1)}, \beta_{1,2}^{(1)}, \dots, \beta_{3,3}^{(1)}$ are the same for all hidden units in the layer.

7.3.1 Data representation of an image


Digital grayscale images consist of pixels ordered in a matrix. Each pixel can be represented as a range
from 0 (total absence, black) to 1 (total presence, white) and values between 0 and 1 represent different
shades of gray. In Figure 7.7 this is illustrated for an image with 6 × 6 pixels. In an image classification
problem, an image is the input x and the pixels in the image are the input variables x1,1 , x1,2 , . . . , x6,6 .
The two indices j and k determine the position of the pixel in the image, as illustrated in Figure 7.7.
If we put all input variables representing the image's pixels in a long vector, we can use the network
architecture presented in Sections 7.1 and 7.2 (and that is what we will do in the laboratory work to start
with!). However, by doing that, a lot of the structure present in the image data will be lost. For example,
we know that two pixels close to each other have more in common than two pixels further apart. This
information would be destroyed by such a vectorization. In contrast, CNNs preserve this information by
representing the input variables as well as the hidden layers as matrices. The core component in a CNN is
the convolutional layer, which will be explained next.

7.3.2 The convolutional layer


Following the input layer, we use a hidden layer with equally many hidden units as input variables. For the
image with 6 × 6 pixels we consequently have 6 × 6 = 36 hidden units. We choose to order the hidden
units in a 6 × 6 matrix, i.e. in the same manner as we did for the input variables, see Figure 7.8a.
The network layers presented in earlier sections (like the one in Figure 7.3) have been dense layers. This
means that each input variable is connected to each hidden unit in the following layer, and each such
connection has a unique parameter βjk associated with it. These layers have empirically been found to


Figure 7.9: An illustration of zero-padding, used when the region is partly outside the image. With zero-padding, the size
of the image can be preserved in the following layer.

provide too much flexibility for images and we might not be able to capture the patterns of real importance,
and hence not generalize and perform well on unseen data. Instead, a convolutional layer exploits
the structure present in images to find a more efficiently parameterized model. In contrast to a dense layer,
a convolutional layer leverages two important concepts – sparse interactions and parameter sharing – to
achieve such a parametrization.

Sparse interactions

With sparse interactions we mean that most of the parameters in a dense layer are forced to be equal to
zero. More specifically, a hidden unit in a convolutional layer only depends on the pixels in a small region
of the image and not on all pixels. In Figure 7.8 this region is of size 3 × 3. The position of the region
is related to the position of the hidden unit in its matrix topology. If we move to the hidden unit one
step to the right, the corresponding region in the image also moves one step to the right, as displayed by
comparing Figure 7.8a and Figure 7.8b. For the hidden units on the border, the corresponding region is
partly located outside the image. For these border cases, we typically use zero-padding where the missing
pixels are replaced with zeros. Zero-padding is illustrated in Figure 7.9.

Parameter sharing

In a dense layer each link between an input variable and a hidden unit has its own unique parameter. With
parameter sharing we instead let the same parameter be present in multiple places in the network. In a
convolutional layer the set of parameters for the different hidden units are all the same. For example, in
Figure 7.8a we use the same set of parameters to map the 3 × 3 region of pixels to the hidden unit as we do
in Figure 7.8b. Instead of learning separate sets of parameters for every position we only learn one such set
of a few parameters, and use it for all links between the input layer and the hidden units. We call this set of
parameters a kernel. The mapping between the input variables and the hidden units can be interpreted as a
convolution between the input variables and the kernel, hence the name convolutional neural network.
The sparse interactions and parameter sharing in a convolutional layer makes the CNN fairly invariant to
translations of objects in the image. If the parameters in the kernel are sensitive to a certain detail (such as
a corner, an edge, etc.) a hidden unit will react to this detail (or not) regardless of where in the image that
detail is present! Furthermore, a convolutional layer uses a lot fewer parameters than the corresponding
dense layer. In Figure 7.8 only 3 · 3 + 1 = 10 parameters are required (including the offset parameter). If
we had used a dense layer, (36 + 1) · 36 = 1332 parameters would have been needed! Another way of
interpreting this is: with the same amount of parameters, a convolutional layer can encode more properties
of an image than a dense layer.
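To make the sparse interactions and the parameter sharing concrete, the following sketch (in Python with NumPy, which these notes do not otherwise use) applies a single shared 3 × 3 kernel and one offset parameter to every position of a zero-padded 6 × 6 image, producing a 6 × 6 grid of hidden units from only 10 parameters. The random image, the random kernel values and the logistic activation are illustrative assumptions, not something prescribed by the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(X, K, b):
    """One convolutional layer: the same 3x3 kernel K and offset b are used for every hidden unit."""
    rows, cols = X.shape
    Xp = np.pad(X, 1)                          # zero-padding: border regions see zeros outside the image
    H = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            region = Xp[i:i + 3, j:j + 3]      # the 3x3 region of pixels tied to hidden unit (i, j)
            H[i, j] = sigmoid(np.sum(region * K) + b)
    return H

X = np.random.rand(6, 6)          # a 6x6 grayscale image (illustrative)
K = np.random.randn(3, 3)         # the 9 shared kernel parameters
b = 0.1                           # the offset parameter: 3*3 + 1 = 10 parameters in total
print(conv_layer(X, K, b).shape)  # (6, 6) -- one hidden unit per pixel
```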

7.3.3 Condensing information with strides


In the convolutional layer presented above we have as many hidden units as we have pixels in the
image. As we add more layers to the CNN we usually want to condense the information by reducing the
number of hidden units at each layer. One way of doing this is by not applying the kernel to every pixel


[Figure 7.10 shows, in panels (a) and (b), a 6 × 6 image and the 3 × 3 grid of hidden units obtained when the kernel is applied only at every second row and column.]

Figure 7.10: A convolutional layer with stride [2,2] and kernel of size 3 × 3.

but only to, say, every second pixel. If we apply the kernel to every second pixel both row-wise and column-wise, the
hidden units will only have half as many rows and half as many columns. For a 6 × 6 image we get 3 × 3
hidden units. This concept is illustrated in Figure 7.10.
The stride controls how many pixels the kernel shifts over the image at each step. In Figure 7.8 the
stride is [1,1] since the kernel moves by one pixel both row- and column-wise. In Figure 7.10 the stride is
[2,2] since it moves by two pixels row- and column-wise. Note that the convolutional layer in Figure 7.10
still requires 10 parameters, just as the convolutional layer in Figure 7.8 does. Another way of condensing the
information after a convolutional layer is by subsampling the data, so-called pooling. The interested reader can
read further about pooling in Goodfellow, Bengio, and Courville 2016.
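As a sketch of how the stride condenses the information, the snippet below (using SciPy, which the notes do not rely on) first computes the zero-padded sliding-window sums of a 3 × 3 kernel at every pixel, i.e. stride [1,1], and then keeps only every second row and column, which gives exactly the 3 × 3 grid of hidden units a stride-[2,2] layer would produce for a 6 × 6 image. Note that correlate2d computes the sliding inner product without flipping the kernel, which matches the mapping described above; the offset and activation are omitted here.

```python
import numpy as np
from scipy.signal import correlate2d

X = np.random.rand(6, 6)      # 6x6 image (illustrative)
K = np.random.randn(3, 3)     # one shared 3x3 kernel

# Sliding-window sums at every pixel with zero-padding, i.e. stride [1,1]: a 6x6 output.
Z_full = correlate2d(X, K, mode='same', boundary='fill', fillvalue=0)

# Stride [2,2]: only evaluate at every second row and column, giving a 3x3 output.
Z_strided = Z_full[::2, ::2]
print(Z_full.shape, Z_strided.shape)   # (6, 6) (3, 3)
```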

7.3.4 Multiple channels

The networks presented in Figure 7.8 and 7.10 only have 10 parameters each. Even though this
parameterization comes with a lot of advantages, one kernel is probably not sufficient to encode all
interesting properties of the images in our data set. To extend the network, we add multiple kernels, each
with their own set of kernel parameters. Each kernel produces its own set of hidden units—a so-called
channel—using the same convolution operation as explained in Section 7.3.2. Hence, each layer of
hidden units in a CNN is organized into a tensor with the dimensions (rows × columns × channels).
In Figure 7.11, the first layer of hidden units has four channels and that hidden layer consequently has
dimension 6 × 6 × 4.
When we continue to stack convolutional layers, each kernel depends not only on one channel, but on
all the channels in the previous layer. This is displayed in the second convolutional layer in Figure 7.11.
As a consequence, each kernel is a tensor of dimension (kernel rows × kernel columns × input channels).
For example, each kernel in the second convolutional layer in Figure 7.11 is of size 3 × 3 × 4. If we
collect all kernel parameters in one weight tensor W, that tensor will be of dimension (kernel rows ×
kernel columns × input channels × output channels). In the second convolutional layer in Figure 7.11, the
corresponding weight tensor W(2) is of dimension 3 × 3 × 4 × 6. With multiple kernels in each
convolutional layer, each of them can be sensitive to different features in the image, such as certain edges,
lines or circles, enabling a rich representation of the images in our training data.

7.3.5 Full CNN architecture

A full CNN architecture consists of multiple convolutional layers. Typically, we decrease the number of
rows and columns in the hidden layers as we proceed through the network, but instead increase the number
of channels to enable the network to encode more high level features. After a few convolutional layers we
usually end a CNN with one or more dense layers. If we consider an image classification task, we place a
softmax layer at the very end to get outputs in the range [0,1]. The loss function when training a CNN will
be the same as in the regression and classification networks explained earlier, depending on which type of
problem we have at hand. In Figure 7.11 a small example of a full CNN architecture is displayed.


[Figure 7.11 shows the full architecture: input variables of dimension 6 × 6 × 1; a convolutional layer with W(1) ∈ R^{3×3×1×4} and b(1) ∈ R^4 producing hidden units of dimension 6 × 6 × 4; a convolutional layer with W(2) ∈ R^{3×3×4×6} and b(2) ∈ R^6 producing hidden units of dimension 3 × 3 × 6; a dense layer with W(3) ∈ R^{54×M3} and b(3) ∈ R^{M3} producing M3 hidden units; a dense layer with W(4) ∈ R^{M3×K} and b(4) ∈ R^K producing the K logits z_1, …, z_K; and a softmax layer producing the outputs p(1 | x; θ), …, p(K | x; θ).]

Figure 7.11: A full CNN architecture for classification of grayscale 6 × 6 images. In the first convolutional layer
four kernels, each of size 3 × 3, produce a hidden layer with four channels. The first channel (at the bottom) is
visualized in red and the fourth (at the top) in blue. We use the stride [1,1] which maintains the number of rows and
columns. In the second convolutional layer, six kernels of size 3 × 3 × 4 and the stride [2,2] are used. They produce
a hidden layer with 3 rows, 3 columns and 6 channels. After the two convolutional layers follows a dense layer
where all 3 · 3 · 6 = 54 hidden units in the second hidden layer are densely connected to the third layer of hidden
units where all links have their unique parameters. We add an additional dense layer mapping down to the K logits.
The network ends with a softmax function to provide predicted class probabilities as output.
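As one possible way of realizing the architecture in Figure 7.11 in code, the sketch below uses Keras, a library that these notes do not introduce; the sizes M3 = 32 and K = 10 are placeholders, and the sigmoid activations mirror the σ units in the figure. The layer shapes follow the figure: a 6 × 6 × 1 input, a stride-[1,1] convolutional layer with four 3 × 3 kernels, a stride-[2,2] convolutional layer with six kernels, two dense layers and a final softmax.

```python
import tensorflow as tf

M3, K = 32, 10   # placeholder sizes for the third hidden layer and the number of classes

model = tf.keras.Sequential([
    tf.keras.Input(shape=(6, 6, 1)),                                # grayscale 6x6 image
    tf.keras.layers.Conv2D(4, kernel_size=3, strides=1,
                           padding='same', activation='sigmoid'),   # W(1): 3x3x1x4, b(1): 4 -> 6x6x4
    tf.keras.layers.Conv2D(6, kernel_size=3, strides=2,
                           padding='same', activation='sigmoid'),   # W(2): 3x3x4x6, b(2): 6 -> 3x3x6
    tf.keras.layers.Flatten(),                                      # 3*3*6 = 54 hidden units
    tf.keras.layers.Dense(M3, activation='sigmoid'),                # W(3): 54xM3, b(3): M3
    tf.keras.layers.Dense(K),                                       # W(4): M3xK, b(4): K -- the logits
    tf.keras.layers.Softmax(),                                      # predicted class probabilities
])
model.summary()
```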

7.4 Training a neural network

To use a neural network for prediction we need to find an estimate $\hat{\theta}$ of the parameters $\theta$. To do that we
solve an optimization problem of the form
$$\hat{\theta} = \arg\min_{\theta} J(\theta), \qquad \text{where} \quad J(\theta) = \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, \theta). \tag{7.23}$$

We denote J(θ) as the cost function and L(xi, yi, θ) as the loss function. The functional form of the loss
function depends on whether we have a regression or a classification problem at hand, see e.g. (7.14) and (7.20).
These optimization problems cannot be solved in closed form, so numerical optimization has to be
used. In Appendix B, an introduction to numerical optimization is provided. In all numerical optimization
algorithms the parameters are updated in an iterative manner. In deep learning we typically use various
versions of gradient descent:

1. Pick an initialization $\theta_0$.
2. Update the parameters as $\theta_{t+1} = \theta_t - \gamma \nabla_{\theta} J(\theta_t)$ for $t = 1, 2, \dots$ (7.24)
3. Terminate when some criterion is fulfilled, and take the last $\theta_t$ as $\hat{\theta}$.

In many applications of deep learning we cannot afford to compute the exact gradient ∇θ J(θ t ) at each
iteration. Instead we use approximations, which are explained in Section 7.4.2. In Section 7.4.3, strategies
for how to tune the learning rate γ are presented, and in Section 7.4.4 a popular regularization method called
dropout is described. First, however, a few words on how to initialize the training.

7.4.1 Initialization
The previous optimization problems (LASSO (2.31), logistic regression (3.8)) that we have encountered
have all been convex. This means that we can guarantee global convergence regardless of what initialization
θ0 we use. In contrast, the cost functions for training neural networks are usually non-convex. This means
that the training is sensitive to the value of the initial parameters. Typically, we initialize all the parameters


to small random numbers such that we ensure that the different hidden units encode different aspects of
the data. If the ReLU activation functions are used, offset elements b0 are typically initialized to a small
positive value such that they operate in the non-negative range of the ReLU.
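A minimal sketch of such an initialization for a single layer, assuming NumPy; the scale 0.01 for the random weights and the value 0.1 for the ReLU offsets are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, relu=True):
    W = 0.01 * rng.standard_normal((n_out, n_in))           # small random weights break the symmetry
    b = 0.1 * np.ones(n_out) if relu else np.zeros(n_out)   # small positive offsets keep ReLUs in their active range
    return W, b

W1, b1 = init_layer(784, 128)   # e.g. the first layer of a network for 28x28 images (illustrative sizes)
```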

7.4.2 Stochastic gradient descent

Many problems that are addressed with deep learning contain more than a million training data points n,
and the design of the neural network is typically made such that θ has more than a million elements. This
provides a computational challenge.
A crucial component is the computation of the gradient required in the optimization routine (7.24),
$$\nabla_{\theta} J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta} L(x_i, y_i, \theta). \tag{7.25}$$

If the number of data points n is big, this operation is costly. However, we can often assume that the data
set is redundant, meaning that many of the data points are similar. Then the gradient based on the first half
of the dataset, $\nabla_{\theta} J(\theta) \approx \frac{1}{n/2}\sum_{i=1}^{n/2} \nabla_{\theta} L(x_i, y_i, \theta)$, is almost identical to the gradient based on the
second half of the dataset, $\nabla_{\theta} J(\theta) \approx \frac{1}{n/2}\sum_{i=n/2+1}^{n} \nabla_{\theta} L(x_i, y_i, \theta)$. Consequently, it is a waste of time to compute
the gradient based on the whole data set. Instead, we could compute the gradient based on the first half of
the data set, update the parameters, and then get the gradient for the new parameters based on the second
half of the data,
$$\theta_{t+1} = \theta_t - \gamma \frac{1}{n/2}\sum_{i=1}^{n/2} \nabla_{\theta} L(x_i, y_i, \theta_t), \tag{7.26a}$$
$$\theta_{t+2} = \theta_{t+1} - \gamma \frac{1}{n/2}\sum_{i=n/2+1}^{n} \nabla_{\theta} L(x_i, y_i, \theta_{t+1}). \tag{7.26b}$$

These two steps would only require roughly half the computational time in comparison to if we had used
the whole data set for each gradient computation.
The extreme version of this strategy would be to use only a single data point each time when computing
the gradient. However, most commonly when training a deep neural network we do something in between,
using more than one data point but not all data points when computing the gradient. We use a smaller set
of training data called a mini-batch. Typically, a mini-batch contains nb = 10, nb = 100 or nb = 1000
data points.
One important aspect when using mini-batches is that the different mini-batches are balanced and
representative of the whole data set. For example, if we have a big training data set with a few different
classes and the data set is sorted by class (i.e. samples belonging to class k = 1 come first, and so on),
a mini-batch with the first nb samples would only include one class and hence not give a good approximation
of the gradient for the full data set.
For this reason, we prefer to draw nb training data points at random from the training data to form a
mini-batch. One implementation of this is to first randomly shuffle the training data, before dividing it into
mini-batches in an ordered manner. One complete pass through the training data is called an epoch. When
we have completed one epoch, we do another random reshuffling of the training data and do another pass
through the data set. We call this procedure stochastic gradient descent or mini-batch gradient descent. A
pseudo algorithm is presented in Algorithm 8.
Since the neural network model is a composition of multiple layers, the gradient of the loss function
with respect to all the parameters, $\nabla_{\theta} L(x_i, y_i, \theta)\big|_{\theta=\theta_t}$, can be computed analytically and efficiently by
applying the chain rule of differentiation. This is called back-propagation and is not described further here.
The interested reader can find more in, for example, Goodfellow, Bengio, and Courville 2016, Section 6.5.


Algorithm 8: Mini-batch gradient descent

1. Initialize all the parameters $\theta_0$ in the network and set $t \leftarrow 1$.

2. For $i = 1$ to $E$
   a) Randomly shuffle the training data $\{(x_i, y_i)\}_{i=1}^{n}$.
   b) For $j = 1$ to $n/n_b$
      i. Approximate the gradient of the loss function using the mini-batch $\{(x_i, y_i)\}_{i=(j-1)n_b+1}^{j n_b}$:
         $\hat{g}_t = \frac{1}{n_b}\sum_{i=(j-1)n_b+1}^{j n_b} \nabla_{\theta} L(x_i, y_i, \theta)\big|_{\theta=\theta_t}$.
      ii. Do a gradient step $\theta_{t+1} = \theta_t - \gamma \hat{g}_t$.
      iii. Update the iteration index $t \leftarrow t + 1$.
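A sketch of Algorithm 8 in NumPy is given below. The loss gradient grad_L(x, y, theta) is a placeholder for whatever the gradient of the chosen loss function looks like (back-propagation would provide it for a neural network), and the learning rate, batch size and number of epochs are illustrative.

```python
import numpy as np

def minibatch_gradient_descent(theta, X, Y, grad_L, gamma=0.1, n_b=100, E=10, seed=0):
    """Algorithm 8: E epochs of mini-batch gradient descent with batch size n_b."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(E):
        perm = rng.permutation(n)                     # 2a) randomly shuffle the training data
        for j in range(n // n_b):
            idx = perm[j * n_b:(j + 1) * n_b]         # the j:th mini-batch
            g_hat = np.mean([grad_L(X[i], Y[i], theta) for i in idx], axis=0)  # 2(b)i: gradient estimate
            theta = theta - gamma * g_hat             # 2(b)ii: gradient step
    return theta

# Example use on the linear regression loss L(x, y, theta) = (x^T theta - y)^2 (illustrative):
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
theta_hat = minibatch_gradient_descent(np.zeros(3), X, Y,
                                       lambda x, y, th: 2 * x * (x @ th - y))
print(theta_hat)   # close to [1.0, -2.0, 0.5]
```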

7.4.3 Learning rate

An important tuning parameter for (stochastic) gradient descent is the learning rate γ. The learning rate γ
decides the length of the gradient step that we take at each iteration. If we use a too low learning rate, the
estimate θt will not change much from one iteration to the next and the learning will progress more slowly than
necessary. This is illustrated in Figure 7.12a for a small optimization problem with only one parameter θ.
In contrast, with a too high learning rate, the estimate will overshoot the optimum and never converge since the
step is too long, see Figure 7.12b. For a learning rate which is neither too low nor too high, convergence is
achieved in a reasonable number of iterations. A good strategy to find a good learning rate is:

• if the error keeps getting worse or oscillates wildly, reduce the learning rate

• if the error is decreasing fairly consistently but slowly, increase the learning rate.

Convergence with gradient descent can be achieved with a constant learning rate, since the gradient
itself approaches zero when we reach the optimum, and hence so does the gradient step $\gamma \nabla_{\theta} J(\theta)\big|_{\theta=\theta_t}$. However,
this is not true for stochastic gradient descent, since the gradient $\hat{g}_t$ is only an approximation of the true
gradient $\nabla_{\theta} J(\theta)\big|_{\theta=\theta_t}$, and $\hat{g}_t$ will not necessarily approach 0 as J(θ) approaches its minimum. As a
consequence, we will take too large steps as we start approaching the optimum and the stochastic
gradient algorithm will not converge. In practice, we instead adjust the learning rate. We start with a fairly
high learning rate and then decay the learning rate to a certain level. This can, for example, be achieved by
the rule
$$\gamma_t = \gamma_{\min} + (\gamma_{\max} - \gamma_{\min})\, e^{-t/\tau}. \tag{7.27}$$

Here the learning rate starts at γmax and goes to γmin as t → ∞. How to pick the parameters γmin, γmax
and τ is more of an art than a science. As a rule of thumb, γmin can be chosen as approximately 1% of
γmax. The parameter τ depends on the size of the data set and the complexity of the problem, but it should
be chosen such that multiple epochs have passed before we reach γmin. The strategy for picking γmax can be
the same as for normal gradient descent, explained above.
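A small sketch of the decay rule (7.27) in Python; the values of γmax, γmin and τ below are illustrative only.

```python
import numpy as np

def learning_rate(t, gamma_max=0.1, gamma_min=0.001, tau=2000):
    """The decay rule (7.27): starts at gamma_max and approaches gamma_min as t grows."""
    return gamma_min + (gamma_max - gamma_min) * np.exp(-t / tau)

print(learning_rate(0))       # 0.1, i.e. gamma_max
print(learning_rate(20_000))  # essentially gamma_min, here 1% of gamma_max
```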
Under certain regularity conditions, and if the so-called Robbins–Monro condition $\sum_{t=1}^{\infty} \gamma_t = \infty$
and $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$ holds, then stochastic gradient descent converges almost surely to a local minimum. However,
to be able to satisfy the Robbins–Monro condition we need $\gamma_t \to 0$ as $t \to \infty$. In practice this is typically
not the case and we instead let the learning rate approach a non-zero level γmin > 0 by using a scheme like
the one in (7.27). This has been found to work better in practice in many situations, despite sacrificing the
theoretical convergence of the algorithm.


[Figure 7.12 shows three plots of a scalar cost function J(θ) and the gradient descent iterates for (a) a low learning rate γ = 0.05, (b) a high learning rate γ = 1.2, and (c) a good learning rate γ = 0.3.]

Figure 7.12: Optimization using gradient descent of a cost function J(θ) where θ is a scalar parameter. In the
different subfigures we use a too low learning rate (a), a too high learning rate (b), and a good learning rate (c).

7.4.4 Dropout

Like all models presented in this course, neural network models can suffer from overfitting if we have a
too flexible model in relation to the complexity of the data. Bagging (James et al. 2013, Chapter 8.2) is
one way to reduce the variance, and thereby also the overfitting, of the model. In bagging we train an entire
ensemble of models. Each model (ensemble member) is trained on its own data set, which has been
bootstrapped (sampled with replacement) from the original training data set. To make a prediction, we
first make one prediction with each model (ensemble member), and then average over all models to obtain
the final prediction.
Bagging is also applicable to neural networks. However, it comes with some practical problems; a large
neural network model usually takes quite some time to train and it also has quite some parameters to store.
To train not just one but an entire ensemble of many large neural networks would thus be very costly, both
in terms of runtime and memory. Dropout (Srivastava et al. 2014) is a bagging-like technique that allows
us to combine many neural networks without the need to train them separately. The trick is to let the
different models share parameters with each other, which reduces the computational cost and memory
requirement.

Ensemble of sub-networks

Consider a neural network like the one in Figure 7.13a. In dropout we construct the equivalent of an
ensemble member by randomly removing some of the hidden units. We say that we drop the units, hence
the name dropout. By this we achieve a sub-network of our original network. Two such sub-networks
are displayed in Figure 7.13b. We randomly sample, with a pre-defined probability, which units to drop,
and the collection of dropped units in one sub-network is independent of the collection of dropped
units in another sub-network. When a unit is removed, all its incoming and outgoing connections are also
removed. Not only hidden units can be dropped, but also input variables.
Since all sub-networks are parts of the very same original network, the different sub-networks share some
parameters with each other. For example, in Figure 7.13b the parameter $\beta_{55}^{(1)}$ is present in both sub-networks.
The fact that they share parameters with each other allows us to train the ensemble of sub-networks in an
efficient manner.

Training with dropout

To train with dropout we use the mini-batch gradient descent algorithm described in Algorithm 8. In
each gradient step a mini-batch of data is used to compute an approximation of the gradient, as before.
However, instead of computing the gradient for the full network, we generate a random sub-network by


[Figure 7.13, panel (a): a standard neural network with five inputs and two hidden layers. Panel (b): two sub-networks in which randomly selected units, together with all their incoming and outgoing links, have been dropped (marked with ×); the shared parameter $\beta_{55}^{(1)}$ is present in both sub-networks.]

Figure 7.13: A neural network with two hidden layers (a), and two sub-networks with dropped units (b). The
collections of dropped units are independent between the two sub-networks.

randomly dropping units as described above. We compute the gradient for that sub-network as if the
dropped units were not present and then do a gradient step. This gradient step only updates the parameters
present in the sub-network. The parameters that are not present are left untouched. In the next gradient
step we grab another mini-batch of data, remove another randomly selected collection of units and update
the parameters present in that sub-network. We proceed in this manner until some terminal condition is
fulfilled.
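A sketch of how one such random sub-network can be generated and used in a forward pass, assuming NumPy, a single hidden layer and a keep probability of 0.8; all of these are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_dropout(x, W1, b1, W2, b2, keep_prob=0.8):
    """Forward pass through a random sub-network: each hidden unit is kept with probability keep_prob."""
    h = sigmoid(W1 @ x + b1)
    mask = rng.random(h.shape) < keep_prob   # sampled anew for every gradient step
    h = h * mask                             # dropped units do not contribute to the output
    return W2 @ h + b2                       # only parameters connected to kept units receive gradients

W1, b1 = 0.01 * rng.standard_normal((4, 3)), np.zeros(4)   # illustrative sizes: 3 inputs, 4 hidden units
W2, b2 = 0.01 * rng.standard_normal((2, 4)), np.zeros(2)   # 2 outputs
print(forward_with_dropout(np.ones(3), W1, b1, W2, b2))
```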

Dropout vs bagging

This procedure to generate an ensemble of models differs from bagging in a few ways:

• In bagging all models are independent in the sense that they have their own parameters. In dropout
the different models (the sub-networks) share parameters.

• In bagging each model is trained until convergence. In dropout each sub-network is only trained for
  a single gradient step. However, since the sub-networks share parameters, all models are also updated
  when the other sub-networks are trained.

• Similar to bagging, in dropout we train each model on a data set that has been randomly selected
from our training data. However, in bagging we usually do it on a bootstrapped version of the whole
data set whereas in dropout each model is trained on a randomly selected mini-batch of data.

Even though dropout differs from bagging in some aspects, it has empirically been shown to enjoy similar
properties to bagging in terms of avoiding overfitting and reducing the variance of the model.

Prediction at test time

After we have trained the sub-networks, we want to make a prediction based on an unseen input data
point x⋆. In bagging we evaluate all the different models in the ensemble and combine their results. This
would be infeasible in dropout due to the very large (combinatorial) number of possible sub-networks.


[Figure 7.14 shows the full network used at prediction time: all units and links are present, and each estimated parameter going out from a unit, e.g. $r\hat{\beta}_{11}^{(2)}$, $r\hat{\beta}_{51}^{(3)}$ and $r\hat{\beta}_{55}^{(1)}$, is multiplied by the keep probability r before the prediction $p(k \mid x; \hat{\theta})$ is computed.]

Figure 7.14: The network used for prediction after being trained with dropout. All units and links are present (no
dropout) but the weights going out from a certain unit are multiplied with the probability of that unit being included
during training. This is to compensate for the fact that some of them were dropped during training. Here all units
have been kept with the probability r during training (and dropped with the probability 1 − r).

However, there is a simple trick to approximately achieve the same result. Instead of evaluating all possible
sub-networks we simply evaluate the full network containing all the parameters. To compensate for the
fact that the model was trained with dropout, we multiply each estimated parameter going out from a
unit with the probability of that unit being included during training. This ensures that the expected value
of the input to a unit is the same during training and testing, as during training only a fraction of the
incoming links were active. For instance, assume that during training we kept each unit with probability p
in all layers; then during testing we multiply all estimated parameters with p before we make a prediction
based on the network. This is illustrated in Figure 7.14. This procedure of approximating the average over all
ensemble members has been shown to work surprisingly well in practice even though there is not yet any
solid theoretical argument for the accuracy of this approximation.
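A sketch of the corresponding prediction step, under the same assumptions as in the training sketch above: W1, b1, W2, b2 are the parameters estimated with dropout, all units are kept, and the weights leaving a layer are scaled by that layer's keep probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_after_dropout(x, W1, b1, W2, b2, p_input=1.0, p_hidden=0.8):
    """Full network, with weights leaving each layer scaled by that layer's keep probability."""
    h = sigmoid((p_input * W1) @ x + b1)   # inputs were kept with probability p_input during training
    return (p_hidden * W2) @ h + b2        # hidden units were kept with probability p_hidden
```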

Dropout as a regularization method

As a way to reduce the variance and avoid overfitting, dropout can be seen as a regularization method.
There are plenty of other regularization methods for neural networks including parameter penalties (like
we did in ridge regression and LASSO in Section 2.6.1 and 2.6.2), early stopping (you stop the training
before the parameters have converged, and thereby avoid overfitting) and various sparse representations
(for example CNNs can be seen as a regularization method where most parameters are forced to be zero) to
mention a few. Since its invention, dropout has become one of the most popular regularization techniques
due to its simplicity, computationally cheap training and testing procedure and its good performance. In
fact, a good practice when designing a neural network is often to extend the network until you see that it
starts overfitting, extend it a bit more, and then add regularization like dropout to avoid that overfitting.

7.5 Perspective and further reading


Although the first conceptual ideas of neural networks date back to the 1940s (McCulloch and Pitts
1943), they had their first main success stories in the late 1980s and early 1990s with the use of the
so-called back-propagation algorithm. At that stage, neural networks could, for example, be used to
classify handwritten digits from low-resolution images (LeCun, Boser, et al. 1990). However, in the late
1990s neural networks were largely forsaken because it was widely thought that they could not be used to
solve any challenging problems in computer vision and speech recognition. In these areas, neural networks
could not compete with hand-crafted solutions based on domain specific prior knowledge.
This picture has changed dramatically since the late 2000s, with neural networks of multiple layers now going under the name deep
learning. Progress in software, hardware and algorithm parallelization made it possible to address more
complicated problems, which were unthinkable only a couple of decades ago. For example, in image


recognition, these deep models are now the dominant methods in use and they reach almost human
performance on some specific tasks (LeCun, Bengio, and Hinton 2015). Recent advances based on deep
neural networks have generated algorithms that can learn how to play computer games based on pixel
information only (Mnih et al. 2015), and automatically understand the situation in images for automatic
caption generation (Xu et al. 2015).
A fairly recent and accessible introduction and overview of deep learning is provided by LeCun, Bengio,
and Hinton (2015), and a recent textbook by Goodfellow, Bengio, and Courville (2016).

A Probability theory

A.1 Random variables


A random variable z is a variable that can take any value on a certain set, and its value depends on the
outcome of a random event. For example, if z describes the outcome of rolling a die, the possible outcomes
are {1, 2, 3, 4, 5, 6} and the probability of each possible outcome of a die roll is typically modeled to be
1/6. To denote this, we use the probability mass function (pmf) p and write in this case p(z) = 1/6 for
z = 1, . . . , 6.
In these lecture notes we will primarily consider random variables where z is continuous, for example
taking values in R (z is a scalar) or in Rd (z is a d-dimensional vector). Since there are infinitely many
possible outcomes, we cannot speak of the probability of an outcome—it is almost always zero—but we
use the probability density function (pdf) p (as for the pmf). The probability density function p describes
the probability of z to be within a certain set C,
$$\text{Probability of } z \text{ to be in the set } C = \int_{z \in C} p(z)\, dz. \tag{A.1}$$

A random variable z with a uniform distribution on the interval [0, 3] has the pdf p(z) = 1/3 for z ∈ [0, 3],
and p(z) = 0 otherwise. Note that pmfs are upper bounded by 1, whereas a pdf can possibly take values
larger than 1. However, it holds for pdfs that they always integrate¹ to 1: $\int p(z)\, dz = 1$.
A common probability distribution is the Gaussian (or Normal) distribution, whose density is defined as
$$p(z) = \mathcal{N}\left(z \mid \mu, \sigma^2\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(z-\mu)^2}{2\sigma^2}\right), \tag{A.2}$$
where we have made use of exp to denote the exponential function, exp(x) = e^x. We also use the notation
$z \sim \mathcal{N}\left(\mu, \sigma^2\right)$ to say that z has a Gaussian distribution with parameters µ and σ² (i.e., its probability
density function is given by (A.2)). The symbol ∼ reads ‘distributed according to’.
The expected value or mean of the random variable z is given by
$$\mathbb{E}[z] = \int z\, p(z)\, dz. \tag{A.3}$$

We can also compute the expected value of some arbitrary function g(z) applied to z as
$$\mathbb{E}[g(z)] = \int g(z)\, p(z)\, dz. \tag{A.4}$$

For a scalar random variable with mean µ = E[z] the variance is defined as
$$\mathrm{Var}[z] = \mathbb{E}[(z - \mu)^2] = \mathbb{E}[z^2] - \mu^2. \tag{A.5}$$

The variance measures the ‘spread’ of the distribution, i.e. how far a set of random numbers drawn from
the distribution are spread out from their mean. The variance is always non-negative. For the Gaussian
distribution (A.2) the mean and variance are given by the parameters µ and σ 2 respectively.
Now, consider two random variables z1 and z2 (both of which could be vectors). An important property
of pairs of random variables is that of independence. The variables z1 and z2 are said to be independent
¹For notational convenience, when the integration is over the whole domain of z we simply write $\int$.


[Figure A.1 shows a surface plot of the joint density p(z1, z2) over the (z1, z2)-plane, its marginal densities p(z1) and p(z2) as black curves, and the conditional density p(z1 | z2 = γ), obtained from the slice p(z1, z2 = γ), as a red curve.]

Figure A.1: Illustration of a two-dimensional joint probability distribution p(z1 , z2 ) (the surface) and its two
marginal distributions p(z1 ) and p(z2 ) (the black lines). We also illustrate the conditional distribution p(z1 |z2 = γ)
(the red line), which is the distribution of the random variable z1 conditioned on the observation z2 = γ (γ = 1.5 in
the plot).

if the joint pdf factorizes according to p(z1 , z2 ) = p(z1 )p(z2 ). Furthermore, for independent random
variables the expected value of any separable function factorizes as E[g1 (z1 )g2 (z2 )] = E[g1 (z1 )]E[g2 (z2 )].
From the joint probability density function we can deduce both its two marginal densities p(z1 ) and
p(z2 ) using marginalization, as well as the so called conditional probability density function p(z2 | z1 )
using conditioning. These two concepts will be explained below.

A.1.1 Marginalization

Consider a multivariate random variable z which is composed of two components z1 and z2 , which
could be either scalars or vectors, as z = [z1T , z2T ]T . If we know the (joint) probability density function
p(z) = p(z1 , z2 ), but are interested only in the marginal distribution for z1 , we can obtain the density
p(z1) by marginalization,
$$p(z_1) = \int p(z_1, z_2)\, dz_2. \tag{A.6}$$

The other marginal p(z2) is obtained analogously by integrating over z1 instead. In Figure A.1 a joint
two-dimensional density p(z1, z2) is illustrated along with its marginal densities p(z1) and p(z2).


A.1.2 Conditioning
Consider again the multivariate random variable z which can be partitioned in two parts z = [z1T , z2T ]T .
We can now define the conditional distribution of z1 , conditioned on having observed a value z2 = z2 , as

$$p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)}. \tag{A.7}$$

If we instead have observed a value of z1 = z1 and want to use that to find the conditional distribution of
z2 given z1 = z1 , it can be done analogously. In Figure A.1 a joint two-dimensional probability density
function p(z1 , z2 ) is illustrated along with a conditional probability density function p(z1 | z2 ).
From (A.7) it follows that the joint probability density function p(z1 , z2 ) can be factorized into the
product of a marginal times a conditional,

p(z1 , z2 ) = p(z2 | z1 )p(z1 ) = p(z1 | z2 )p(z2 ). (A.8)

If we use this factorization for the numerator of the right-hand side in (A.7) we end up with the
relationship
$$p(z_1 \mid z_2) = \frac{p(z_2 \mid z_1)\, p(z_1)}{p(z_2)}. \tag{A.9}$$

This equation is often referred to as Bayes’ rule.

A.2 Approximating an integral with a sum


An integral over a given smooth function h(z) and a probability density p(z) can be approximated with a
sum over M samples in the following fashion,
$$\int h(z)\, p(z)\, dz \approx \frac{1}{M}\sum_{i=1}^{M} h(z_i), \tag{A.10}$$

if each zi is drawn independently from zi ∼ p(z). This is called Monte Carlo integration. The approximate
equality becomes exact with probability one in the limit as the number of samples M → ∞.
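As a small sketch of (A.10), the snippet below approximates E[z²] = ∫ z² p(z) dz for z ∼ N(0, 1), whose exact value is 1; the choice of h(z) = z² and of the Gaussian p(z) is ours and only serves as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000
z = rng.standard_normal(M)    # M independent samples z_i ~ N(0, 1)
print(np.mean(z ** 2))        # Monte Carlo estimate of E[z^2]; approaches the exact value 1 as M grows
```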

B Unconstrained numerical optimization
Given a function L(θ), the optimization problem is about finding the value of the variable θ for which the
function L(θ) is either minimized or maximized. To be precise, it will here be formulated as finding the
value $\hat{\theta}$ that minimizes¹ the function L(θ) according to
$$\min_{\theta} L(\theta), \tag{B.1}$$

where the vector θ is allowed to be anywhere in Rn , motivating the name unconstrained optimization. The
function L(θ) will be referred to as the cost function2 , with the motivation that the minimization problem
in (B.1) is striving to minimize some cost. We will make the assumption that the cost function L(θ) is
continuously differentiable on Rn. If there are requirements on θ (e.g. that its components have to satisfy
a certain equation g(θ) = 0), the problem is instead referred to as a constrained optimization problem.
The unconstrained optimization problem (B.1) is ever-present across the sciences and engineering, since
it allows us to find the best—in some sense—solution to a particular problem. One example of this arises
when we are searching for the parameters in a linear regression problem by finding the parameters that
make the available measurements as likely as possible by maximizing the likelihood function. For a linear
model with Gaussian noise, this resulted in a least squares problem, for which there is an explicit expression
(the normal equations, (2.17)) describing the solution. However, for most optimization problems that
we face there are no explicit solutions available, forcing us to use approximate numerical methods in
solving these problems. We have seen several concrete examples of this kind, for example the optimization
problems arising in deep learning and logistic regression. This appendix provides a brief introduction to
the practical area of unconstrained numerical optimization.
The key in assembling a working optimization algorithm is to build a simple and useful model of the
complicated cost function L(θ) around the current value for θ. The model is often local in the sense that
it is only valid in a neighbourhood of this value. The idea is then to exploit this model to select a new
value for θ that corresponds to a smaller value for the cost function L(θ). The procedure is then repeated,
which explains why most numerical optimization algorithms are of iterative nature. There are of course
many different ways in which this can be done, but they all share a few key parts which we outline below.
Note that we only aim to provide the overall strategies underlying practical unconstrained optimization
algorithms, for precise details we refer to the many textbooks available on the subject, some of which are
referenced towards the end.

B.1 A general iterative solution


What do we mean by a solution to the unconstrained minimization problem in (B.1)? The best possible
solution is the global minimizer, which is a point $\hat{\theta}$ such that $L(\hat{\theta}) \leq L(\theta)$ for all $\theta \in \mathbb{R}^n$. The global
minimizer is often hard to find and instead we typically have to settle for a local minimizer. A
point $\hat{\theta}$ is said to be a local minimizer if there is a neighbourhood $\mathcal{M}$ of $\hat{\theta}$ such that $L(\hat{\theta}) \leq L(\theta)$ for all
$\theta \in \mathcal{M}$.
In our search for a local minimizer we have to start somewhere, let us denote this starting point by θ0 .
Now, if θ0 is not a local minimizer of L(θ) then there must be an increment d0 that we can add to θ0 such
that L(θ0 + d0 ) < L(θ0 ). By the same argument, if θ1 = θ0 + d0 is not a local minimizer then there must
¹Note that it is sufficient to cover minimization problems, since any maximization problem can be considered as a minimization
problem simply by changing the sign of the cost function.
²Throughout the course we have talked quite a lot about different loss functions. These loss functions are examples of cost
functions.

105
B Unconstrained numerical optimization

be another increment d1 that we can add to θ1 such that L(θ1 + d1 ) < L(θ1 ). This procedure is repeated
until it is no longer possible to find an increment that decreases the value of the objective function. We have
then found a local minimizer. Most of the algorithms capable of solving (B.1) are iterative procedures of
this kind. Before moving on, let us mention that the increment d is often resolved into two parts according
to

d = γp. (B.2)

Here, the scalar and positive parameter γ is commonly referred to as the step length and the vector p ∈ Rn
is referred to as the search direction. The intuition is that the algorithm is searching for the solution by
moving in the search direction, and how far it moves in this direction is controlled by the step length.
The above development does of course lead to several questions, where the most pertinent are the
following:
1. How can we compute a useful search direction p?
2. How big steps should we make, i.e. what is a good value of the step length γ?
3. How do we determine when we have reached a local minimizer, and stop searching for new
directions?
Throughout the rest of this section we will briefly discuss these questions and finally we will assemble the
general form of an algorithm that is often used for unconstrained minimization.
A straightforward way of finding a general characterization of all search directions p resulting in a
decrease in the value of the cost function, i.e. directions p such that

L(θ + p) < L(θ) (B.3)

is to build a local model of the cost function around the point θ. One model of this kind is provided by
Taylor’s theorem, which builds a local polynomial approximation of a function around some point of
interest. A linear approximation of the cost function L(θ) around the point θ is given by

L(θ + p) ≈ L(θ) + pT ∇L(θ). (B.4)

By inserting the linear approximation (B.4) of the objective function into (B.3) we can provide a more
precise formulation of how to find a search direction p such that L(θ + p) < L(θ) by asking for which p it
holds that L(θ) + pT ∇L(θ) < L(θ), which can be further simplified into

pT ∇L(θ) < 0. (B.5)

Inspired by the inequality above we choose a generic description of the search direction according to

$$p = -V \nabla L(\theta), \qquad V \succ 0, \tag{B.6}$$

where we have introduced some extra flexibility via the positive definite scaling matrix V . The inspiration
came from the fact that by inserting (B.6) into (B.5) we obtain

$$p^{\mathsf{T}} \nabla L(\theta) = -\nabla L(\theta)^{\mathsf{T}} V^{\mathsf{T}} \nabla L(\theta) = -\|\nabla L(\theta)\|_{V^{\mathsf{T}}}^2 < 0, \tag{B.7}$$

where the last inequality follows from the positivity of the squared weighted two-norm, which is defined
as $\|a\|_W^2 = a^{\mathsf{T}} W a$. This shows that p = −V ∇L(θ) will indeed result in a search direction that decreases
the value of the objective function. We refer to such a search direction as a descent direction.
The strategy summarized in Algorithm 9 is referred to as line search. Note that we have now introduced
subscript t to clearly show the iterative nature. The algorithm searches along the line defined by starting at
the current iterate θt and then moving along the search direction pt . The decision of how far to move
along this line is made by simply minimizing the cost function along the line

$$\min_{\gamma} L(\theta_t + \gamma p_t). \tag{B.8}$$


Algorithm 9: General form of unconstrained minimization

1. Set t = 0.

2. While the stopping criterion is not satisfied, do
   a) Compute a search direction $p_t = -V_t \nabla L(\theta_t)$ for some $V_t \succ 0$.
   b) Find a step length $\gamma_t > 0$ such that $L(\theta_t + \gamma_t p_t) < L(\theta_t)$.
   c) Set $\theta_{t+1} = \theta_t + \gamma_t p_t$.
   d) Set $t \leftarrow t + 1$.

3. End while

Note that this is a one-dimensional optimization problem, and hence simpler to deal with compared to the
original problem. The step length γt that is selected in (B.8) controls how far to move along the current
search direction pt . It is sufficient to solve this problem approximately in order to find an acceptable step
length, since as long as L(θt + γt pt ) < L(θt ) it is not crucial to find the global minimizer for (B.8).
There are several different indicators that can be used in designing a suitable stopping criterion for row 2
in Algorithm 9. The task of the stopping criterion is to control when to stop the iterations. Since we know
that the gradient is zero at a stationary point it is useful to investigate when the gradient is close to zero.
Another indicator is to keep an eye on the size of the increments between adjacent iterates, i.e. when θt+1
is close to θt .
In the so-called trust region strategy the order of step 2a and step 2b in Algorithm 9 is simply reversed,
i.e. we first decide how far to step and then we choose in which direction to move.
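A minimal sketch of Algorithm 9 is given below, with the simplest choice V_t = I (the steepest descent direction discussed in Section B.2.1) and a crude backtracking rule for the step length, i.e. an approximate solution of (B.8). The quadratic test function and all tolerances are our own choices.

```python
import numpy as np

def minimize(L, grad, theta0, gamma0=1.0, tol=1e-6, max_iter=1000):
    """Line-search minimization: p_t = -grad L(theta_t), halve the step length until the cost decreases."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:       # stopping criterion: gradient close to zero
            break
        p = -g                            # search direction with V_t = I
        gamma = gamma0
        while L(theta + gamma * p) >= L(theta) and gamma > 1e-12:
            gamma *= 0.5                  # approximate solution of (B.8)
        theta = theta + gamma * p
    return theta

# Example: L(theta) = ||theta - 1||^2 has its (global) minimizer at theta = (1, 1).
L = lambda th: np.sum((th - 1.0) ** 2)
grad = lambda th: 2.0 * (th - 1.0)
print(minimize(L, grad, np.zeros(2)))     # approximately [1. 1.]
```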

B.2 Commonly used search directions


Three of the most popular search directions correspond to specific choices when it comes to the positive
definite matrix Vt in step 2a of Algorithm 9. The simplest choice is to make use of the identity matrix,
resulting in the so-called steepest descent direction described in Section B.2.1. The Newton direction
(Section B.2.2) is obtained by using the inverse of the Hessian matrix and finally we have the quasi-Newton
direction (Section B.2.3) employing an approximation of the inverse Hessian.

B.2.1 Steepest descent direction


Let us start by noting that according to the definition of the scalar product³, the descent condition (B.5)
imposes the following requirement on the search direction

$$p^{\mathsf{T}} \nabla L(\theta_t) = \|p\|_2 \|\nabla L(\theta_t)\|_2 \cos(\varphi) < 0, \tag{B.9}$$

where ϕ denotes the angle between the two vectors p and ∇L(θt ). Since we are only interested in finding
the direction we can without loss of generality fix the length of p, implying the scalar product pT ∇L(θt )
is made as small as possible by selecting ϕ = π, corresponding to

p = −∇L(θt ). (B.10)

Recall that the gradient vector at a point is the direction of maximum rate of change of the function at that
point. This explains why the search direction suggested in (B.10) is referred to as the steepest descent
direction.
³The scalar (or dot) product of two vectors a and b is defined as $a^{\mathsf{T}} b = \|a\|\|b\|\cos(\varphi)$, where $\|a\|$ denotes the length
(magnitude) of the vector a and ϕ denotes the angle between a and b.


Sometimes, the use of the steepest descent direction can be very slow. The reason for this is that there is
more information available about the cost function that the algorithm can make use of, which brings us to
the Newton and the quasi-Newton directions described below. They make use of additional information
about the local geometry of the cost function by employing a more descriptive local model.

B.2.2 Newton direction


Let us now instead make use of a better model of the objective function, by also keeping the quadratic
term of the Taylor expansion. The result is the following quadratic approximation m(θt , pt ) of the cost
function around the current iterate θt
$$L(\theta_t + p_t) \approx \underbrace{L(\theta_t) + p_t^{\mathsf{T}} g_t + \frac{1}{2} p_t^{\mathsf{T}} H_t p_t}_{= m(\theta_t, p_t)} \tag{B.11}$$

where gt = ∇L(θ)|θ=θt denotes the cost function gradient and Ht = ∇2 L(θ)|θ=θt denotes the Hessian,
both evaluated at the current iterate θt . The idea behind the Newton direction is to select the search
direction that minimizes the quadratic model in (B.11), which is obtained by setting its derivative

$$\frac{\partial m(\theta_t, p_t)}{\partial p_t} = g_t + H_t p_t \tag{B.12}$$

to zero, resulting in

$$p_t = -H_t^{-1} g_t. \tag{B.13}$$

It is often too difficult or too expensive to compute the Hessian, which has motivated the development of
search directions employing an approximation of the Hessian. The generic name for these is quasi-Newton
directions.
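A sketch of a single Newton step (B.13), computed by solving the linear system H_t p_t = −g_t rather than forming the inverse explicitly; the quadratic test function is our own choice, and for it one Newton step reaches the minimizer exactly.

```python
import numpy as np

def newton_step(grad, hess, theta):
    """Newton direction p_t = -H_t^{-1} g_t, obtained by solving H_t p_t = -g_t."""
    g, H = grad(theta), hess(theta)
    return theta + np.linalg.solve(H, -g)

# For L(theta) = 0.5 theta^T A theta - b^T theta the minimizer is the solution of A theta = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda th: A @ th - b
hess = lambda th: A
print(newton_step(grad, hess, np.zeros(2)))   # equals the solution of A theta = b
```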

B.2.3 Quasi-Newton
The quasi-Newton direction makes use of a local quadratic model m(θt , pt ) of the cost function according
to (B.11), similarly to what was done in finding the Newton direction. However, rather than assuming that
the Hessian is available, the Hessian will now instead be learned from the information that is available in
the cost function values and its gradients.
Let us first denote the line segment connecting two adjacent iterates θt and θt+1 by

rt (τ ) = θt + τ (θt+1 − θt ), τ ∈ [0, 1]. (B.14)

From the fundamental theorem of calculus we know that
$$\int_0^1 \frac{\partial}{\partial \tau} \nabla L(r_t(\tau))\, d\tau = \nabla L(r_t(1)) - \nabla L(r_t(0)) = \nabla L(\theta_{t+1}) - \nabla L(\theta_t) = g_{t+1} - g_t, \tag{B.15}$$
and from the chain rule we have that
$$\frac{\partial}{\partial \tau} \nabla L(r_t(\tau)) = \nabla^2 L(r_t(\tau)) \frac{\partial r_t(\tau)}{\partial \tau} = \nabla^2 L(r_t(\tau))(\theta_{t+1} - \theta_t). \tag{B.16}$$
Hence, by combining (B.15) and (B.16) we obtain
$$y_t = \int_0^1 \frac{\partial}{\partial \tau} \nabla L(r_t(\tau))\, d\tau = \int_0^1 \nabla^2 L(r_t(\tau))\, s_t\, d\tau, \tag{B.17}$$

where we have defined yt = gt+1 − gt and st = θt+1 − θt . An interpretation of the above equation is
that the difference between two consecutive gradients yt is given by integrating the Hessian times st for


points θ along the line segment rt (τ ) defined in (B.14). The approximation underlying quasi-Newton
methods is now to assume that this integral can be described by a constant matrix Bt+1 , resulting in the
following approximation

yt = Bt+1 st (B.18)

of the integral (B.17), which is sometimes referred to as the secant condition or the quasi-Newton equation.
The secant condition above is still not enough to determine the matrix Bt+1 , since even though we
know that Bt+1 is symmetric there are still too many degrees of freedom available. This is solved using
regularization and Bt+1 is selected as the solution to

$$B_{t+1} = \arg\min_{B} \|B - B_t\|_W^2, \quad \text{s.t.} \quad B = B^{\mathsf{T}}, \;\; B s_t = y_t, \tag{B.19}$$

for some weighting matrix W. Depending on which weighting matrix is used we obtain different
algorithms. The most common quasi-Newton algorithms are referred to as BFGS (named after Broyden,
Fletcher, Goldfarb and Shanno), DFP (named after Davidon, Fletcher and Powell) and Broyden’s method.
The resulting Hessian approximation Bt+1 is then used in place of the true Hessian.
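For concreteness, one common resulting update is the BFGS formula, which is only named (not derived) in these notes; the sketch below states it and checks that the updated approximation satisfies the secant condition (B.18). Treat it as an illustration rather than as a complete quasi-Newton method.

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation; the result is symmetric and satisfies B_new @ s = y."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(0)
B = np.eye(3)                                    # initial Hessian approximation
s, y = rng.normal(size=3), rng.normal(size=3)    # s_t = theta_{t+1} - theta_t, y_t = g_{t+1} - g_t
print(np.allclose(bfgs_update(B, s, y) @ s, y))  # True: the secant condition (B.18) holds
```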

B.3 Further reading


This appendix is heavily inspired by the solid general introduction to the topic of numerical solutions to
optimization problems given by Nocedal and Wright (2006) and by Wills (2017). In solving optimization
problems the initial important classification of the problem is whether it is convex or non-convex. Here
we have mainly been concerned with the numerical solution of non-convex problems. When it comes to
convex problems Boyd and Vandenberghe (2004) provide a good engineering introduction. A thorough
and timely introduction to the use of numerical optimization in the machine learning context is provided
by Bottou, Curtis, and Nocedal (2017). The focus is naturally on large scale problems and as we have
explained in the deep learning chapter this naturally leads to stochastic optimization problems.

Bibliography
Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin (2012). Learning From Data. A short
course. AMLbook.com.
Barber, David (2012). Bayesian reasoning and machine learning. Cambridge University Press.
Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
Bottou, L., F. E. Curtis, and J. Nocedal (2017). Optimization methods for large-scale machine learning.
Tech. rep. arXiv:1606.04838v2.
Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge, UK: Cambridge University
Press.
Breiman, Leo (Oct. 2001). “Random Forests”. In: Machine Learning 45.1, pp. 5–32. issn: 1573-0565. doi:
10.1023/A:1010933404324. url: https://doi.org/10.1023/A:1010933404324.
Deisenroth, M. P., A. Faisal, and C. O. Ong (2019). Mathematics for machine learning. Cambridge
University Press.
Dheeru, Dua and Efi Karra Taniskidou (2017). UCI Machine Learning Repository. url: http://archive.ics.uci.edu/ml.
Efron, Bradley and Trevor Hastie (2016). Computer age statistical inference. Cambridge University Press.
Ezekiel, Mordecai and Karl A. Fox (1959). Methods of Correlation and Regression Analysis. John Wiley
& Sons, Inc.
Freund, Yoav and Robert E. Schapire (1996). “Experiments with a new boosting algorithm”. In: Proceedings
of the 13th International Conference on Machine Learning (ICML).
Friedman, Jerome (2001). “Greedy function approximation: A gradient boosting machine”. In: Annals of
Statistics 29.5, pp. 1189–1232.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2000). “Additive logistic regression: a statistical
view of boosting (with discussion)”. In: The Annals of Statistics 28.2, pp. 337–407.
Gelman, Andrew et al. (2013). Bayesian data analysis. 3rd ed. CRC Press.
Ghahramani, Zoubin (May 2015). “Probabilistic machine learning and artificial intelligence”. In: Nature
521.7553, pp. 452–459.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. http://www.deeplearningbook.org. MIT Press.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman (2009). The elements of statistical learning. Data
mining, inference, and prediction. 2nd ed. Springer.
Hastie, Trevor, Robert Tibshirani, and Martin J. Wainwright (2015). Statistical learning with sparsity: the
Lasso and generalizations. CRC Press.
Hoerl, Arthur E. and Robert W. Kennard (1970). “Ridge regression: biased estimation for nonorthogonal
problems”. In: Technometrics 12.1, pp. 55–67.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). An introduction to statistical
learning. With applications in R. Springer.
Jordan, M. I. and T. M. Mitchell (2015). “Machine learning: trends, perspectives, and prospects”. In:
Science 349.6245, pp. 255–260.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521, pp. 436–444.
LeCun, Yann, Bernhard Boser, et al. (1990). “Handwritten Digit Recognition with a Back-Propagation
Network”. In: Advances in Neural Information Processing Systems (NIPS), pp. 396–404.
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University
Press.


Mason, Llew, Jonathan Baxter, Peter Bartlett, and Marcus Frean (1999). “Boosting Algorithms as Gradient
Descent”. In: Proceedings of the 12th International Conference on Neural Information Processing
Systems (NIPS).
McCulloch, Warren S and Walter Pitts (1943). “A logical calculus of the ideas immanent in nervous
activity”. In: The bulletin of mathematical biophysics 5.4, pp. 115–133.
Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature
518.7540, pp. 529–533.
Murphy, Kevin P. (2012). Machine learning – a probabilistic perspective. MIT Press.
Nocedal, J. and S. J. Wright (2006). Numerical Optimization. 2nd ed. Springer Series in Operations
Research. New York, USA: Springer.
Srivastava, Nitish et al. (2014). “Dropout: A simple way to prevent neural networks from overfitting”. In:
The Journal of Machine Learning Research 15.1, pp. 1929–1958.
Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the LASSO”. In: Journal of the Royal
Statistical Society (Series B) 58.1, pp. 267–288.
Wills, A. G. (2017). “Real-time optimisation for embedded systems”. Lecture notes.
Xu, Kelvin et al. (2015). “Show, attend and tell: Neural image caption generation with visual attention”.
In: Proceedings of the International Conference on Machine Learning (ICML).
