Lecture 1, Part 1: Linear Regression: Roger Grosse
Roger Grosse
1 Introduction
Let’s jump right in and look at our first machine learning algorithm, linear
regression. In regression, we are interested in predicting a scalar-valued
target, such as the price of a stock. By linear, we mean that the target must
be predicted as a linear function of the inputs. This is a kind of supervised
learning algorithm; recall that, in supervised learning, we have a collection
of training examples labeled with the correct outputs.
Regression is an important problem in its own right. But today’s discussion
will also highlight a number of themes which will recur throughout
the course:
• Thinking about the data points and the model parameters as vectors.
• Deriving both the closed-form solution and the gradient descent updates
for linear regression.
• Seeing how linear regression can learn nonlinear functions using feature
maps.
Figure 1: Three possible hypotheses for a linear regression model, shown in
data space and weight space.
2 Problem setup
In order to formulate a learning problem mathematically, we need to define
two things: a model and a loss function. The model, or architecture,
defines the set of allowable hypotheses, or functions that compute
predictions from the inputs. In the case of linear regression, the model simply
consists of linear functions. Recall that a linear function of D inputs is
parameterized in terms of D coefficients, which we’ll call the weights, and
an intercept term, which we’ll call the bias. Mathematically, this is written
as:
$$y = \sum_j w_j x_j + b. \tag{1}$$
Figure 1 shows two ways to visualize linear models. In this case, the data are
one-dimensional, so the model reduces to simply y = wx + b. On one side,
we have the data space, or input space, where t is plotted as a function
of x. Three different possible linear fits are shown. On the other side, we
have weight space, where the corresponding pairs (w, b) are plotted.
[Margin note: You should study these figures and try to understand how the
lines in the left figure map onto the X’s on the right figure. Think back to
middle school. Hint: w is the slope of the line, and b is the y-intercept.]
Clearly, some of these linear fits are better than others. In order to
quantify how good the fit is, we define a loss function. This is a function
L(y, t) which says how far off the prediction y is from the target t. In linear
regression, we use squared error, defined as
$$\mathcal{L}(y, t) = \frac{1}{2}(y - t)^2. \tag{2}$$
This is small when y and t are close together, and large when they are far
apart. In general, the value y − t is known as the residual, and we’d like
the residuals to be close to zero.
[Margin note: Why is there the factor of 1/2 in front? It just makes the
calculations convenient.]
Figure 2: Left: three hypotheses for a regression dataset. Middle: Contour
plot of least-squares cost function for the regression problem. Colors of the
points match the hypotheses. Right: Surface plot matching the contour
plot. Surface plots are usually hard to interpret, so we won’t look at them
very often.
When we combine our model and loss function, we get an optimization
problem, where we are trying to minimize a cost function with respect
to the model parameters (i.e. the weights and bias). The cost function is
simply the loss, averaged over all the training examples. When we plug in
the model definition (Eqn. 1), we get the following cost function:
$$\mathcal{J}(w_1, \ldots, w_D, b) = \frac{1}{N} \sum_{i=1}^N \mathcal{L}(y^{(i)}, t^{(i)}) \tag{3}$$
$$= \frac{1}{2N} \sum_{i=1}^N \left( y^{(i)} - t^{(i)} \right)^2 \tag{4}$$
$$= \frac{1}{2N} \sum_{i=1}^N \Big( \sum_j w_j x_j^{(i)} + b - t^{(i)} \Big)^2 \tag{5}$$
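To make Eqns. 1 through 5 concrete, here is a minimal sketch in plain Python. The function names and the toy dataset are assumptions made for illustration; the loops deliberately mirror the sums over i and j rather than aiming for efficiency.

```python
def predict(w, b, x):
    """Linear model of Eqn. 1: y = sum_j w_j x_j + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def squared_error(y, t):
    """Squared error loss of Eqn. 2."""
    return 0.5 * (y - t) ** 2


def cost(w, b, X, T):
    """Loss averaged over the N training examples (Eqns. 3-5)."""
    N = len(X)
    return sum(squared_error(predict(w, b, x), t) for x, t in zip(X, T)) / N


# Hypothetical one-dimensional dataset generated by t = 2x + 1.
X = [[0.0], [1.0], [2.0]]
T = [1.0, 3.0, 5.0]

exact_fit_cost = cost([2.0], 1.0, X, T)   # hypothesis (w, b) = (2, 1)
zero_model_cost = cost([0.0], 0.0, X, T)  # hypothesis (w, b) = (0, 0)
print(exact_fit_cost, zero_model_cost)
```

The hypothesis that generated the data has zero cost, while the all-zeros hypothesis has residuals −1, −3, −5 and therefore cost (1 + 9 + 25)/(2 · 3) = 35/6.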
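Taking the partial derivatives of the cost function with respect to the
parameters, by applying the chain rule to Eqn. 5, we get
$$\frac{\partial \mathcal{J}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} \Big( \sum_{j'} w_{j'} x_{j'}^{(i)} + b - t^{(i)} \Big) \tag{6}$$
$$\frac{\partial \mathcal{J}}{\partial b} = \frac{1}{N} \sum_{i=1}^N \Big( \sum_{j'} w_{j'} x_{j'}^{(i)} + b - t^{(i)} \Big). \tag{7}$$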
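It’s possible to simplify this a bit: notice that part of the term in
parentheses is simply the prediction. The partial derivatives can be rewritten
as:
$$\frac{\partial \mathcal{J}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} \left( y^{(i)} - t^{(i)} \right) \tag{8}$$
$$\frac{\partial \mathcal{J}}{\partial b} = \frac{1}{N} \sum_{i=1}^N \left( y^{(i)} - t^{(i)} \right). \tag{9}$$
[Margin note: It’s always a good idea to try to simplify equations by finding
familiar terms.]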
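As a rough sketch, Eqns. 8 and 9 translate into plain Python as follows. The function names and the toy data are assumptions for illustration. The comparison against a central finite difference is one common way to test a derivative; it is added here as an illustration and is not a method these notes describe at this point.

```python
def predict(w, b, x):
    """Linear model of Eqn. 1."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def cost(w, b, X, T):
    """Average squared error cost of Eqn. 5."""
    return sum(0.5 * (predict(w, b, x) - t) ** 2 for x, t in zip(X, T)) / len(X)


def gradients(w, b, X, T):
    """Partial derivatives of the cost (Eqns. 8 and 9)."""
    N = len(X)
    residuals = [predict(w, b, x) - t for x, t in zip(X, T)]
    dJ_dw = [sum(x[j] * r for x, r in zip(X, residuals)) / N
             for j in range(len(w))]
    dJ_db = sum(residuals) / N
    return dJ_dw, dJ_db


def finite_difference_dJ_db(w, b, X, T, eps=1e-6):
    """Central-difference estimate of dJ/db (an illustrative check)."""
    return (cost(w, b + eps, X, T) - cost(w, b - eps, X, T)) / (2 * eps)


# Hypothetical data; the bias is set so large that every prediction
# overshoots its target, so every residual is positive.
X = [[0.0, 1.0], [1.0, 0.5], [2.0, -1.0]]
T = [1.0, 3.0, 5.0]
w, b = [0.5, -0.3], 10.0

dJ_dw, dJ_db = gradients(w, b, X, T)
dJ_db_estimate = finite_difference_dJ_db(w, b, X, T)
print(dJ_db, dJ_db_estimate)
```

With these numbers the residuals are 8.7, 7.35 and 6.3, so ∂J/∂b is their mean, 7.45, which is positive: decreasing the bias decreases the cost.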
Now, it’s good practice to do a sanity check of the derivatives. For instance,
suppose we overestimated all of the targets. Then we should be able to
improve the predictions by decreasing the bias, while holding all of the
weights fixed. Does this work out mathematically? Well, the residuals
y(i) − t(i) will be positive, so based on Eqn. 9, ∂J/∂b will be positive. This
means increasing the bias will increase J, and decreasing the bias will
decrease J, which matches up with our expectation. So Eqn. 9 is plausible.
Try to come up with a similar sanity check for ∂J/∂wj.
[Margin note: Later in this course, we’ll introduce a more powerful way to
test partial derivative computations, but you should still get used to doing
sanity checks on all your computations!]
Now how do we use these partial derivatives? Let’s discuss the two
methods which we will use throughout the course.
3.1 Direct solution
One way to compute the minimum of a function is to set the partial derivatives
to zero. Recall from single variable calculus that (assuming a function
is differentiable) the minimum x* of a function f has the property that the
derivative df/dx is zero at x = x*. Note that the converse is not true: if
df/dx = 0, then x* might be a maximum or an inflection point, rather than
a minimum. But the minimum can only occur at points that have derivative
zero.
An analogous result holds in the multivariate case: if f is differentiable,
then all of the partial derivatives ∂f /∂xi are zero at the minimum. The
intuition is simple: if ∂f /∂xi is positive, then one can decrease f slightly
by decreasing xi slightly. Conversely, if ∂f /∂xi is negative, then one can
decrease f slightly by increasing xi slightly. In either case, this implies we’re
not at the minimum. Therefore, if the minimum exists (i.e. f doesn’t keep
decreasing as x goes to infinity), it occurs at a critical point, i.e. a point
where the partial derivatives are zero. This gives us a strategy for finding
minima: set the partial derivatives to zero, and solve for the parameters.
This method is known as direct solution.
Let’s apply this to linear regression. For simplicity, let’s assume the
model doesn’t have a bias term. (We actually don’t lose anything by getting
rid of the bias. Just add a “dummy” input x0 which always takes the value
1; then the weight w0 acts as a bias.) We simplify Eqn. 6 to remove the
bias, and set the partial derivatives to zero:
$$\frac{\partial \mathcal{J}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} \Big( \sum_{j'=1}^D w_{j'} x_{j'}^{(i)} - t^{(i)} \Big) = 0 \tag{10}$$
Since we’re trying to solve for the weights, let’s pull these out:
$$\frac{\partial \mathcal{J}}{\partial w_j} = \frac{1}{N} \sum_{j'=1}^D \Big( \sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)} \Big) w_{j'} - \frac{1}{N} \sum_{i=1}^N x_j^{(i)} t^{(i)} = 0 \tag{11}$$
The reason that this formula gives the direction of steepest ascent is beyond
the scope of this course. (You would learn about it in a multivariable
calculus class.) But this suggests that to decrease a function as quickly
as possible, we should update the parameters in the direction opposite the
gradient.
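We can formalize this using the following update rule, which is known
as gradient descent:
$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial \mathcal{J}}{\partial \mathbf{w}}, \tag{14}$$
or in terms of coordinates,
$$w_j \leftarrow w_j - \alpha \frac{\partial \mathcal{J}}{\partial w_j}. \tag{15}$$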
The symbol ← means that the left-hand side is updated to take the value
on the right-hand side; the constant α is known as a learning rate. The
larger it is, the larger a step we take. We’ll talk in much more detail later
about how to choose a learning rate, but in general it’s good to choose a
small value such as 0.01 or 0.001. If we plug in the formula for the partial
derivatives of the regression model (Eqn. 8), we get the update rule:
$$w_j \leftarrow w_j - \alpha \frac{1}{N} \sum_{i=1}^N x_j^{(i)} \left( y^{(i)} - t^{(i)} \right) \tag{16}$$
[Margin note: In practice, we rarely if ever go through this last step. From a
software engineering perspective, it’s better to write our code in a modular
way, where one function computes the gradient, and another function
implements gradient descent, taking the gradient as given.]
So we just repeat this update lots of times. What does gradient descent
give us in the end? For analyzing iterative algorithms, it’s useful to look for
fixed points, i.e. points where the iterate doesn’t change. By inspecting
Eqn. 14, setting the left-hand side equal to the right-hand side, we see that
the fixed points occur where ∂J/∂w = 0. Since we know the gradient
must be zero at the optimum, this is an encouraging sign that maybe it will
converge to the optimum. But there are lots of things that could go wrong,
such as divergence or local optima; we’ll look at these in more detail in a
later lecture.
[Margin note: Lecture 9 discusses optimization issues.]
You might ask: by setting the partial derivatives to zero, we compute the
exact solution. With gradient descent, we never actually reach the optimum,
but merely approach it gradually. Why, then, would we ever prefer gradient
descent? Two reasons:
For these reasons, gradient descent will be our workhorse throughout the
course. We will use it to train almost all of our models, with the exception
of a handful for which we can derive exact solutions.
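The gradient descent update of Eqns. 14 through 16 can be sketched in plain Python as below. Following the modular style suggested in the margin note, one function computes the gradient and another runs the updates, taking the gradient function as given. The function names, the toy data, and the settings α = 0.1 and 5000 steps are assumptions chosen for this small example; the bias is updated in the same way as the weights, using Eqn. 9.

```python
def gradients(w, b, X, T):
    """Partial derivatives of the cost (Eqns. 8 and 9)."""
    N = len(X)
    residuals = [sum(wj * xj for wj, xj in zip(w, x)) + b - t
                 for x, t in zip(X, T)]
    dJ_dw = [sum(x[j] * r for x, r in zip(X, residuals)) / N
             for j in range(len(w))]
    dJ_db = sum(residuals) / N
    return dJ_dw, dJ_db


def gradient_descent(grad_fn, w, b, alpha, num_steps):
    """Repeat the update of Eqn. 15, treating grad_fn as a black box."""
    for _ in range(num_steps):
        dJ_dw, dJ_db = grad_fn(w, b)
        w = [wj - alpha * gj for wj, gj in zip(w, dJ_dw)]
        b = b - alpha * dJ_db
    return w, b


# Hypothetical one-dimensional dataset generated by t = 2x + 1.
X = [[0.0], [1.0], [2.0]]
T = [1.0, 3.0, 5.0]

w, b = gradient_descent(lambda w, b: gradients(w, b, X, T),
                        w=[0.0], b=0.0, alpha=0.1, num_steps=5000)
print(w, b)
```

On this dataset the iterates approach the fixed point (w, b) = (2, 1), where both partial derivatives vanish. A much larger learning rate would make the same loop diverge, which is one of the failure modes mentioned above.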
4 Vectorization
Now it’s time to bring in linear algebra. We’re going to rewrite the linear
regression model, as well as both solution methods, in terms of operations
on matrices and vectors. This process is known as vectorization. There
are two reasons for doing this:
1. The formulas can be much simpler, more compact, and more readable
in this form.
2. Vectorized code can be much faster than explicit for-loops, for several
reasons.
[Margin note: Vectorization takes a lot of practice to get used to. We’ll cover
a lot of examples in the first few weeks of the course. I’d recommend
practicing these until they start to feel natural.]
First, we need to represent the data and model parameters in the form of
matrices and vectors. If we have N training examples, each D-dimensional,
we will represent the inputs as an N × D matrix X. Each row of X
corresponds to a training example, and each column corresponds to a single
input dimension. The weights are represented as a D-dimensional vector
w, and the targets are represented as an N-dimensional vector t.
[Margin note: In general, matrices will be denoted with capital boldface,
vectors with lowercase boldface, and scalars with plain type.]
The predictions are computed using a matrix-vector product
$$\mathbf{y} = \mathbf{X}\mathbf{w} + b\mathbf{1}, \tag{17}$$
where 1 denotes a vector of all ones. We can express the cost function in
vectorized form:
$$\mathcal{J} = \frac{1}{2N} \|\mathbf{y} - \mathbf{t}\|^2 \tag{18}$$
$$= \frac{1}{2N} \|\mathbf{X}\mathbf{w} + b\mathbf{1} - \mathbf{t}\|^2. \tag{19}$$
[Margin note: You should stop now and try to show that these equations are
equivalent to Eqns. 3–5. The only way you get comfortable with this is by
practicing.]
Note that this is considerably simpler than Eqn. 5. Even more importantly,
it saves us from having to explicitly sum over the indices i and j. As our
models get more complicated, we would run out of convenient letters to use
as indices if we didn’t vectorize.
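The equivalence between the summation form (Eqn. 5) and the vectorized form (Eqn. 19) can be checked numerically. The sketch below assumes NumPy is available; the function names and the random test data are illustrative assumptions.

```python
import numpy as np


def cost_loops(w, b, X, t):
    """Cost of Eqn. 5, with explicit sums over examples i and inputs j."""
    N, D = X.shape
    total = 0.0
    for i in range(N):
        y_i = b
        for j in range(D):
            y_i += w[j] * X[i, j]
        total += 0.5 * (y_i - t[i]) ** 2
    return total / N


def cost_vectorized(w, b, X, t):
    """Cost of Eqn. 19: (1 / 2N) * ||Xw + b1 - t||^2."""
    N = X.shape[0]
    y = X @ w + b  # Eqn. 17; adding the scalar b broadcasts like b * 1
    return np.sum((y - t) ** 2) / (2 * N)


# Hypothetical random data: N = 50 examples, D = 3 input dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = rng.normal(size=50)
w = rng.normal(size=3)
b = 0.7

print(cost_loops(w, b, X, t), cost_vectorized(w, b, X, t))
```

The two functions agree up to floating-point rounding, and the vectorized version needs no index bookkeeping at all.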
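Now let’s revisit the exact solution for linear regression. We derived
a system of linear equations, with coefficients
$$A_{jj'} = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)} \quad \text{and} \quad c_j = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} t^{(i)}.$$
In terms of linear algebra, we can write these as the matrix
$\mathbf{A} = \frac{1}{N} \mathbf{X}^\top \mathbf{X}$ and the vector
$\mathbf{c} = \frac{1}{N} \mathbf{X}^\top \mathbf{t}$. The solution to the
linear system Aw = c is given by $\mathbf{w} = \mathbf{A}^{-1} \mathbf{c}$
(assuming A is invertible), so this gives us a formula for the optimal weights:
$$\mathbf{w} = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{t}. \tag{20}$$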
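Here is a minimal NumPy sketch of the direct solution in Eqn. 20. Two choices in it are assumptions rather than something these notes prescribe: the system is solved with `np.linalg.solve` instead of forming the inverse explicitly (the result is the same when XᵀX is invertible), and the bias is handled with the dummy input of ones from Section 3.1. The function name and the synthetic data are also illustrative.

```python
import numpy as np


def fit_direct(X, t):
    """Solve (X^T X) w = X^T t; the 1/N factors in A and c cancel."""
    A = X.T @ X
    c = X.T @ t
    return np.linalg.solve(A, c)


# Hypothetical noiseless data generated from known weights.
rng = np.random.default_rng(0)
N = 100
inputs = rng.normal(size=(N, 2))
X = np.hstack([np.ones((N, 1)), inputs])  # dummy input x_0 = 1 acts as bias
true_w = np.array([1.0, 2.0, -3.0])       # (bias, w_1, w_2)
t = X @ true_w

w = fit_direct(X, t)
print(w)
```

Because the targets are generated without noise, the recovered weights match `true_w` up to rounding error.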
5 Feature mappings
Linear regression might sound pretty limited. What if the true relationship
between inputs and targets is nonlinear? Fortunately, there’s an easy way to
use linear regression to learn nonlinear dependencies: use a feature mapping.
I’ll introduce this by way of an example. Suppose we want to approximate the
target with a cubic polynomial in a scalar input x. In other words, we would
compute the predictions as:
$$y = w_3 x^3 + w_2 x^2 + w_1 x + w_0. \tag{22}$$
This setting is known as polynomial regression.
Let’s use the squared error loss function, just as with ordinary linear
regression. The important thing to notice is that algorithmically, polynomial
regression is no different from linear regression. We can apply any of the
linear regression algorithms described above, using (x, x^2, x^3) as the inputs.
Mathematically, we define a feature mapping ψ, in this case
$$\psi(x) = \begin{pmatrix} 1 \\ x \\ x^2 \\ x^3 \end{pmatrix}, \tag{23}$$
and compute the predictions as $y = \mathbf{w}^\top \psi(x)$ instead of
$\mathbf{w}^\top \mathbf{x}$. The rest of the algorithm is completely unchanged.
[Margin note: Just as in Section 3.1, we’re including a constant feature to
account for the bias term, since this simplifies the notation.]
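The cubic feature map of Eqn. 23 can be sketched in NumPy as follows. The function name `cubic_features`, the synthetic cubic target, and the use of the direct solution to fit the weights are assumptions for illustration; any of the linear regression algorithms above could be applied to the same feature matrix.

```python
import numpy as np


def cubic_features(x):
    """Map each scalar input to psi(x) = (1, x, x^2, x^3), as in Eqn. 23."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)


# Hypothetical noiseless targets from a known cubic.
x = np.linspace(-1.0, 1.0, 20)
t = 0.5 * x**3 - x + 2.0

Psi = cubic_features(x)                      # N x 4 feature matrix
w = np.linalg.solve(Psi.T @ Psi, Psi.T @ t)  # direct solution on features
y = Psi @ w                                  # predictions y = w^T psi(x)
print(w)
```

The fitted weights are ordered (w0, w1, w2, w3) and recover the coefficients (2, −1, 0, 0.5) of the generating cubic up to rounding error, even though the fitting step itself is ordinary linear regression.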
Feature maps are a useful tool, but they’re not a silver bullet, for several
reasons:
• The features must be known in advance. It’s not always easy to pick
good features, and up until very recently, feature engineering would
take up most of the time and ingenuity in building a practical machine
learning system.
• In high dimensions, the feature representations can get very large. For
instance, the number of terms in a cubic polynomial is cubic in the
dimension!
[Margin note: It’s possible to work with polynomial feature maps efficiently
using something called the “kernel trick,” but that’s beyond the scope of
this course.]
In this course, rather than construct feature maps, we will use neural
networks to learn nonlinear predictors directly from the raw inputs. In most
cases, this eliminates the need for hand-engineering of features.
6 Generalization
We don’t just want a learning algorithm to make correct predictions on
the training examples; we’d like it to generalize to examples it hasn’t
seen before. The average squared error on novel examples is known as the
generalization error, and we’d like this to be as small as possible.
Returning to the previous example, let’s consider three different polynomial
models: (a) a linear function, or equivalently, a degree 1 polynomial;
(b) a cubic polynomial; (c) a degree-10 polynomial. The linear function may
be too simplistic to describe the data; this is known as underfitting. The
degree-10 polynomial may be able to fit every training example exactly, but
only by learning a crazy function. It would make silly predictions everywhere
except the observed data. This is known as overfitting. The cubic
polynomial is a reasonable compromise. We need to worry about both
underfitting and overfitting in pretty much every application of machine
learning.
[Margin note: The terms underfitting and overfitting are a bit misleading,
since they suggest the two phenomena are mutually exclusive. In fact, most
machine learning models suffer from both problems simultaneously.]
The degree of the polynomial is an example of a hyperparameter.
Hyperparameters are values that we can’t include in the training procedure
itself, but which we need to set using some other means.
[Margin note: Statisticians prefer the term metaparameter since
hyperparameter has a different meaning in statistics.]
In practice, we normally tune hyperparameters by partitioning the dataset
into three different subsets:
1. The training set is used to train the model.
2. The validation set is used to estimate the generalization error of each
hyperparameter setting.
3. The test set is used at the very end, to estimate the generalization
error of the final model, once all hyperparameters have been chosen.
We will talk about validation and generalization in a lot more detail later
on in this course.
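As a rough illustration of tuning a hyperparameter with held-out data, the sketch below chooses among the polynomial degrees 1, 3 and 10. Everything specific in it is an assumption: the synthetic sine-shaped data, the 30/15/15 split into training, validation and test subsets, and the helper names. It fits with `np.linalg.lstsq` rather than Eqn. 20, because the degree-10 feature matrix can be poorly conditioned.

```python
import numpy as np

# Hypothetical dataset: noisy samples of a smooth nonlinear function.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
t = np.sin(3.0 * x) + 0.1 * rng.normal(size=60)

# The samples are already in random order, so slicing is a random split.
x_train, t_train = x[:30], t[:30]
x_val, t_val = x[30:45], t[30:45]
x_test, t_test = x[45:], t[45:]


def fit(x, t, degree):
    """Least-squares fit on the features (1, x, ..., x^degree)."""
    Psi = np.vander(x, degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Psi, t, rcond=None)
    return w


def mean_squared_error(w, x, t):
    """Average squared error of the polynomial with weights w."""
    Psi = np.vander(x, len(w), increasing=True)
    return np.mean((Psi @ w - t) ** 2)


degrees = [1, 3, 10]
val_errors = {d: mean_squared_error(fit(x_train, t_train, d), x_val, t_val)
              for d in degrees}
best_degree = min(val_errors, key=val_errors.get)

# The test set is touched only once, after the degree has been chosen.
test_error = mean_squared_error(fit(x_train, t_train, best_degree),
                                x_test, t_test)
print(val_errors, best_degree, test_error)
```

Training error can only go down as the degree increases, since each model contains the previous ones, which is why the degree has to be chosen on data the fit has not seen.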