Stanford University Mathematical Foundations of Machine Learning 57-64
constraints with strict inequality (as needed for Slater’s condition to hold) (see footnote 8).
Recall that in our discussion of the Lagrangian-based formulation of the primal problem,
we stated that the inner maximization, maxλ:λi ≥0 L(x, λ), was constructed in such a way
that the infeasible region of f was “carved away”, leaving only points in the feasible region
as candidate minima. The same idea of using penalties to ensure that minimizers stay in the
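feasible region is the basis of barrier-based optimization. Specifically, if B(z) is the barrier
function

$$B(z) = \begin{cases} 0 & z < 0 \\ \infty & z \ge 0, \end{cases}$$

then the primal problem is equivalent to

$$\min_x \; f(x) + \sum_{i=1}^m B(g_i(x)). \qquad (4)$$

When g_i(x) < 0 for every i, the objective of the problem is simply f(x); infeasible points are “carved
away” using the barrier function B(z).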
While conceptually correct, optimization using the barrier function B(z) directly is numerically
difficult. To ameliorate this, the log-barrier optimization algorithm approximates
the solution to (4) by solving the unconstrained problem,

$$\min_x \; f(x) - \frac{1}{t} \sum_{i=1}^m \log(-g_i(x)),$$
for some fixed t > 0. Here, the function −(1/t) log(−z) ≈ B(z), and the accuracy of the
approximation increases as t → ∞. Rather than using a large value of t in order to obtain
a good approximation, however, the log-barrier algorithm works by solving a sequence of
unconstrained optimization problems, increasing t each time, and using the solution of the
previous unconstrained optimization problem as the initial point for the next unconstrained
optimization. Furthermore, at each point in the algorithm, the primal solution points stay
strictly in the interior of the feasible region:
Footnote 8: For more information on finding feasible starting points for barrier algorithms, see [1], pages 579-585.
For inequality-problems where the primal problem is feasible but not strictly feasible, primal-dual interior
point methods are applicable, also described in [1], pages 609-615.
Log-barrier optimization
One might expect that as t increases, the difficulty of solving each unconstrained minimiza-
tion problem also increases due to numerical issues or ill-conditioning of the optimization
problem. Surprisingly, Nesterov and Nemirovski showed in 1994 that this is not the case
for certain types of barrier functions, including the log-barrier; in particular, by using an
appropriate barrier function, one obtains a general convex optimization algorithm which
takes time polynomial in the dimensionality of the optimization variables and the desired
accuracy!
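
The outer loop described above (solve an unconstrained problem, increase t, warm-start from the previous solution) can be illustrated on a toy problem. The following is a minimal sketch, not the notes' own algorithm listing: the one-dimensional problem (minimize (x - 2)^2 subject to x - 1 <= 0), the multiplier `mu`, the stopping value `t_max`, and both function names are assumptions chosen for illustration. The inner solver is a damped Newton method whose step is halved until the iterate stays strictly feasible and the barrier objective does not increase.

```python
import math


def solve_barrier_subproblem(x, t, tol=1e-10, max_iter=50):
    """Minimize phi_t(x) = (x - 2)^2 - (1/t) * log(1 - x) by damped Newton.

    The starting point x must be strictly feasible (x < 1).
    """
    def phi(z):
        return (z - 2.0) ** 2 - math.log(1.0 - z) / t

    for _ in range(max_iter):
        grad = 2.0 * (x - 2.0) + 1.0 / (t * (1.0 - x))
        hess = 2.0 + 1.0 / (t * (1.0 - x) ** 2)
        step = -grad / hess
        if abs(step) < tol:
            break
        # Backtrack: keep the iterate strictly inside the feasible region
        # and require that the barrier objective does not increase.
        s = 1.0
        while s > 1e-12 and (x + s * step >= 1.0 or phi(x + s * step) > phi(x)):
            s *= 0.5
        if s <= 1e-12:
            break
        x += s * step
    return x


def log_barrier_minimize(x0, t=1.0, mu=10.0, t_max=1e6):
    """Solve a sequence of barrier subproblems, increasing t each time."""
    x = x0
    while t <= t_max:
        x = solve_barrier_subproblem(x, t)  # warm start from previous solution
        t *= mu
    return x


x_star = log_barrier_minimize(0.0)
print(x_star)  # close to, but strictly below, the constrained minimizer x = 1
```

For this toy problem the subproblem minimizer sits roughly 1/(2t) inside the boundary, which is why every iterate remains strictly feasible while approaching x = 1 as t grows.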
Also, if you find this material fascinating, make sure to check out Stephen Boyd’s class,
EE364: Convex Optimization I, which will be offered during the Winter Quarter. The
textbook for the class (listed as [1] in the References) has a wealth of information about
convex optimization and is available for browsing online.
References
[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge UP, 2004.
Online: http://www.stanford.edu/~boyd/cvxbook/
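$$\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \\
\text{subject to} \quad & y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \quad i = 1, \ldots, m, \\
& \xi_i \ge 0, \quad i = 1, \ldots, m.
\end{aligned}$$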
First, we put this into our standard form, with “≤ 0” inequality constraints and no equality
constraints. That is,
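$$\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i \\
\text{subject to} \quad & 1 - \xi_i - y^{(i)}(w^T x^{(i)} + b) \le 0, \quad i = 1, \ldots, m, \\
& -\xi_i \le 0, \quad i = 1, \ldots, m.
\end{aligned}$$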
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i \left(1 - \xi_i - y^{(i)}(w^T x^{(i)} + b)\right) - \sum_{i=1}^m \beta_i \xi_i.$$
To get the dual problem in the form shown in the lecture notes, however, we still have a
little more work to do. In particular,
Footnote 10: Here, it is important to note that (w, b, ξ) collectively play the role of the x primal variables. Similarly,
(α, β) collectively play the role of the λ dual variables used for inequality constraints. There are no “ν” dual
variables here since there are no affine constraints in this problem.
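1. Eliminating the primal variables. To eliminate the primal variables from the dual
problem, we compute $\theta_D(\alpha, \beta)$ by noticing that

$$\theta_D(\alpha, \beta) = \min_{w,b,\xi} L(w, b, \xi, \alpha, \beta)$$

is an unconstrained minimization of a differentiable function that is convex in $(w, b, \xi)$. For fixed
$(\alpha, \beta)$, any minimizer $(\hat{w}, \hat{b}, \hat{\xi})$ must therefore satisfy the stationarity conditions

$$\nabla_w L = \hat{w} - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0, \qquad (5)$$
$$\frac{\partial L}{\partial b} = -\sum_{i=1}^m \alpha_i y^{(i)} = 0, \qquad (6)$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0. \qquad (7)$$

The Lagrangian is linear in $b$ and $\xi$, so unless (6) and (7) hold the minimum is $-\infty$; these two
conditions therefore become constraints of the dual problem.
Adding (6) and (7) to the constraints of our dual optimization problem, we obtain

$$\begin{aligned}
\theta_D(\alpha, \beta) &= L(\hat{w}, \hat{b}, \hat{\xi}) \\
&= \frac{1}{2}\|\hat{w}\|^2 + C \sum_{i=1}^m \hat{\xi}_i + \sum_{i=1}^m \alpha_i \left(1 - \hat{\xi}_i - y^{(i)}(\hat{w}^T x^{(i)} + \hat{b})\right) - \sum_{i=1}^m \beta_i \hat{\xi}_i \\
&= \frac{1}{2}\|\hat{w}\|^2 + C \sum_{i=1}^m \hat{\xi}_i + \sum_{i=1}^m \alpha_i \left(1 - \hat{\xi}_i - y^{(i)}(\hat{w}^T x^{(i)})\right) - \sum_{i=1}^m \beta_i \hat{\xi}_i \\
&= \frac{1}{2}\|\hat{w}\|^2 + \sum_{i=1}^m \alpha_i \left(1 - y^{(i)}(\hat{w}^T x^{(i)})\right).
\end{aligned}$$

Here the third equality uses (6) to drop the terms in $\hat{b}$, and the fourth uses (7) to drop the terms in $\hat{\xi}$.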
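Finally, substituting $\hat{w} = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}$, which is the stationarity condition in $w$, we get

$$\begin{aligned}
\frac{1}{2}\|\hat{w}\|^2 + \sum_{i=1}^m \alpha_i \left(1 - y^{(i)}(\hat{w}^T x^{(i)})\right)
&= \sum_{i=1}^m \alpha_i + \frac{1}{2}\|\hat{w}\|^2 - \hat{w}^T \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} \\
&= \sum_{i=1}^m \alpha_i + \frac{1}{2}\|\hat{w}\|^2 - \|\hat{w}\|^2 \\
&= \sum_{i=1}^m \alpha_i - \frac{1}{2}\|\hat{w}\|^2 \\
&= \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle.
\end{aligned}$$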
Therefore, our dual problem (with no more primal variables) is simply
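$$\begin{aligned}
\max_{\alpha,\beta} \quad & \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\
\text{subject to} \quad & \alpha_i \ge 0, \quad i = 1, \ldots, m, \\
& \beta_i \ge 0, \quad i = 1, \ldots, m, \\
& \alpha_i + \beta_i = C, \quad i = 1, \ldots, m, \\
& \sum_{i=1}^m \alpha_i y^{(i)} = 0.
\end{aligned}$$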
2. KKT complementarity. KKT complementarity requires that for any primal optimal
$(w^*, b^*, \xi^*)$ and dual optimal $(\alpha^*, \beta^*)$,

$$\alpha_i^* \left(1 - \xi_i^* - y^{(i)}(w^{*T} x^{(i)} + b^*)\right) = 0$$
$$\beta_i^* \xi_i^* = 0$$

for $i = 1, \ldots, m$. From the first condition, we see that if $\alpha_i^* > 0$, then the
product can be zero only if $1 - \xi_i^* - y^{(i)}(w^{*T} x^{(i)} + b^*) = 0$. It follows that

$$y^{(i)}(w^{*T} x^{(i)} + b^*) \le 1$$

since $\xi_i^* \ge 0$ by primal feasibility. Similarly, if $\beta_i^* > 0$, then $\xi_i^* = 0$ to ensure
complementarity. From the primal constraint, $y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i$, it follows that

$$y^{(i)}(w^{*T} x^{(i)} + b^*) \ge 1.$$

Finally, since $\beta_i^* > 0$ is equivalent to $\alpha_i^* < C$ (since $\alpha_i^* + \beta_i^* = C$), we can summarize
the KKT conditions as follows:

$$\begin{aligned}
\alpha_i^* = 0 \quad &\Rightarrow \quad y^{(i)}(w^{*T} x^{(i)} + b^*) \ge 1, \\
0 < \alpha_i^* < C \quad &\Rightarrow \quad y^{(i)}(w^{*T} x^{(i)} + b^*) = 1, \\
\alpha_i^* = C \quad &\Rightarrow \quad y^{(i)}(w^{*T} x^{(i)} + b^*) \le 1.
\end{aligned}$$
3. Simplification. We can tidy up our dual problem slightly by observing that each pair
of constraints of the form
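$$\beta_i \ge 0, \qquad \alpha_i + \beta_i = C$$

is equivalent to the single constraint, $\alpha_i \le C$; that is, if we solve the optimization
problem

$$\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, m, \qquad (8) \\
& \sum_{i=1}^m \alpha_i y^{(i)} = 0.
\end{aligned}$$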
and subsequently set βi = C − αi , then it follows that (α, β) will be optimal for the
previous dual problem above. This last form, indeed, is the form of the soft-margin
SVM dual given in the lecture notes.
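
The three steps above can be checked numerically on a tiny data set. The following is a minimal sketch under stated assumptions: the two-point one-dimensional data set, the value C = 1, the candidate multipliers, and all function names are hypothetical and chosen only for illustration. It evaluates the dual objective in the double-sum form, compares it with the equivalent form using w = sum_i alpha_i y_i x_i, recovers beta_i = C - alpha_i, and reports which points violate the summarized KKT conditions.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def recover_w(alpha, X, y):
    """w = sum_i alpha_i * y_i * x_i (stationarity of the Lagrangian in w)."""
    dim = len(X[0])
    return [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
            for k in range(dim)]


def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j>."""
    m = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(m) for j in range(m))
    return sum(alpha) - 0.5 * quad


def is_dual_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    return in_box and abs(dot(alpha, y)) <= tol


def kkt_violations(alpha, X, y, b, C, tol=1e-6):
    """Indices i whose margin y_i (w^T x_i + b) contradicts the KKT summary."""
    w = recover_w(alpha, X, y)
    bad = []
    for i, (a, xi, yi) in enumerate(zip(alpha, X, y)):
        margin = yi * (dot(w, xi) + b)
        if a <= tol:                      # alpha_i = 0      => margin >= 1
            ok = margin >= 1.0 - tol
        elif a >= C - tol:                # alpha_i = C      => margin <= 1
            ok = margin <= 1.0 + tol
        else:                             # 0 < alpha_i < C  => margin == 1
            ok = abs(margin - 1.0) <= tol
        if not ok:
            bad.append(i)
    return bad


# Hypothetical toy data: one positive point at +1, one negative point at -1.
X = [[1.0], [-1.0]]
y = [1.0, -1.0]
C = 1.0
alpha = [0.5, 0.5]                  # candidate dual solution
beta = [C - a for a in alpha]       # recovered as in step 3

w = recover_w(alpha, X, y)
print(is_dual_feasible(alpha, y, C))
print(dual_objective(alpha, X, y), sum(alpha) - 0.5 * dot(w, w))
print(kkt_violations(alpha, X, y, b=0.0, C=C))
```

On this toy problem the dual objective reduces to 2a - 2a^2 when both multipliers equal a, so a = 0.5 is the maximizer, and the two forms of the objective agree.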
Hidden Markov Models Fundamentals
Daniel Ramage
December 1, 2007
Abstract
How can we apply machine learning to data that is represented as a
sequence of observations over time? For instance, we might be interested
in discovering the sequence of words that someone spoke based on an
audio recording of their speech. Or we might be interested in annotating
a sequence of words with their part-of-speech tags. These notes provide a
thorough mathematical introduction to the concept of Markov Models,
a formalism for reasoning about states over time, and Hidden Markov
Models, where we wish to recover a series of states from a series of
observations. The final section includes some pointers to resources that
present this material from other perspectives.
1 Markov Models
Given a set of states $S = \{s_1, s_2, \ldots, s_{|S|}\}$ we can observe a series over time
$\vec{z} \in S^T$. For example, we might have the states from a weather system $S =
\{sun, cloud, rain\}$ with $|S| = 3$ and observe the weather over a few days $\{z_1 =
s_{sun}, z_2 = s_{cloud}, z_3 = s_{cloud}, z_4 = s_{rain}, z_5 = s_{cloud}\}$ with $T = 5$.
The observed states of our weather example represent the output of a random
process over time. Without some further assumptions, state sj at time t could
be a function of any number of variables, including all the states from times 1
to t−1 and possibly many others that we don't even model. However, we will
make two Markov assumptions that will allow us to tractably reason about
time series.
The limited horizon assumption is that the probability of being in a
state at time t depends only on the state at time t − 1. The intuition underlying
this assumption is that the state at time t represents enough summary of the
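past to reasonably predict the future. Formally:

$$P(z_t \mid z_{t-1}, z_{t-2}, \ldots, z_1) = P(z_t \mid z_{t-1})$$

The stationary process assumption is that the conditional distribution over the
next state given the current state does not change over time. Formally:

$$P(z_t \mid z_{t-1}) = P(z_2 \mid z_1); \quad t \in 2 \ldots T$$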
As a convention, we will also assume that there is an initial state and initial
observation $z_0 \equiv s_0$, where $s_0$ represents the initial probability distribution over
states at time 0. This notational convenience allows us to encode our belief
about the prior probability of seeing the first real state $z_1$ as $P(z_1 \mid z_0)$. Note
that $P(z_t \mid z_{t-1}, \ldots, z_1) = P(z_t \mid z_{t-1}, \ldots, z_1, z_0)$ because we've defined $z_0 = s_0$ for
any state sequence. (Other presentations of HMMs sometimes represent these
prior beliefs with a vector $\pi \in \mathbb{R}^{|S|}$.)
We parametrize these transitions by defining a state transition matrix $A \in
\mathbb{R}^{(|S|+1) \times (|S|+1)}$. The value $A_{ij}$ is the probability of transitioning from state i
to state j at any time t. For our sun and rain example, we might have the following
transition matrix:
Note that these numbers (which I made up) represent the intuition that the
weather is self-correlated: if it's sunny it will tend to stay sunny, cloudy will
stay cloudy, etc. This pattern is common in many Markov models and can
be observed as a strong diagonal in the transition matrix. Note that in this
example, our initial state s0 shows uniform probability of transitioning to each
of the three states in our weather system.
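
To make the role of the transition matrix concrete, here is a minimal sketch that scores the five-day weather sequence from the example. The numeric entries of `A` below are hypothetical values chosen only to match the description above (a strong diagonal, and a uniform transition out of the initial state s0); the function name `sequence_probability` is also an assumption. Under the limited horizon assumption, the chain rule gives the probability of a sequence as the product of the one-step transition probabilities, starting from z0 = s0.

```python
# Hypothetical transition probabilities: A[i][j] = P(z_t = j | z_{t-1} = i).
A = {
    "s0":    {"s0": 0.0, "sun": 1 / 3, "cloud": 1 / 3, "rain": 1 / 3},
    "sun":   {"s0": 0.0, "sun": 0.8,   "cloud": 0.1,   "rain": 0.1},
    "cloud": {"s0": 0.0, "sun": 0.2,   "cloud": 0.6,   "rain": 0.2},
    "rain":  {"s0": 0.0, "sun": 0.1,   "cloud": 0.2,   "rain": 0.7},
}


def sequence_probability(states, transitions, start="s0"):
    """P(z_1, ..., z_T) = prod_t P(z_t | z_{t-1}), with z_0 fixed to `start`."""
    prob = 1.0
    prev = start
    for z in states:
        prob *= transitions[prev][z]
        prev = z
    return prob


weather = ["sun", "cloud", "cloud", "rain", "cloud"]
print(sequence_probability(weather, A))  # (1/3) * 0.1 * 0.6 * 0.2 * 0.2
```

Each row of `A` sums to one because it is a conditional distribution over the next state; the column for s0 is all zeros because the chain never returns to the initial state.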