Murphy Book Solution
Contents

1 Introduction
1.1 Constitution of this document
1.2 On Machine Learning: A Probabilistic Perspective
1.3 What is this document?
1.4 Updating log

2 Probability
2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
2.2 Legal reasoning
2.3 Variance of a sum
2.4 Bayes rule for medical diagnosis
2.5 The Monty Hall problem (the dilemma of three doors)
2.6 Conditional independence
2.7 Pairwise independence does not imply mutual independence
2.8 Conditional independence iff joint factorizes
2.9 Conditional independence*
2.10 Deriving the inverse gamma density
2.11 Normalization constant for a 1D Gaussian
2.12 Expressing mutual information in terms of entropies
2.13 Mutual information for correlated normals
2.14 A measure of correlation

4 Gaussian models
4.1 Uncorrelated does not imply independent
4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian

5 Bayesian statistics
5.1 Proof that a mixture of conjugate priors is indeed conjugate
5.2 Optimal threshold on classification probability
5.3 Reject option in classifiers
5.4 More reject options
5.5 Newsvendor problem
5.6 Bayes factors and ROC curves
5.7 Bayes model averaging helps predictive accuracy
5.8 MLE and model selection for a 2d discrete distribution
5.9 Posterior median is optimal estimate under L1 loss
5.10 Decision rule for trading off FPs and FNs

6 Frequentist statistics

7 Linear regression
7.1 Behavior of training set error with increasing sample size
7.2 Multi-output linear regression
7.3 Centering and ridge regression
7.4 MLE for σ² for linear regression
7.5 MLE for the offset term in linear regression
7.6 MLE for simple linear regression
7.7 Sufficient statistics for online linear regression
7.8 Bayesian linear regression in 1d with known σ²
7.9 Generative model for linear regression
7.10 Bayesian linear regression using the g-prior

8 Logistic regression
8.1 Spam classification using logistic regression
8.2 Spam classification using naive Bayes
8.3 Gradient and Hessian of log-likelihood for logistic regression
8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression
8.5 Symmetric version of l2 regularized multinomial logistic regression
8.6 Elementary properties of l2 regularized logistic regression
8.7 Regularizing separate terms in 2d logistic regression

13.7 Prior for the Bernoulli rate parameter in the spike and slab model
13.8 Deriving E step for GSM prior
13.9 EM for sparse probit regression with Laplace prior
13.10 GSM representation of group lasso*
13.11 Projected gradient descent for l1 regularized least squares
13.12 Subderivative of the hinge loss function
13.13 Lower bounds to convex functions

14 Kernels

15 Gaussian processes
15.1 Reproducing property

21 Variational inference
21.1 Laplace approximation to p(µ, log σ|D) for a univariate Gaussian
21.2 Laplace approximation to normal-gamma
21.3 Variational lower bound for VB for univariate Gaussian
21.4 Variational lower bound for VB for GMMs
21.5 Derivation of E[log πk]
21.6 Alternative derivation of the mean field updates for the Ising model
21.7 Forwards vs reverse KL divergence
21.8 Derivation of the structured mean field updates for FHMM
21.9 Variational EM for binary FA with sigmoid link
21.10 VB for binary FA with probit link
1 Introduction
1.1 Constitution of this document
This section should have presented the solutions to the problems in Chapter One of Machine Learning: A Probabilistic Perspective (MLAPP). Since Chapter One contains no problems, we use this section instead as an introduction to this document, i.e. a solution manual.
This document provides detailed solutions to almost all problems in the textbook MLAPP, from Chapter One to Chapter Fourteen (Chinese version) / Twenty-one (English version). We generally leave the restatement of the problems to the readers themselves.
There are two classes of problems in MLAPP: theoretical derivations and practical projects. We provide solutions to most derivation problems, apart from those that are nothing but straightforward algebra (and a few which we failed to solve). Practical problems, which are based on a MATLAB toolbox, are beyond the scope of this document.
Readers should read from a critical perspective and should not blindly believe everything I have written down. In the end, I hope that readers can provide comments and suggest revisions. Apart from correcting wrong answers, those who are good at MATLAB or LaTeX typesetting, or who are willing to participate in improving this document, are always welcome to contact me.
22/10/2017
Fangqi Li
Munich, Germany
[email protected]
[email protected]
2 Probability
2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
Denote two children by A and B.
Use
E1 : A = boy, B = girl
E2 : B = boy, A = girl
E3 : A = boy, B = boy
In question a:
$$P(E_1) = P(E_2) = P(E_3) = \frac{1}{4}$$
$$P(\text{one girl}\mid\text{one boy}) = \frac{P(E_1)+P(E_2)}{P(E_1)+P(E_2)+P(E_3)} = \frac{2}{3}$$
For question b, w.l.o.g., assume child A is the one seen:
$$P(B = \text{girl}\mid A = \text{boy}) = \frac{1}{2}$$
$$\mathrm{var}[X+Y] = E[(X+Y)^2] - E^2[X+Y] = E[X^2] - E^2[X] + E[Y^2] - E^2[Y] + 2\big(E[XY] - E[X]E[Y]\big) = \mathrm{var}[X] + \mathrm{var}[Y] + 2\,\mathrm{cov}[X,Y]$$
In the last step we sum over the possible locations of the prize.
$$h(y,z) = p(y\mid z)$$
And:
$$1 = \sum_{x,y} p(x,y\mid z) = \Big(\sum_x g(x,z)\Big)\Big(\sum_y h(y,z)\Big)$$
Thus:
$$p(x\mid z)\,p(y\mid z) = g(x,z)\,h(y,z)\Big(\sum_x g(x,z)\Big)\Big(\sum_y h(y,z)\Big) = g(x,z)\,h(y,z) = p(x,y\mid z)$$
With $x = 1/y$ and Jacobian $|dx/dy| = y^{-2}$:
$$IG(y) = Ga(x)\cdot y^{-2} = \frac{b^a}{\Gamma(a)}\Big(\frac{1}{y}\Big)^{(a-1)+2} e^{-b/y} = \frac{b^a}{\Gamma(a)}\,y^{-(a+1)}\,e^{-b/y}$$
$$I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)} = \sum_{x,y} p(x,y)\log\frac{p(x\mid y)}{p(x)} = \sum_{x,y} p(x,y)\log p(x\mid y) - \sum_x\Big(\sum_y p(x,y)\Big)\log p(x) = -H(X\mid Y) + H(X)$$
We use the weak law of large numbers in the third step and drop the
entropy of empirical distribution in the last step.
Therefore:
$$E[m] = \int x\cdot p(m=x)\,dx = \int p(m>x)\,dx = \int_0^1 (1-x)^2\,dx = \frac{1}{3}$$
3 Generative models for discrete data
Log-likelihood:
$$\ln p(D\mid\theta) = N_1\ln\theta + N_0\ln(1-\theta)$$
Prior distribution:
$$p(\theta) = \mathrm{Beta}(\theta\mid a, b)$$
Posterior distribution:
$$p(\theta\mid D) = \mathrm{Beta}(\theta\mid N_1 + a,\ N_0 + b)$$
Prediction:
$$p(x_{new}=1\mid D) = \int p(x_{new}=1\mid\theta)\,p(\theta\mid D)\,d\theta = \int\theta\,p(\theta\mid D)\,d\theta = E(\theta) = \frac{N_1+a}{N_1+a+N_0+b}$$
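As a quick numerical sanity check of the posterior predictive above, here is a minimal Python sketch (not part of the original solution); the data, the prior hyperparameters a and b, and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and Beta(a, b) prior hyperparameters (arbitrary choices).
x = rng.binomial(1, 0.7, size=50)
a, b = 2.0, 2.0
N1 = x.sum()
N0 = len(x) - N1

# Closed-form posterior predictive p(x_new = 1 | D) = E[theta | D].
closed_form = (N1 + a) / (N1 + a + N0 + b)

# Monte Carlo estimate: sample theta from the Beta posterior and average.
theta_samples = rng.beta(N1 + a, N0 + b, size=100000)
mc_estimate = theta_samples.mean()

print(closed_form, mc_estimate)  # the two numbers should agree closely
```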
$$p(\theta) \propto \mathrm{Beta}(\theta\mid 0, 0)$$
Thus:
$$\lambda = \frac{\sum_{n=1}^N x_n}{N}$$
$$p(\lambda\mid D) \propto p(\lambda)\,p(D\mid\lambda) \propto \exp(-\lambda(N+b))\cdot\lambda^{\sum_{n=1}^N x_n + a - 1} = \mathrm{Gamma}\Big(a + \sum_n x_n,\ N + b\Big)$$
Thus in question a:
$$\theta_{ML} = \frac{N}{\sum_{n=1}^N x_n}$$
We skip the other questions and state that the conjugate prior for the exponential distribution is the Gamma distribution:
$$p(\theta\mid D) \propto p(\theta)\,p(D\mid\theta) = \mathrm{Gamma}(\theta\mid a, b)\,p(D\mid\theta) = \mathrm{Gamma}\Big(\theta\,\Big|\,N + a,\ b + \sum_n x_n\Big)$$
We already have:
$$p(\theta) = \mathrm{Beta}(1, 1)$$
Thus:
$$p(N_1\mid N) = \int_0^1\binom{N}{N_1}\theta^{N_1}(1-\theta)^{N-N_1}\,d\theta = \binom{N}{N_1}B(N_1+1, N-N_1+1) = \frac{N!}{N_1!(N-N_1)!}\cdot\frac{N_1!(N-N_1)!}{(N+1)!} = \frac{1}{N+1}$$
where $B$ is the normalization constant of a Beta distribution:
$$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
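The result $p(N_1\mid N) = 1/(N+1)$ is easy to verify numerically. The following sketch (an illustration added here, not from the textbook) evaluates the Beta integral with SciPy for an arbitrary N.

```python
import numpy as np
from scipy.special import comb, beta

# With a uniform Beta(1, 1) prior, p(N1 | N) = C(N, N1) * B(N1 + 1, N - N1 + 1)
# should equal 1 / (N + 1) for every N1 in {0, ..., N}.
N = 10
p = np.array([comb(N, n1) * beta(n1 + 1, N - n1 + 1) for n1 in range(N + 1)])

print(p)                              # every entry is 1 / (N + 1) ~ 0.0909
print(np.allclose(p, 1.0 / (N + 1)))  # True
```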
In a succinct way, $\log p(x_i\mid c) = \beta_c^T\phi(x_i)$, where:
$$\phi(x_i) = (x_i, 1)^T$$
$$\beta_c = \Big(\log\frac{\theta_{c1}}{1-\theta_{c1}},\ \ldots,\ \log\frac{\theta_{cW}}{1-\theta_{cW}},\ \sum_{w=1}^W\log(1-\theta_{cw})\Big)^T$$
For question a:
$$\log\frac{p(c=1\mid x_i)}{p(c=2\mid x_i)} = \log\frac{p(c=1)\,p(x_i\mid c=1)}{p(c=2)\,p(x_i\mid c=2)} = \log\frac{p(x_i\mid c=1)}{p(x_i\mid c=2)} = \phi(x_i)^T(\beta_1 - \beta_2)$$
Hence:
$$\theta_{c=1,w} = \theta_{c=2,w}$$
For questions b and c: when N is small the naive model tends to generalize better, because the richer full model is prone to overfitting; when N is large the full model can fit the data better.
In questions d, e and f, it is assumed that looking up a value by a D-dimensional index costs O(D) time. It is then easy to work out the fitting complexity: O(ND) for the naive model and O(N·2^D) for the full model; the prediction complexity is O(CD) and O(C·2^D) respectively.
For question f:
$$p(y\mid x_v) \propto p(x_v\mid y) = \sum_{x_h} p(x_v, x_h\mid y)$$
For binary features, considering the values zero and one of $x_j$, and writing $\pi_c = p(y=c)$, $\theta_{jc} = p(x_j=1\mid y=c)$, $\theta_j = p(x_j=1)$:
$$I_j = \sum_c p(x_j=1, c)\log\frac{p(x_j=1, c)}{p(x_j=1)p(c)} + \sum_c p(x_j=0, c)\log\frac{p(x_j=0, c)}{p(x_j=0)p(c)} = \sum_c\Big[\pi_c\theta_{jc}\log\frac{\theta_{jc}}{\theta_j} + \pi_c(1-\theta_{jc})\log\frac{1-\theta_{jc}}{1-\theta_j}\Big]$$
which is Eq. 3.76.
4 Gaussian models
4.1 Uncorrelated does not imply independent
We first calculate the covariance of X and Y:
$$\mathrm{cov}(X,Y) = \iint\big(X - E(X)\big)\big(Y - E(Y)\big)\,p(X,Y)\,dX\,dY = \int_{-1}^{1} X\Big(X^2 - \frac{1}{3}\Big)p(X)\,dX = 0$$
The integral is zero since we are integrating an odd function over the range [-1, 1]; hence:
$$\rho(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sqrt{\mathrm{var}(X)\mathrm{var}(Y)}} = 0$$
Equivalently:
$$|\rho(X,Y)| \le 1$$
We also have:
$$\mathrm{cov}(X,Y) = a\cdot\mathrm{var}(X),\qquad \mathrm{var}(X)\,\mathrm{var}(Y) = a^2\cdot\mathrm{var}(X)^2$$
These two give:
$$\rho(X,Y) = \frac{a}{|a|}$$
$$p(\mu\mid a, b) \propto \exp\{a\cdot\mu^2 + b\cdot\mu\}$$
$$p(\mu\mid Y) \propto \exp\{(A+a)\cdot\mu^2 + (B+b)\cdot\mu\}$$
According to 4.195:
$$\sum_{n=1}^N(x_n-\mu)^T\Sigma^{-1}(x_n-\mu) = \sum_{n=1}^N\big(\bar{x}-\mu + (x_n-\bar{x})\big)^T\Sigma^{-1}\big(\bar{x}-\mu + (x_n-\bar{x})\big) = N(\bar{x}-\mu)^T\Sigma^{-1}(\bar{x}-\mu) + \sum_{n=1}^N(x_n-\bar{x})^T\Sigma^{-1}(x_n-\bar{x}) = N(\bar{x}-\mu)^T\Sigma^{-1}(\bar{x}-\mu) + \mathrm{Tr}\Big\{\Sigma^{-1}\sum_{n=1}^N(x_n-\bar{x})(x_n-\bar{x})^T\Big\} = N(\bar{x}-\mu)^T\Sigma^{-1}(\bar{x}-\mu) + \mathrm{Tr}\big\{\Sigma^{-1}S_{\bar{x}}\big\}$$
$$v_X = v_0 + N,\qquad k_X = k_0 + N,\qquad m_X = \frac{N\bar{x} + k_0 m_0}{k_X}$$
Hence the posterior distribution for the MVN takes the form $NIW(m_X, k_X, v_X, S_X)$.
4.12
Straightforward calculation.
$$p(x) = N(x\mid\mu, \sigma^2 = 4)$$
$$\sigma^2_{post} = \frac{\sigma_0^2\,\sigma^2}{\sigma^2 + n\sigma_0^2}$$
Since 0.95 of the probability mass of a normal distribution lies within $\pm 1.96\sigma$, we have:
$$n \ge 611$$
$$p(\mu\mid X) \propto p(\mu)\,p(X\mid\mu) = \frac{1}{k^{\frac{D}{2}}}\exp\Big\{-\frac{1}{2}\Big(\frac{1}{k_0}+1\Big)x^T\Sigma^{-1}x + x^Tu + c\Big\}$$
Where we have used:
5 Bayesian statistics
5.1 Proof that a mixture of conjugate priors is indeed conjugate
For 5.69 and 5.70, formally:
$$p(\theta\mid D) = \sum_k p(\theta, k\mid D) = \sum_k p(k\mid D)\,p(\theta\mid k, D)$$
where:
$$p(k\mid D) = \frac{p(k, D)}{p(D)} = \frac{p(k)\,p(D\mid k)}{\sum_{k'} p(k')\,p(D\mid k')}$$
$$\hat{c} = \arg\max_c\{p(c\mid x)\}$$
$$\rho_{\hat{c}} = \big(1 - p(\hat{c}\mid x)\big)\cdot\lambda_s$$
$$\rho_{reject} = \lambda_r$$
We reject when
$$\rho_{\hat{c}} \ge \rho_{reject},$$
or equivalently when:
$$p(\hat{c}\mid x) \le 1 - \frac{\lambda_r}{\lambda_s}$$
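A minimal sketch of the resulting decision rule, assuming a generic substitution cost λs and rejection cost λr; the function name and the example numbers below are made up for illustration.

```python
import numpy as np

def classify_with_reject(post, lambda_s, lambda_r):
    """Return the predicted class index, or -1 to signal 'reject'.

    post      : array of posterior probabilities p(c | x)
    lambda_s  : cost of a substitution (misclassification) error
    lambda_r  : cost of rejecting
    """
    c_hat = int(np.argmax(post))
    # Reject when the expected cost of classifying exceeds the cost of rejecting,
    # i.e. when p(c_hat | x) <= 1 - lambda_r / lambda_s.
    if post[c_hat] <= 1.0 - lambda_r / lambda_s:
        return -1
    return c_hat

print(classify_with_reject(np.array([0.55, 0.45]), lambda_s=10.0, lambda_r=1.0))  # -1 (reject)
print(classify_with_reject(np.array([0.95, 0.05]), lambda_s=10.0, lambda_r=1.0))  # 0
```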
We have:
$$\frac{\partial}{\partial Q}E(\pi\mid Q) = PQf(Q) - C\int_0^Q f(D)\,dD - CQf(Q) + (P-C)\int_Q^{+\infty} f(D)\,dD - (P-C)Qf(Q)$$
Setting it to zero and making use of $\int_0^Q f(D)\,dD + \int_Q^{+\infty} f(D)\,dD = 1$:
$$\int_0^{Q^*} f(D)\,dD = F(Q^*) = \frac{P-C}{P}$$
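A small numerical illustration of the critical-fractile result, assuming the standard newsvendor profit π = P·min(D, Q) − C·Q and a uniform demand distribution chosen arbitrarily; the closed-form quantile and a brute-force Monte Carlo search agree.

```python
import numpy as np

# Newsvendor sketch: demand D ~ Uniform(0, 100), price P, cost C.
# The optimal order quantity satisfies F(Q*) = (P - C) / P.
P, C = 5.0, 2.0
lo, hi = 0.0, 100.0

q_star = lo + (hi - lo) * (P - C) / P        # inverse CDF of the uniform at (P-C)/P

# Monte Carlo check: expected profit as a function of Q is maximised near q_star.
rng = np.random.default_rng(0)
D = rng.uniform(lo, hi, size=200000)

def expected_profit(Q):
    return np.mean(P * np.minimum(D, Q) - C * Q)

grid = np.linspace(lo, hi, 101)
best_Q = grid[np.argmax([expected_profit(Q) for Q in grid])]
print(q_star, best_Q)   # both close to 60
```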
We also have:
Subtracting the right side from the left side gives:
$$-KL(p_{BMA}\,\|\,p_m) \le 0$$
Hence the left side is never greater than the right side.
$$p(x=0, y=0) = (1-\theta_1)\theta_2$$
$$p(x=0, y=1) = (1-\theta_1)(1-\theta_2)$$
$$p(x=1, y=0) = \theta_1(1-\theta_2)$$
$$p(x=1, y=1) = \theta_1\theta_2$$
These can be combined into the log-likelihood, hence:
$$\theta_{ML} = \arg\max_\theta\Big(N\ln\big((1-\theta_1)(1-\theta_2)\big) + N_x\ln\frac{\theta_1}{1-\theta_1} + N_{I(x=y)}\ln\frac{\theta_2}{1-\theta_2}\Big)$$
The two parameters can be estimated independently given X and Y.
We can further rewrite the joint distribution in terms of four free parameters $\theta_{x,y} = p(x, y)$. Then
$$\theta_{ML} = \arg\max_\theta\Big(\sum_{x,y} N_{x,y}\ln\theta_{x,y}\Big)$$
6 Frequentist statistics
The philosophy behind this chapter is beyond the scope of probabilistic ML; you should be able to find solutions to the four listed problems in any decent textbook on mathematical statistics.
Good luck.
7 Linear regression
7.1 Behavior of training set error with increasing sample size
When the training set is small, the trained model overfits the available data, so the training accuracy can be very high. As the training set grows, the model has to fit more general patterns, which reduces the overfitting effect and therefore lowers the training accuracy.
As pointed out in Section 7.5.4, enlarging the training set is an important way of countering overfitting, besides adding a regularizer.
So:
$$\frac{\partial}{\partial w}NLL(w) = 2X^TXw - 2X^Ty + 2\lambda w$$
Therefore:
$$w = (X^TX + \lambda I)^{-1}X^Ty$$
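As a sanity check of the closed-form ridge solution, the following sketch (synthetic data and an arbitrary λ, added for illustration) verifies that the gradient of the penalized objective vanishes at $(X^TX + \lambda I)^{-1}X^Ty$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (arbitrary) for checking the closed-form ridge solution.
N, D, lam = 100, 5, 0.5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

# w = (X^T X + lambda I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# The gradient of the penalised NLL at the solution should be (numerically) zero.
grad = 2 * X.T @ (X @ w_ridge - y) + 2 * lam * w_ridge
print(w_ridge)
print(np.allclose(grad, 0.0, atol=1e-8))  # True
```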
$$p(D\mid w, \sigma^2) = p(y\mid w, \sigma^2, X) = \prod_{n=1}^N p(y_n\mid x_n, w, \sigma^2) = \prod_{n=1}^N N(y_n\mid w^Tx_n, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(y_n - w^Tx_n)^2\Big\}$$
As for $\sigma^2$:
$$\frac{\partial}{\partial\sigma^2}\log p(D\mid w, \sigma^2) = -\frac{N}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{n=1}^N(y_n - w^Tx_n)^2$$
We have:
$$\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^N(y_n - w^Tx_n)^2$$
$$w_{0,ML} = \frac{1}{N}\sum_{n=1}^N(y_n - w^Tx_n) = \bar{y} - w^T\bar{x}$$
Centering X and y:
$$X_c = X - \bar{X},\qquad y_c = y - \bar{y}$$
$$(n+1)\,C_{xy}^{(n+1)} = n\,C_{xy}^{(n)} + x_{n+1}y_{n+1} + n\,\bar{x}^{(n)}\bar{y}^{(n)} - (n+1)\,\bar{x}^{(n+1)}\bar{y}^{(n+1)}$$
Expand $C_{xy}$ on both sides and use $\bar{x}^{(n+1)} = \bar{x}^{(n)} + \frac{1}{n+1}(x_{n+1} - \bar{x}^{(n)})$.
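The same kind of recursive update can be exercised in code. The sketch below does not use the exercise's exact notation; it uses the equivalent Welford-style co-moment update (an assumption of this illustration), maintaining running means and the centered cross term one observation at a time and comparing them with the batch quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Online (recursive) sufficient statistics for simple linear regression:
# running means plus the centered cross term C_xy = sum_i (x_i - x_bar)(y_i - y_bar).
x_bar = y_bar = c_xy = 0.0
for n, (xn, yn) in enumerate(zip(x, y), start=1):
    x_bar_new = x_bar + (xn - x_bar) / n                    # new running mean of x
    c_xy += (xn - x_bar) * (yn - y_bar) * (n - 1) / n        # recursive co-moment update
    x_bar = x_bar_new
    y_bar = y_bar + (yn - y_bar) / n                         # new running mean of y

# Compare with the batch quantities.
print(c_xy, np.sum((x - x.mean()) * (y - y.mean())))   # should match
print(x_bar, x.mean())
```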
Problems e and f: practice by yourself.
Reduce it to:
$$p(w) = N(w\mid w_0, V_0) \propto \exp\Big\{-\frac{1}{2}V^{-1}_{0,11}(w_0 - w_{00})^2 - \frac{1}{2}V^{-1}_{0,22}(w_1 - w_{01})^2 - V^{-1}_{0,12}(w_0 - w_{00})(w_1 - w_{01})\Big\}$$
Formally, we take:
$$w_{01} = 0,\qquad V^{-1}_{0,22} = 1,\qquad V^{-1}_{0,11} = V^{-1}_{0,12} = 0,\qquad w_{00}\ \text{arbitrary}$$
In problem c, we consider the posterior distribution over the parameters:
$$p(w\mid D, \sigma^2) \propto N(w\mid m_0, V_0)\prod_{n=1}^N N(y_n\mid w_0 + w_1x_n, \sigma^2)$$
$$\Sigma_{XX} = X^TX,\qquad \Sigma_{YX} = Y^TX$$
Using the conclusion from Section 4.3.1:
$$p(Y\mid X = x) = N(Y\mid\mu_{Y|X}, \Sigma_{Y|X})$$
where:
$$\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X) = Y^TX(X^TX)^{-1}X = w^TX$$
$$p(w, \sigma^2\mid D) \propto \frac{b_0^{a_0}}{(2\pi)^{\frac{D}{2}}|V_0|^{\frac{1}{2}}\Gamma(a_0)}\,(\sigma^2)^{-(a_0+\frac{D}{2}+1)}\cdot\exp\Big\{-\frac{(w-w_0)^TV_0^{-1}(w-w_0) + 2b_0}{2\sigma^2}\Big\}\cdot(\sigma^2)^{-\frac{N}{2}}\cdot\exp\Big\{-\frac{\sum_{n=1}^N(y_n - w^Tx_n)^2}{2\sigma^2}\Big\}$$
Comparing the coefficients of $\sigma^2$:
$$a_N = a_0 + \frac{N}{2}$$
Comparing the coefficients of $w^Tw$:
$$V_N^{-1} = V_0^{-1} + \sum_{n=1}^N x_nx_n^T = V_0^{-1} + X^TX$$
Thus:
$$w_N = V_N(V_0^{-1}w_0 + X^Ty)$$
8 Logistic regression
8.1 Spam classification using logistic regression
Practice by yourself.
$$\frac{\partial}{\partial a}\sigma(a) = \frac{\exp(-a)}{(1+\exp(-a))^2} = \frac{1}{1+e^{-a}}\cdot\frac{e^{-a}}{1+e^{-a}} = \sigma(a)(1-\sigma(a))$$
$$g(w) = \frac{\partial}{\partial w}NLL(w) = -\sum_{i=1}^N\frac{\partial}{\partial w}\big[y_i\log\mu_i + (1-y_i)\log(1-\mu_i)\big] = -\sum_{i=1}^N\Big[y_i\frac{1}{\mu_i}\mu_i(1-\mu_i)x_i - (1-y_i)\frac{1}{1-\mu_i}\mu_i(1-\mu_i)x_i\Big] = \sum_{i=1}^N\big(\sigma(w^Tx_i) - y_i\big)x_i$$
Since $S = \mathrm{diag}\big(\mu_i(1-\mu_i)\big)$ has strictly positive entries, $v^TSv > 0$ for any $v\neq 0$; taking $v = Xu$ shows $u^TX^TSXu > 0$, so $X^TSX$ is positive definite.
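The gradient $X^T(\mu - y)$ and Hessian $X^TSX$ can be checked with a few Newton steps; this is only an illustrative sketch with arbitrary synthetic data and generating weights, not part of the original solution.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad_hess(w, X, y):
    """Gradient and Hessian of the logistic-regression NLL.

    g = X^T (mu - y),   H = X^T S X,  with  S = diag(mu * (1 - mu)).
    """
    mu = sigmoid(X @ w)
    g = X.T @ (mu - y)
    S = np.diag(mu * (1.0 - mu))
    H = X.T @ S @ X
    return g, H

# A few Newton steps on synthetic data (arbitrary) as a sanity check.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (sigmoid(X @ np.array([1.0, -2.0, 0.5])) > rng.uniform(size=200)).astype(float)
w = np.zeros(3)
for _ in range(10):
    g, H = nll_grad_hess(w, X, y)
    w -= np.linalg.solve(H, g)
print(w)   # roughly recovers the generating weights [1, -2, 0.5]
```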
$$\sum_{n=1}^N\Big(-y_{n*}x_n + \frac{\exp(w_*^Tx_n)}{\sum_{c=1}^C\exp(w_c^Tx_n)}x_n\Big) = \sum_{n=1}^N(\mu_{n*} - y_{n*})x_n$$
Combining the independent solutions for all classes into one matrix yields 8.38.
To solve for the Hessian matrix, consider taking the gradient w.r.t. $w_1$ and $w_2$:
$$H_{1,2} = \nabla_{w_2}\nabla_{w_1}NLL(W) = \frac{\partial}{\partial w_2}\sum_{n=1}^N(\mu_{n1} - y_{n1})x_n$$
When $w_1$ and $w_2$ are the same:
$$\frac{\partial}{\partial w_1}\sum_{n=1}^N(\mu_{n1} - y_{n1})x_n^T = \sum_{n=1}^N\frac{\partial\mu_{n1}}{\partial w_1}x_n^T = \sum_{n=1}^N\frac{\exp(w_1^Tx_n)\big(\sum_c\exp(w_c^Tx_n)\big) - \exp(w_1^Tx_n)^2}{\big(\sum_c\exp(w_c^Tx_n)\big)^2}\,x_nx_n^T = \sum_{n=1}^N\mu_{n1}(1-\mu_{n1})\,x_nx_n^T$$
When they differ, the same computation gives
$$\sum_{n=1}^N -\mu_{n1}\mu_{n2}\,x_nx_n^T,$$
which ends in 8.44.
The condition $\sum_c y_{nc} = 1$ is used to go from 8.34 to 8.35.
$$p(w\mid D) \propto p(D\mid w)\,p(w),\qquad p(w) = N(w\mid 0, \sigma^2 I)$$
$$NLL(w) = -\log p(w\mid D) = -\log p(D\mid w) + \frac{1}{2\sigma^2}w^Tw + c$$
Therefore:
$$\lambda = \frac{1}{2\sigma^2}$$
The number of zeros in the global optimum is related to the value of λ, which is negatively correlated with the prior uncertainty of w. The smaller the uncertainty, the more strongly w is pulled towards zero, which results in more zeros in the answer.
If λ = 0, the prior uncertainty goes to infinity and the posterior estimate converges to the MLE. Since there is then no constraint on w, some components of w may go to infinity.
When λ increases, the prior uncertainty shrinks and hence the overfitting effect is reduced. Generally this implies a decrease in training-set accuracy.
At the same time, this tends to increase the accuracy of the model on the test set, although this does not always happen.
Thus the two distributions are equal up to a renaming of the variables.
$$Q(\theta, \theta^{old}) = E_{p(z\mid D, \theta^{old})}\Big[\sum_{n=1}^N\log p(x_n, z_n\mid\theta)\Big] = \sum_{n=1}^N E\Big[\log\prod_{k=1}^K\big(\pi_k\,p(x_n\mid z_k, \theta)\big)^{z_{nk}}\Big] = \sum_{n=1}^N\sum_{k=1}^K r_{nk}\log\big(\pi_k\,p(x_n\mid z_k, \theta)\big)$$
where:
$$r_{nk} = p(z_{nk} = 1\mid x_n, \theta^{old})$$
Obtain 11.32:
$$\Sigma_k = \frac{\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N r_{nk}}$$
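The E and M steps derived above translate directly into code. The following sketch (arbitrary synthetic data, no log-sum-exp safeguards, added here only as an illustration) implements one EM iteration for a Gaussian mixture using the responsibilities $r_{nk}$ and the update 11.32.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a Gaussian mixture, following the updates above."""
    N, D = X.shape
    K = len(pis)

    # E step: responsibilities r_nk = p(z_nk = 1 | x_n, theta_old).
    r = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)])
    r /= r.sum(axis=1, keepdims=True)

    # M step: weighted means, covariances (Eq. 11.32) and mixing weights.
    Nk = r.sum(axis=0)
    new_mus = (r.T @ X) / Nk[:, None]
    new_Sigmas = []
    for k in range(K):
        diff = X - new_mus[k]
        new_Sigmas.append((r[:, k, None] * diff).T @ diff / Nk[k])
    new_pis = Nk / N
    return new_pis, new_mus, new_Sigmas

# Tiny smoke test on synthetic 2-component data (arbitrary parameters).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(100, 2)),
               rng.normal([2, 1], 0.8, size=(100, 2))])
pis, mus, Sigmas = np.ones(2) / 2, np.array([[-1.0, 0.0], [1.0, 0.0]]), [np.eye(2)] * 2
for _ in range(20):
    pis, mus, Sigmas = em_step(X, pis, mus, Sigmas)
print(mus)   # approximately [-2, 0] and [2, 1]
```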
Introduce a Lagrange multiplier to enforce the constraint $\sum_j\mu_{kj} = 1$; then the condition for the derivative to be zero is:
$$\mu_{kj} = \frac{\sum_{n=1}^N r_{nk}x_{nj}}{\lambda}$$
Summing over all j:
$$1 = \sum_{j=1}^D\mu_{kj} = \frac{1}{\lambda}\sum_{j=1}^D\sum_{n=1}^N r_{nk}x_{nj} = \frac{1}{\lambda}\sum_{n=1}^N r_{nk}\sum_{j=1}^D x_{nj} = \frac{\sum_{n=1}^N r_{nk}}{\lambda}$$
This results in:
$$\lambda = \sum_{n=1}^N r_{nk}$$
Hence 11.116.
Introduce a prior:
$$p(\mu_{k0}) \propto \mu_{k0}^{\alpha-1}\mu_{k1}^{\beta-1}$$
$$\lambda = \sum_{n=1}^N r_{nk} + \alpha + \beta - 2$$
Hence 11.117.
$$\frac{v}{2}\log\Big(\frac{v}{2}\Big) - \log\Gamma\Big(\frac{v}{2}\Big) + \Big(\frac{v}{2}-1\Big)\log(z) - \frac{v}{2}z$$
Collecting the terms involving v:
$$l_v(x, z) = \frac{v}{2}\log\Big(\frac{v}{2}\Big) - \log\Gamma\Big(\frac{v}{2}\Big) + \frac{v}{2}\big(\log(z) - z\big)$$
The likelihood w.r.t. v on the complete data set is:
$$L_v = \frac{Nv}{2}\log\Big(\frac{v}{2}\Big) - N\log\Gamma\Big(\frac{v}{2}\Big) + \frac{v}{2}\sum_{n=1}^N\big(\log(z_n) - z_n\big)$$
For µ and Σ:
$$l_{\mu,\Sigma}(x, z) = -\frac{1}{2}\log|\Sigma| - \frac{z}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)$$
$$L_{\mu,\Sigma} = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{n=1}^N z_n(x_n-\mu)^T\Sigma^{-1}(x_n-\mu)$$
Hence this has the same form as the (weighted) MLE for an MVN.
$$l(\theta) = \sum_{n=1}^N\log p(x_n\mid\theta)$$
Taking the derivative w.r.t. $\mu_k$:
$$\frac{\partial}{\partial\mu_k}l(\theta) = \sum_{n=1}^N\frac{\pi_k N(x_n\mid\mu_k, \Sigma_k)\,\nabla_{\mu_k}\big(-\frac{1}{2}(x_n-\mu_k)^T\Sigma_k^{-1}(x_n-\mu_k)\big)}{\sum_{k'=1}^K\pi_{k'}N(x_n\mid\mu_{k'}, \Sigma_{k'})} = \sum_{n=1}^N r_{nk}\Sigma_k^{-1}(x_n - \mu_k)$$
w.r.t. $\pi_k$:
$$\frac{\partial}{\partial\pi_k}l(\theta) = \sum_{n=1}^N\frac{N(x_n\mid\mu_k, \Sigma_k)}{\sum_{k'=1}^K\pi_{k'}N(x_n\mid\mu_{k'}, \Sigma_{k'})} = \frac{1}{\pi_k}\sum_{n=1}^N r_{nk}$$
where:
$$\nabla_{\Sigma_k}N(x\mid\mu_k, \Sigma_k) = \frac{1}{(2\pi)^{\frac{D}{2}}|\Sigma_k|^{\frac{1}{2}}}\exp\Big\{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Big\}\Big[\nabla_{\Sigma_k}\Big(-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\Big) - \frac{1}{2}\Sigma_k^{-1}\Big] = N(x\mid\mu_k, \Sigma_k)\,\nabla_{\Sigma_k}\log N(x\mid\mu_k, \Sigma_k)$$
Thus we have:
$$\Sigma_k = \frac{\sum_{n=1}^N r_{nk}(x_n-\mu_k)(x_n-\mu_k)^T}{\sum_{n=1}^N r_{nk}}$$
$$p(J_n = j, K_n = k\mid x_n) = \frac{p(J_n = j, K_n = k, x_n)}{p(x_n)} = \frac{p(J_n = j)\,p(K_n = k)\,p(x_n\mid J_n = j, K_n = k)}{\sum_{J_n, K_n}p(J_n, K_n, x_n)} = \frac{p_jq_kN(x_n\mid\mu_j, \sigma_k^2)}{\sum_{J_n=1}^m\sum_{K_n=1}^l p_{J_n}q_{K_n}N(x_n\mid\mu_{J_n}, \sigma_{K_n}^2)}$$
$$= \sum_{n=1}^N\sum_{j=1}^m\sum_{k=1}^l E\big(I(J_n = j, K_n = k)\big)\big(\log p_j + \log q_k + \log N(x_n\mid\mu_j, \sigma_k^2)\big) = \sum_{n,j,k}r_{njk}\log p_j + \sum_{n,j,k}r_{njk}\log q_k + \sum_{n,j,k}r_{njk}\log N(x_n\mid\mu_j, \sigma_k^2)$$
$$= \sum_k\pi_k\int xx^TN(x\mid\mu_k, \Sigma_k)\,dx$$
where:
$$\int xx^TN(x\mid\mu_k, \Sigma_k)\,dx = E_{N(\mu_k,\Sigma_k)}(xx^T) = \mathrm{cov}_{N(\mu_k,\Sigma_k)}(x) + E_{N(\mu_k,\Sigma_k)}(x)E_{N(\mu_k,\Sigma_k)}(x)^T = \Sigma_k + \mu_k\mu_k^T$$
Therefore:
$$\mathrm{cov}(x) = \sum_k\pi_k(\Sigma_k + \mu_k\mu_k^T) - E(x)E(x)^T$$
$$= n_k^2s^2 + n_k(n_ks^2) = 2n_k^2s^2$$
Thus 11.131.
$$p(x\mid z) = \prod_{c=1}^C\Big(\frac{1}{\sqrt{2\pi\sigma_c^2}}\exp\Big\{-\frac{1}{2\sigma_c^2}(x-\mu_c)^2\Big\}\Big)^{z_c}$$
The log of the joint distribution is:
$$\log p(x, z) = \log\prod_{c=1}^C\Big(\frac{\pi_c}{\sqrt{2\pi\sigma_c^2}}\exp\Big\{-\frac{1}{2\sigma_c^2}(x-\mu_c)^2\Big\}\Big)^{z_c} = \sum_{c=1}^C z_c\Big(\log\pi_c - \frac{1}{2}\log 2\pi\sigma_c^2 - \frac{1}{2\sigma_c^2}(x-\mu_c)^2\Big)$$
This means:
$$w^T = \Big(\sum_{n=1}^N z_ny_nx_n^T\Big)\Big(\sum_{n=1}^N z_nx_nx_n^T\Big)^{-1}$$
$t^2$ satisfies:
$$\sum_{j=1}^D\frac{(t^2 + \sigma_j^2) - (\bar{x}_j - \mu)^2}{(t^2 + \sigma_j^2)^2} = 0$$
And we have:
$$E[\epsilon_i\mid\epsilon_i\ge A] = \frac{1}{p(\epsilon_i\ge A)}\int_A^{+\infty}\epsilon_i\,N(\epsilon_i\mid 0, 1)\,d\epsilon_i = \frac{\phi(A)}{1-\Phi(A)} = H(A)$$
In the last step we use 11.141 and 11.139. Similarly:
$$E[\epsilon_i^2\mid\epsilon_i\ge A] = \frac{1}{p(\epsilon_i\ge A)}\int_A^{+\infty}w^2N(w\mid 0, 1)\,dw = \frac{1-\Phi(A)+A\phi(A)}{1-\Phi(A)}$$
Plugging this into the conclusion drawn from question a:
$$E[z_i^2\mid z_i\ge c_i] = \mu_i^2 + 2\mu_i\sigma H(A) + \sigma^2\,\frac{1-\Phi(A)+A\phi(A)}{1-\Phi(A)}$$
$$p(x\mid z) = N(x\mid Wz, \Psi)$$
and:
$$p(z\mid x) = N(z\mid m, \Sigma),\qquad \Sigma = (I + W^T\Psi^{-1}W)^{-1},\qquad m = \Sigma W^T\Psi^{-1}x_n$$
The prior term $\log p(z)$, whose parameters 0 and I are fixed, can be omitted; hence:
$$Q(\theta, \theta^{old}) = E_{\theta^{old}}\Big[\sum_{n=1}^N\log p(x_n\mid z_n, \theta)\Big] = E\Big[\sum_{n=1}^N\Big(c - \frac{1}{2}\log|\Psi| - \frac{1}{2}(x_n - Wz_n)^T\Psi^{-1}(x_n - Wz_n)\Big)\Big]$$
$$= C - \frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_{n=1}^N E\big[(x_n - Wz_n)^T\Psi^{-1}(x_n - Wz_n)\big]$$
$$= C - \frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_{n=1}^N x_n^T\Psi^{-1}x_n - \frac{1}{2}\sum_{n=1}^N E[z_n^TW^T\Psi^{-1}Wz_n] + \sum_{n=1}^N x_n^T\Psi^{-1}WE[z_n]$$
$$= C - \frac{N}{2}\log|\Psi| - \frac{1}{2}\sum_{n=1}^N x_n^T\Psi^{-1}x_n - \frac{1}{2}\sum_{n=1}^N\mathrm{Tr}\big(W^T\Psi^{-1}WE[z_nz_n^T]\big) + \sum_{n=1}^N x_n^T\Psi^{-1}WE[z_n]$$
$$E[z_nz_n^T\mid x_n] = \mathrm{cov}(z_n\mid x_n) + E[z_n\mid x_n]E[z_n\mid x_n]^T = \Sigma + (\Sigma W^T\Psi^{-1}x)(\Sigma W^T\Psi^{-1}x)^T$$
From now on, $x$ and $\theta^{old}$ are omitted from the conditioning when calculating expectations.
Optimizing w.r.t. W:
$$\frac{\partial}{\partial W}Q = \sum_{n=1}^N\Psi^{-1}x_nE[z_n]^T - \sum_{n=1}^N\Psi^{-1}WE[z_nz_n^T]$$
Setting it to zero:
$$W = \Big(\sum_{n=1}^N x_nE[z_n]^T\Big)\Big(\sum_{n=1}^N E[z_nz_n^T]\Big)^{-1}$$
$$z_{m2} = v_2^Tx_m$$
where:
$$\Lambda = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots\},\qquad O^T = \{u_1, u_2, \ldots\}$$
are the eigenvalues and eigenvectors, the latter orthonormal, $u_i^Tu_j = I(i=j)$, with $u_1 = v_1$.
Under the constraints $v_2^Tv_2 = 1$ and $v_2^Tv_1 = 0$, we are to minimize:
$$(Ov_2)^T\Lambda(Ov_2)$$
$$u_i^Tv_2 = I(i = 2)$$
Therefore:
v2 = u2
$$\Big\|x_n - \sum_{j=1}^K z_{nj}v_j\Big\|^2 = \Big(x_n - \sum_{j=1}^K z_{nj}v_j\Big)^T\Big(x_n - \sum_{j=1}^K z_{nj}v_j\Big) = x_n^Tx_n + \sum_{j=1}^K z_{nj}^2 - 2\sum_{j=1}^K z_{nj}\,x_n^Tv_j$$
Plugging in $v_j^TCv_j = \lambda_j$ and summing over n gives the conclusion in b.
Plugging K = d into the conclusion in b, we have:
$$J_{K=d} = \frac{1}{N}\sum_{n=1}^N x_n^Tx_n - \sum_{j=1}^d\lambda_j = 0$$
In the general case:
$$J_K = \sum_{j=1}^d\lambda_j - \sum_{j=1}^K\lambda_j = \sum_{j=K+1}^d\lambda_j$$
p(x|z) = N (x|Wz, σ 2 I)
12.11 PPCA vs FA
Practice by yourself.
13 Sparse linear models
Straightforwardly:
$$\frac{\partial}{\partial w_j}RSS(w) = \sum_{n=1}^N 2(y_n - w^Tx_n)(-x_{nj}) = -\sum_{n=1}^N 2\Big(x_{nj}y_n - x_{nj}\sum_{i=1}^D w_ix_{ni}\Big) = -\sum_{n=1}^N 2\Big(x_{nj}y_n - x_{nj}\sum_{i\neq j}w_ix_{ni} - x_{nj}^2w_j\Big)$$
with $w_j$'s coefficient:
$$a_j = 2\sum_{n=1}^N x_{nj}^2$$
In the end:
$$w_j = \frac{c_j}{a_j}$$
p(y|x, w, β) = N (y|Xw, β −1 )
A = diag(α)
Using $X^TX = I$:
$$RSS(w) = c + w^Tw - 2y^TXw$$
We have:
$$\hat{w}_k^{OLS} = \sum_{n=1}^N y_nx_{nk}$$
In ridge regression:
$$\hat{w}_k^{ridge} = \frac{\sum_{n=1}^N y_nx_{nk}}{1+\lambda}$$
The solution for lasso regression using the subderivative is worked out in Section 13.3.2, which concludes in 13.63:
$$\hat{w}_k^{lasso} = \mathrm{sign}(\hat{w}_k^{OLS})\Big(|\hat{w}_k^{OLS}| - \frac{\lambda}{2}\Big)_+$$
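The three estimators for an orthonormal design can be compared coordinate-wise; the sketch below is only an illustration of the shrinkage behaviour, with an arbitrary λ.

```python
import numpy as np

def w_ols(w):            # no shrinkage
    return w

def w_ridge(w, lam):     # uniform shrinkage towards zero
    return w / (1.0 + lam)

def w_lasso(w, lam):     # soft thresholding: sign(w) * max(|w| - lam/2, 0)
    return np.sign(w) * np.maximum(np.abs(w) - lam / 2.0, 0.0)

# With an orthonormal design, each coordinate is transformed independently.
w = np.linspace(-3, 3, 7)
lam = 1.0
print(w_ols(w))
print(w_ridge(w, lam))
print(w_lasso(w, lam))   # entries with |w| <= 0.5 are set exactly to zero
```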
Looking at Figure 13.24, it is easy to identify the black line as OLS, the gray one as ridge, and the dotted one as lasso, with λ1 = λ2 = 1. Note that ridge shrinks the estimate towards the horizontal axis, while lasso shrinks it sharply to exactly zero below a certain threshold.
$$p(\gamma\mid\alpha_1, \alpha_2) = \prod_{d=1}^D p(\gamma_d\mid\alpha_1, \alpha_2)$$
Integrating out $\pi_d$:
$$p(\gamma_d\mid\alpha_1, \alpha_2) = \int p(\gamma_d\mid\pi_d)\,p(\pi_d\mid\alpha_1, \alpha_2)\,d\pi_d = \frac{1}{B(\alpha_1, \alpha_2)}\int\pi_d^{\gamma_d}(1-\pi_d)^{1-\gamma_d}\,\pi_d^{\alpha_1-1}(1-\pi_d)^{\alpha_2-1}\,d\pi_d = \frac{1}{B(\alpha_1, \alpha_2)}\int\pi_d^{\alpha_1+\gamma_d-1}(1-\pi_d)^{\alpha_2+1-\gamma_d-1}\,d\pi_d = \frac{B(\alpha_1+\gamma_d, \alpha_2+1-\gamma_d)}{B(\alpha_1, \alpha_2)} = \frac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\cdot\frac{\Gamma(\alpha_1+\gamma_d)\Gamma(\alpha_2+1-\gamma_d)}{\Gamma(\alpha_1+\alpha_2+1)}$$
$$\mathrm{Lap}\Big(w_j\,\Big|\,0, \frac{1}{\gamma}\Big) = \int N(w_j\mid 0, \tau_j^2)\,\mathrm{Ga}\Big(\tau_j^2\,\Big|\,1, \frac{\gamma^2}{2}\Big)\,d\tau_j^2$$
Take the Laplace transform / generating transform of both sides:
To calculate:
$$E\Big[\frac{1}{\tau_j^2}\,\Big|\,w_j\Big] = \int\frac{1}{\tau_j^2}\,p(\tau_j^2\mid w_j)\,d\tau_j^2 = \int\frac{1}{\tau_j^2}\,\frac{p(w_j\mid\tau_j^2)\,p(\tau_j^2)}{p(w_j)}\,d\tau_j^2 = \frac{1}{p(w_j)}\int\frac{1}{\tau_j^2}\,N(w_j\mid 0, \tau_j^2)\,p(\tau_j^2)\,d\tau_j^2$$
Because:
$$\frac{d}{dw}\log p(w) = \frac{1}{p(w)}\frac{d}{dw}p(w)$$
this gives 13.197:
$$\frac{1}{p(w_j)}\cdot\frac{-1}{|w_j|}\frac{d}{dw_j}p(w_j) = -\frac{1}{|w_j|}\frac{d}{dw_j}\log p(w_j)$$
Hence:
$$p(\tau^2, w, y\mid X, \gamma) \propto \exp\Big\{-\frac{1}{2}\sum_{d=1}^D\Big(\gamma^2\tau_d^2 + \frac{w_d^2}{\tau_d^2}\Big)\Big\}\cdot\prod_{d=1}^D\frac{1}{\tau_d}\cdot\prod_{n=1}^N\Phi(w^Tx_n)^{y_n}\big(1-\Phi(w^Tx_n)\big)^{1-y_n}$$
In $Q(\theta^{new}, \theta^{old})$, the expectation is taken w.r.t. $\theta^{old}$. We have taken w as the parameter and $\tau^2$ as the latent variable; thus:
With $u\ge 0$, $v\ge 0$, denote:
$$z = \begin{pmatrix}u\\ v\end{pmatrix}$$
Rewrite the original objective as:
$$\min_z f(z) = c^Tz + \frac{1}{2}z^TAz$$
$$\text{s.t.}\quad z\ge 0$$
where:
$$c = \begin{pmatrix}\lambda 1_n - X^Ty\\ \lambda 1_n + X^Ty\end{pmatrix},\qquad A = \begin{pmatrix}X^TX & -X^TX\\ -X^TX & X^TX\end{pmatrix}$$
The gradient is given by:
$$\nabla f(z) = c + Az$$
During each iteration the gradient step is followed by a projection onto the feasible set $z\ge 0$:
$$z_{k+1} = \big(z_k - \alpha_kg_k\big)_+$$
The original paper suggests a more delicate scheme for choosing the step size; refer to "Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems", Mário A. T. Figueiredo et al.
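A minimal sketch of this projected gradient scheme, using a fixed step size of $1/\lambda_{\max}(A)$ instead of the line search of the GPSR paper; the synthetic problem and the regularization strength are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# L1-regularised least squares written as a bound-constrained QP in z = (u, v),
# w = u - v, u, v >= 0 (the split used above), solved by projected gradient descent.
N, D, lam = 50, 10, 5.0
X = rng.normal(size=(N, D))
w_true = np.zeros(D); w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=N)

XtX, Xty = X.T @ X, X.T @ y
c = np.concatenate([lam * np.ones(D) - Xty, lam * np.ones(D) + Xty])
A = np.block([[XtX, -XtX], [-XtX, XtX]])

z = np.zeros(2 * D)
step = 1.0 / np.linalg.norm(A, 2)          # 1 / largest eigenvalue of A
for _ in range(2000):
    g = c + A @ z                           # gradient of f(z) = c^T z + z^T A z / 2
    z = np.maximum(z - step * g, 0.0)       # gradient step, then projection onto z >= 0

w_hat = z[:D] - z[D:]
print(np.round(w_hat, 2))                   # approximately recovers the sparse w_true
```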
14 Kernels
15 Gaussian processes
15.1 Reproducing property
We denote $\kappa(x_1, x)$ by $f(x)$ and $\kappa(x_2, x)$ by $g(x)$. From the definition:
$$f(x) = \sum_{i=1}^\infty f_i\phi_i(x),\qquad \kappa(x_1, x) = \sum_{i=1}^\infty\lambda_i\phi_i(x_1)\phi_i(x)$$
so that
$$f_i = \lambda_i\phi_i(x_1),\qquad g_i = \lambda_i\phi_i(x_2)$$
Therefore:
$$\langle f, g\rangle = \sum_{i=1}^\infty\frac{f_ig_i}{\lambda_i} = \sum_{i=1}^\infty\lambda_i\phi_i(x_1)\phi_i(x_2) = \kappa(x_1, x_2)$$
$$+ \sum_{i=1}^N\sum_{t=1}^{T_i}\sum_{k=1}^K I[z_{i,t} = k]\,\log p(x_{i,t}\mid z_{i,t} = k, B)\Big]$$
$$+ \sum_{i=2}^T\Big[N_i\log\frac{1}{|Q_i|^{\frac{1}{2}}} - \frac{1}{2}\sum_{n=1}^{N_i}(z_{n,i} - A_iz_{n,i-1} - B_iu_i)^TQ_i^{-1}(z_{n,i} - A_iz_{n,i-1} - B_iu_i)\Big] + \sum_{i=2}^T\Big[N_i\log\frac{1}{|R_i|^{\frac{1}{2}}} - \frac{1}{2}\sum_{n=1}^{N_i}(y_{n,i} - C_iz_{n,i} - D_iu_i)^TR_i^{-1}(y_{n,i} - C_iz_{n,i} - D_iu_i)\Big]$$
When exchanging the order of summation over the data, we take $T = \max_n\{T_n\}$ and let $N_i$ denote the number of sequences of length at least i.
To estimate $\mu_0$, take the related terms:
$$E\Big[-\frac{1}{2}\sum_{n=1}^N(z_{n,1} - \mu_0)^T\Sigma_0^{-1}(z_{n,1} - \mu_0)\Big]$$
It is straightforward to give:
$$\frac{\partial\log Z(\theta)}{\partial\theta_{c'}} = \frac{\partial}{\partial\theta_{c'}}\log\sum_y\prod_{c\in C}\psi_c(y_c\mid\theta_c) = \frac{1}{Z(\theta)}\sum_y\frac{\partial}{\partial\theta_{c'}}\prod_{c\in C}\psi_c(y_c\mid\theta_c) = \frac{1}{Z(\theta)}\sum_y\prod_{c\in C, c\neq c'}\psi_c(y_c\mid\theta_c)\,\frac{\partial}{\partial\theta_{c'}}\psi_{c'}(y_{c'}\mid\theta_{c'})$$
$$= \frac{1}{Z(\theta)}\sum_y\prod_{c\in C, c\neq c'}\psi_c(y_c\mid\theta_c)\,\frac{\partial}{\partial\theta_{c'}}\exp\big(\theta_{c'}^T\phi_{c'}(y_{c'})\big) = \frac{1}{Z(\theta)}\sum_y\prod_{c\in C}\psi_c(y_c\mid\theta_c)\,\phi_{c'}(y_{c'}) = \sum_y\phi_{c'}(y_{c'})\frac{1}{Z(\theta)}\prod_{c\in C}\psi_c(y_c\mid\theta) = \sum_y\phi_{c'}(y_{c'})\,p(y\mid\theta)$$
And:
$$\Lambda = \Sigma^{-1} = \begin{pmatrix}2 & -1 & 0\\ -1 & 2 & -1\\ 0 & -1 & 2\end{pmatrix}$$
Thus we have the independency $X_1\perp X_2\mid X_3$. This introduces an MRF of the form $X_1 - X_3 - X_2$.
Problem e:
The answer can be derived directly from the marginalization property of the Gaussian: A is true while B is not.
O(r(N c + 1))
and
O(r(N c + N ))
$$p(x_k = 1\mid x_{-k}) = \frac{p(x_k = 1, x_{-k})}{p(x_{-k})} = \frac{p(x_k = 1, x_{-k})}{p(x_k = 0, x_{-k}) + p(x_k = 1, x_{-k})} = \frac{1}{1 + \frac{p(x_k = 0, x_{-k})}{p(x_k = 1, x_{-k})}} = \frac{1}{1 + \frac{\exp(h_k\cdot 0)\prod_{\langle k,i\rangle}\exp(J_{k,i}\cdot 0)}{\exp(h_k\cdot 1)\prod_{\langle k,i\rangle}\exp(J_{k,i}\cdot x_i)}} = \sigma\Big(h_k + \sum_{i=1, i\neq k}^n J_{k,i}x_i\Big)$$
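This full conditional is exactly what a Gibbs sampler uses. The sketch below (random couplings and biases, {0,1} states, no burn-in diagnostics, purely illustrative) resamples each node from σ(h_k + Σ_{i≠k} J_{k,i} x_i).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, h, J, rng):
    """One Gibbs sweep for a Boltzmann machine with {0,1} states.

    Each node is resampled from its full conditional
    p(x_k = 1 | x_-k) = sigma(h_k + sum_{i != k} J_{k,i} x_i).
    """
    n = len(x)
    for k in range(n):
        p1 = sigmoid(h[k] + J[k] @ x - J[k, k] * x[k])   # exclude the self term
        x[k] = float(rng.uniform() < p1)
    return x

# Tiny example: 5 nodes, symmetric random couplings with zero diagonal (arbitrary).
rng = np.random.default_rng(0)
J = rng.normal(scale=0.5, size=(5, 5)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = rng.normal(size=5)
x = rng.integers(0, 2, size=5).astype(float)
for _ in range(1000):
    x = gibbs_sweep(x, h, J, rng)
print(x)
```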
$$N(x\mid\mu_1, \lambda_1^{-1})\times N(x\mid\mu_2, \lambda_2^{-1}) = \frac{\sqrt{\lambda_1\lambda_2}}{2\pi}\exp\Big\{-\frac{\lambda_1}{2}(x-\mu_1)^2 - \frac{\lambda_2}{2}(x-\mu_2)^2\Big\} = \frac{\sqrt{\lambda_1\lambda_2}}{2\pi}\exp\Big\{-\frac{\lambda_1+\lambda_2}{2}x^2 + (\lambda_1\mu_1+\lambda_2\mu_2)x - \frac{\lambda_1\mu_1^2+\lambda_2\mu_2^2}{2}\Big\}$$
$$\exp\Big\{-\frac{\lambda_1+\lambda_2}{2}x^2 + (\lambda_1\mu_1+\lambda_2\mu_2)x - \frac{\lambda_1\mu_1^2+\lambda_2\mu_2^2}{2}\Big\} = c\cdot\exp\Big\{-\frac{\lambda}{2}(x-\mu)^2\Big\}$$
where:
$$\lambda = \lambda_1 + \lambda_2,\qquad \mu = \lambda^{-1}(\lambda_1\mu_1 + \lambda_2\mu_2)$$
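A quick numerical confirmation that the product of the two Gaussian densities, once renormalized, is the Gaussian with precision λ1 + λ2 and precision-weighted mean; the parameter values below are arbitrary.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Check that N(x | mu1, 1/lam1) * N(x | mu2, 1/lam2) is proportional to
# N(x | mu, 1/lam) with lam = lam1 + lam2 and mu = (lam1*mu1 + lam2*mu2) / lam.
mu1, lam1, mu2, lam2 = 1.0, 2.0, -0.5, 3.0
lam = lam1 + lam2
mu = (lam1 * mu1 + lam2 * mu2) / lam

def product(x):
    return norm.pdf(x, mu1, 1 / np.sqrt(lam1)) * norm.pdf(x, mu2, 1 / np.sqrt(lam2))

Z, _ = quad(product, -np.inf, np.inf)            # normalising constant c
xs = np.linspace(-3, 3, 7)
print(np.allclose(product(xs) / Z, norm.pdf(xs, mu, 1 / np.sqrt(lam))))  # True
```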
$$p(G_1 = 1, X_2 = 50) = p(G_1 = 1)\sum_{G_2}p(G_2\mid G_1 = 1)\,p(X_2 = 50\mid G_2)$$
Thus:
$$p(G_1 = 1\mid X_2 = 50) = \frac{0.45 + 0.05\cdot\exp(-5)}{0.5 + 0.5\cdot\exp(-5)} \approx 0.9$$
Problem b (here X denotes $X_2$ or $X_3$):
Hence:
$$P(G_1 = 1\mid X_2 = 50, X_3 = 60) = 0.5$$
21 Variational inference
21.1 Laplace approximation to p(µ, log σ|D) for a univariate Gaussian
The Laplace approximation amounts to representing $f(\mu, l) = \log p(\mu, l = \log\sigma\mid D)$ by its second-order Taylor expansion. We have:
Thus we derive:
$$\frac{\partial\log p(\mu, l\mid D)}{\partial\mu} = \frac{1}{2}\cdot\frac{1}{\exp\{2l\}}\cdot 2\sum_{n=1}^N(y_n - \mu) = \frac{N}{\sigma^2}(\bar{y} - \mu)$$
$$\frac{\partial\log p(\mu, l\mid D)}{\partial l} = -N - \frac{1}{2}\sum_{n=1}^N(y_n - \mu)^2\cdot(-2)\cdot\frac{1}{\exp\{2l\}} = -N + \frac{1}{\sigma^2}\sum_{n=1}^N(y_n - \mu)^2$$
$$\frac{\partial^2\log p(\mu, l\mid D)}{\partial\mu^2} = -\frac{N}{\sigma^2}$$
$$\frac{\partial^2\log p(\mu, l\mid D)}{\partial l^2} = -\frac{2}{\sigma^2}\sum_{n=1}^N(y_n - \mu)^2$$
$$\frac{\partial^2\log p(\mu, l\mid D)}{\partial\mu\,\partial l} = N(\bar{y} - \mu)\cdot(-2)\cdot\frac{1}{\sigma^2} = -\frac{2N}{\sigma^2}(\bar{y} - \mu)$$
where
$$\sum_{n=1}^N(y_n - \mu)^2 = Ns^2 + N(\mu - \bar{y})^2,\qquad s^2 = \frac{1}{N}\sum_{n=1}^N(y_n - \bar{y})^2$$
The conclusions of parts a, b and c are all contained in the solution above.
For 21.211:
For 21.212:
For 21.213:
For 21.214:
For 21.215:
$$\theta = \alpha$$
And:
$$E[\log\pi_k] = \frac{\partial A(\alpha)}{\partial\alpha_k} = \frac{\Gamma'(\alpha_k)}{\Gamma(\alpha_k)} - \frac{\Gamma'\big(\sum_{i=1}^K\alpha_i\big)}{\Gamma\big(\sum_{i=1}^K\alpha_i\big)} = \psi(\alpha_k) - \psi\Big(\sum_{i=1}^K\alpha_i\Big)$$
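The digamma identity can be checked by Monte Carlo: sample π from a Dirichlet and average log π_k. The sketch below uses an arbitrary α and is only an illustration.

```python
import numpy as np
from scipy.special import digamma

# Monte Carlo check of E[log pi_k] = psi(alpha_k) - psi(sum_i alpha_i)
# for pi ~ Dir(alpha), with an arbitrary alpha.
rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.5, 3.0, 1.0])

samples = rng.dirichlet(alpha, size=200000)
mc = np.log(samples).mean(axis=0)
analytic = digamma(alpha) - digamma(alpha.sum())

print(np.round(mc, 3))
print(np.round(analytic, 3))   # the two rows agree to about 2-3 decimal places
```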
Thus:
$$-\sum_{t=1}^T\sum_{k=1}^K x_{t,m,k}\,\tilde{\epsilon}_{t,m,k} = E\Big[\frac{1}{2}\sum_{t=1}^T\Big(y_t - \sum_{l\neq m}W_lx_{t,l}\Big)^T\Sigma^{-1}\Big(y_t - \sum_{l\neq m}W_lx_{t,l}\Big)\Big] + C$$
$$\tilde{\epsilon}_{t,m,k} = \Big(W_m^T\Sigma^{-1}\Big(y_t - \sum_{l\neq m}W_lE[x_{t,l}]\Big)\Big)_k - \frac{1}{2}\big(W_m^T\Sigma^{-1}W_m\big)_{k,k}$$
$$\log q(z_i) = E_{q(x_i)q(w)}\big[\log p(z_i\mid x_i, w) + \log p(y_i\mid z_i)\big]$$