Machine Learning: A Probabilistic Perspective
Solution Manual, Version 1.1

Fangqi Li, SJTU

Contents
1 Introduction
1.1 Constitution of this document
1.2 On Machine Learning: A Probabilistic Perspective
1.3 What is this document?
1.4 Updating log

2 Probability
2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
2.2 Legal reasoning
2.3 Variance of a sum
2.4 Bayes rule for medical diagnosis
2.5 The Monty Hall problem (the dilemma of three doors)
2.6 Conditional independence
2.7 Pairwise independence does not imply mutual independence
2.8 Conditional independence iff joint factorizes
2.9 Conditional independence*
2.10 Deriving the inverse gamma density
2.11 Normalization constant for a 1D Gaussian
2.12 Expressing mutual information in terms of entropies
2.13 Mutual information for correlated normals
2.14 A measure of correlation
2.15 MLE minimizes KL divergence to the empirical distribution
2.16 Mean, mode, variance for the beta distribution
2.17 Expected value of the minimum

3 Generative models for discrete data
3.1 MLE for the Bernoulli/binomial model
3.2 Marginal likelihood for the Beta-Bernoulli model
3.3 Posterior predictive for Beta-Binomial model
3.4 Beta updating from censored likelihood
3.5 Uninformative prior for log-odds ratio
3.6 MLE for the Poisson distribution
3.7 Bayesian analysis of the Poisson distribution
3.8 MLE for the uniform distribution
3.9 Bayesian analysis of the uniform distribution
3.10 Taxicab problem*
3.11 Bayesian analysis of the exponential distribution
3.12 MAP estimation for the Bernoulli with non-conjugate priors*
3.13 Posterior predictive distribution for a batch of data with the Dirichlet-multinomial model
3.14 Posterior predictive for Dirichlet-multinomial*
3.15 Setting the hyper-parameters I*
3.16 Setting the beta hyper-parameters II
3.17 Marginal likelihood for beta-binomial under uniform prior
3.18 Bayes factor for coin tossing*
3.19 Irrelevant features with naive Bayes
3.20 Class conditional densities for binary data
3.21 Mutual information for naive Bayes classifiers with binary features
3.22 Fitting a naive Bayesian spam filter by hand*

4 Gaussian models
4.1 Uncorrelated does not imply independent
4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian
4.3 Correlation coefficient is between -1 and 1
4.4 Correlation coefficient for linearly related variables is 1 or -1
4.5 Normalization constant for a multidimensional Gaussian
4.6 Bivariate Gaussian
4.7 Conditioning a bivariate Gaussian
4.8 Whitening vs standardizing
4.9 Sensor fusion with known variances in 1d
4.10 Derivation of information form formulae for marginalizing and conditioning
4.11 Derivation of the NIW posterior
4.12
4.13 Gaussian posterior credible interval
4.14 MAP estimation for 1d Gaussians
4.15 Sequential (recursive) updating of covariance matrix
4.16 Likelihood ratio for Gaussians
4.17 LDA/QDA on height/weight data
4.18 Naive Bayes with mixed features
4.19 Decision boundary for LDA with semi tied covariances
4.20 Logistic regression vs LDA/QDA
4.21 Gaussian decision boundaries
4.22 QDA with 3 classes
4.23 Scalar QDA

5 Bayesian statistics
5.1 Proof that a mixture of conjugate priors is indeed conjugate
5.2 Optimal threshold on classification probability
5.3 Reject option in classifiers
5.4 More reject options
5.5 Newsvendor problem
5.6 Bayes factors and ROC curves
5.7 Bayes model averaging helps predictive accuracy
5.8 MLE and model selection for a 2d discrete distribution
5.9 Posterior median is optimal estimate under L1 loss
5.10 Decision rule for trading off FPs and FNs

6 Frequentist statistics

7 Linear regression
7.1 Behavior of training set error with increasing sample size
7.2 Multi-output linear regression
7.3 Centering and ridge regression
7.4 MLE for σ² for linear regression
7.5 MLE for the offset term in linear regression
7.6 MLE for simple linear regression
7.7 Sufficient statistics for online linear regression
7.8 Bayesian linear regression in 1d with known σ²
7.9 Generative model for linear regression
7.10 Bayesian linear regression using the g-prior

8 Logistic regression
8.1 Spam classification using logistic regression
8.2 Spam classification using naive Bayes
8.3 Gradient and Hessian of log-likelihood for logistic regression
8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression
8.5 Symmetric version of l2 regularized multinomial logistic regression
8.6 Elementary properties of l2 regularized logistic regression
8.7 Regularizing separate terms in 2d logistic regression

9 Generalized linear models and the exponential family
9.1 Conjugate prior for univariate Gaussian in exponential family form
9.2 The MVN is in the exponential family

10 Directed graphical models (Bayes nets)

11 Mixture models and the EM algorithm
11.1 Student T as infinite mixture of Gaussian
11.2 EM for mixture of Gaussians
11.3 EM for mixtures of Bernoullis
11.4 EM for mixture of Student distributions
11.5 Gradient descent for fitting GMM
11.6 EM for a finite scale mixture of Gaussians
11.7 Manual calculation of the M step for a GMM
11.8 Moments of a mixture of Gaussians
11.9 K-means clustering by hand
11.10 Deriving the K-means cost function
11.11 Visible mixtures of Gaussians are in exponential family
11.12 EM for robust linear regression with a Student t likelihood
11.13 EM for EB estimation of Gaussian shrinkage model
11.14 EM for censored linear regression
11.15 Posterior mean and variance of a truncated Gaussian

12 Latent linear models
12.1 M-step for FA
12.2 MAP estimation for the FA model
12.3 Heuristic for assessing applicability of PCA*
12.4 Deriving the second principal component
12.5 Deriving the residual error for PCA
12.6 Derivation of Fisher's linear discriminant
12.7 PCA via successive deflation
12.8 Latent semantic indexing
12.9 Imputation in a FA model*
12.10 Efficiently evaluating the PPCA density
12.11 PPCA vs FA

13 Sparse linear models
13.1 Partial derivative of the RSS
13.2 Derivation of M-step for EB for linear regression
13.3 Derivation of fixed point updates for EB for linear regression*
13.4 Marginal likelihood for linear regression*
13.5 Reducing elastic net to lasso
13.6 Shrinkage in linear regression
13.7 Prior for the Bernoulli rate parameter in the spike and slab model
13.8 Deriving E step for GSM prior
13.9 EM for sparse probit regression with Laplace prior
13.10 GSM representation of group lasso*
13.11 Projected gradient descent for l1 regularized least squares
13.12 Subderivative of the hinge loss function
13.13 Lower bounds to convex functions

14 Kernels

15 Gaussian processes
15.1 Reproducing property

16 Adaptive basis function models
16.1 Nonlinear regression for inverse dynamics

17 Markov and hidden Markov models
17.1 Derivation of Q function for HMM
17.2 Two filter approach to smoothing in HMMs
17.3 EM for HMMs with mixture of Gaussian observations
17.4 EM for HMMs with tied mixtures

18 State space models
18.1 Derivation of EM for LG-SSM
18.2 Seasonal LG-SSM model in standard form

19 Undirected graphical models (Markov random fields)
19.1 Derivation of the log partition function
19.2 CI properties of Gaussian graphical models
19.3 Independencies in Gaussian graphical models
19.4 Cost of training MRFs and CRFs
19.5 Full conditional in an Ising model

20 Exact inference for graphical models
20.1 Variable elimination
20.2 Gaussian times Gaussian is Gaussian
20.3 Message passing on a tree
20.4 Inference in 2D lattice MRFs

21 Variational inference
21.1 Laplace approximation to p(µ, log σ|D) for a univariate Gaussian
21.2 Laplace approximation to normal-gamma
21.3 Variational lower bound for VB for univariate Gaussian
21.4 Variational lower bound for VB for GMMs
21.5 Derivation of E[log πk]
21.6 Alternative derivation of the mean field updates for the Ising model
21.7 Forwards vs reverse KL divergence
21.8 Derivation of the structured mean field updates for FHMM
21.9 Variational EM for binary FA with sigmoid link
21.10 VB for binary FA with probit link

1 Introduction
1.1 Constitution of this document
Here we should have presented the solutions to the problems in Chapter One of Machine Learning: A Probabilistic Perspective (MLAPP). Since the number of problems in that chapter is zero, we use this section as an introduction to this document, i.e. a solution manual.
This document provides detailed solutions to almost all problems of the textbook MLAPP, from Chapter One to Chapter Fourteen (Chinese version) / Twenty-one (English version). We generally leave the restatement of the problems to the readers themselves.
There are two classes of problems in MLAPP: theoretical derivations and practical projects. We provide solutions to most derivation problems, apart from those which are nothing but straightforward algebra (and a few which we failed to solve). Practical problems, which are based on a MATLAB toolbox, are beyond the scope of this document.

1.2 On Machine Learning: A Probabilistic Perspective


Booming studies and literature have made the boundary of "machine learning" vague.
On the one hand, the rapid development of AI technology has kept society amazed, which has also resulted in a sharp increase in the number of students who try to take related courses in college. On the other hand, some scholars remain uncertain about learning-related theories, especially deep learning.
The extraordinary achievements of machine learning in recent years often make one forget that this discipline has undergone a long evolution, whose establishment dates back at least to the studies of "electronic brains" in the 1940s. Be that as it may, machine learning has not been formulated as a "closed" theory. Even in some research communities, machine learning is crowned metaphysics or alchemy. Personally, I believe that being called metaphysics is a common experience shared by many branches of theory that are undergoing rapid development.
To become a complete theory, machine learning is still looking for a way to settle itself into a closed system. The most successful attempt so far has been the one based on probability. As commented by David Blei from Princeton on the back of MLAPP: "In Machine Learning, the language of probability and statistics reveals important connections between seemingly disparate algorithms and strategies. Thus, its readers will become articulate in a holistic view of the state of the art and poised to build the next generation of machine learning algorithms."
The crucial idea in MLAPP is that machine learning is tantamount to Bayesian statistics, which draws connections between numerous "independent" algorithms. But the history of Bayesian statistics (which can be traced back to the days of Laplace) is much longer than that of machine learning. MLAPP is not novel in holding such an idea; C. M. Bishop's Pattern Recognition and Machine Learning is another typical example. Both are considered classical textbooks on elementary machine learning.
In general, MLAPP reduces the difficulty of the book at the expense of some deductive completeness (for the first seven chapters). It covers a wider range of models and is suitable for those with a background in mathematical tools. The chapters concerning classical probabilistic models (e.g. chapters 2, 3, 4, 5, 7, 8, 11, 12) are comparable to PRML, but due to the reordering and additional details, they are worth a read even for one who has finished reading PRML.

1.3 What is this document?


The motivation for writing this document is that I needed to read the textbook MLAPP after enrolling in a machine learning course, but I failed to find any freely available compiled solution manual. Although several GitHub projects have started working on one, progress has been slow. I also wanted to focus more on the theoretical part of the text rather than the implementation code.
Hence I began working on this document. It was completed (first version, Chapter One to Chapter Fourteen) within the two weeks before the official semester. Because of the hurried process, it is suggested that readers read from a critical perspective and do not believe everything I have written down without checking. In the end, I hope that readers can provide comments and suggest revisions. Apart from correcting wrong answers, those who are good at MATLAB or LaTeX typesetting, or who are willing to participate in improving the document, are always welcome to contact me.

22/10/2017
Fangqi Li
Munich, Germany
[email protected]
[email protected]

1.4 Updating log


22/10/2017(First Chinese compilation)
02/03/2018(English compilation)

2 Probability
2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
Denote the two children by A and B, and define the events

E_1: A = \text{boy}, B = \text{girl}
E_2: B = \text{boy}, A = \text{girl}
E_3: A = \text{boy}, B = \text{boy}

In question a:

P(E_1) = P(E_2) = P(E_3) = \frac{1}{4}

P(\text{one girl} \mid \text{one boy}) = \frac{P(E_1) + P(E_2)}{P(E_1) + P(E_2) + P(E_3)} = \frac{2}{3}

For question b, w.l.o.g. assume child A is the one that is seen:

P(B = \text{girl} \mid A = \text{boy}) = \frac{1}{2}

2.2 Legal reasoning


Let E_1 and E_2 denote the events "the defendant committed the crime" and "the defendant has the special blood type" respectively. Thus:

p(E_1|E_2) = \frac{p(E_1, E_2)}{p(E_2)} = \frac{p(E_2|E_1)p(E_1)}{p(E_2)} = \frac{1 \cdot \frac{1}{800000}}{\frac{1}{100}} = \frac{1}{8000}

2.3 Variance of a sum

Calculate this straightforwardly:

var[X + Y] = E[(X + Y)^2] - E^2[X + Y]
= E[X^2] - E^2[X] + E[Y^2] - E^2[Y] + 2E[XY] - 2E[X]E[Y]
= var[X] + var[Y] + 2\,cov[X, Y]

2.4 Bayes rule for medical diagnosis


Applying Bayes' rule:

P(\text{ill}|\text{positive}) = \frac{P(\text{ill})P(\text{positive}|\text{ill})}{P(\text{ill})P(\text{positive}|\text{ill}) + P(\text{healthy})P(\text{positive}|\text{healthy})} = 0.0098

2.5 The Monty Hall problem (the dilemma of three doors)

The answer is b. Applying Bayes' rule:

P(\text{prize}_1|\text{choose}_1, \text{open}_3) = \frac{P(\text{choose}_1)P(\text{prize}_1)P(\text{open}_3|\text{prize}_1, \text{choose}_1)}{P(\text{choose}_1)P(\text{open}_3|\text{choose}_1)}
= \frac{P(\text{prize}_1)P(\text{open}_3|\text{prize}_1, \text{choose}_1)}{P(\text{open}_3|\text{choose}_1)}
= \frac{\frac{1}{3}\cdot\frac{1}{2}}{\frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot 0 + \frac{1}{3}\cdot 1} = \frac{1}{3}

In the last step we sum over the possible locations of the prize, so switching doors wins with probability 2/3.
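As a sanity check on this result, here is a small Monte Carlo simulation (a sketch, not part of the original solution; the function name and trial count are arbitrary choices):

```python
import random

def simulate(trials=100_000):
    """Estimate P(win | stick) and P(win | switch) for the Monty Hall game."""
    stick_wins = switch_wins = 0
    for _ in range(trials):
        prize = random.randrange(3)           # door hiding the prize
        choice = random.randrange(3)          # contestant's first pick
        # the host opens a door that is neither the chosen one nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        switched = next(d for d in range(3) if d != choice and d != opened)
        stick_wins += (choice == prize)
        switch_wins += (switched == prize)
    return stick_wins / trials, switch_wins / trials

print(simulate())   # roughly (0.333, 0.667), matching the 1/3 vs 2/3 analysis
```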

2.6 Conditional Independence


In question a, we have:

p(H|e_1, e_2) = \frac{p(H)p(e_1, e_2|H)}{p(e_1, e_2)}

Thus the answer is (ii).

For question b, we have the further decomposition:

p(H|e_1, e_2) = \frac{p(H)p(e_1|H)p(e_2|H)}{p(e_1, e_2)}

so both (i) and (ii) are obviously sufficient. Since

p(e_1, e_2) = \sum_H p(e_1, e_2, H) = \sum_H p(H)p(e_1|H)p(e_2|H),

(iii) is sufficient as well, because p(e_1, e_2) can be computed from it.

2.7 Pairwise independence does not imply mutual independence

Assume three boolean variables x_1, x_2, x_3, where x_1 and x_2 independently take the values 0 or 1 with equal probability and x_3 = XOR(x_1, x_2).
It is easy to prove that x_3 is independent of x_1 and of x_2, but given both x_1 and x_2 the value of x_3 is determined, and thereby mutual independence fails.

2.8 Conditional independence iff joint factorizes

We prove that 2.129 is equivalent to 2.130.
By denoting

g(x, z) = p(x|z), \qquad h(y, z) = p(y|z)

we have the first half of the proof.

Secondly, we have:

p(x|z) = \sum_y p(x, y|z) = \sum_y g(x, z)h(y, z) = g(x, z)\sum_y h(y, z)

p(y|z) = h(y, z)\sum_x g(x, z)

And:

1 = \sum_{x,y} p(x, y|z) = \Big(\sum_x g(x, z)\Big)\Big(\sum_y h(y, z)\Big)

Thus:

p(x|z)p(y|z) = g(x, z)h(y, z)\Big(\sum_x g(x, z)\Big)\Big(\sum_y h(y, z)\Big) = g(x, z)h(y, z) = p(x, y|z)

2.9 Conditional independence*


From a graphical-model point of view, both arguments appear correct. But in the general case neither of them admits the required decomposition form, thus both are false.

2.10 Deriving the inverse gamma density


According to the change of variables formula (with x = 1/y):

p(y) = p(x)\left|\frac{dx}{dy}\right|

we easily have:

IG(y) = Ga(x)\cdot y^{-2} = \frac{b^a}{\Gamma(a)}\Big(\frac{1}{y}\Big)^{(a-1)+2} e^{-b/y} = \frac{b^a}{\Gamma(a)}\, y^{-(a+1)} e^{-b/y}

2.11 Normalization constant for a 1D Gaussian


This proof should be found around any textbook about multivariable
calculus.Omitted here.
2 PROBABILITY 16

2.12 Expressing mutual information in terms of entropies

I(X; Y) = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}
= \sum_{x,y} p(x, y)\log\frac{p(x|y)}{p(x)}
= \sum_{x,y} p(x, y)\log p(x|y) - \sum_x\Big(\sum_y p(x, y)\Big)\log p(x)
= -H(X|Y) + H(X)

Swapping X and Y yields the other identity.

2.13 Mutual information for correlated normals


We have:

I(X_1; X_2) = H(X_1) - H(X_1|X_2) = H(X_1) + H(X_2) - H(X_1, X_2)
= \frac{1}{2}\log(2\pi e\sigma^2) + \frac{1}{2}\log(2\pi e\sigma^2) - \frac{1}{2}\log\big((2\pi e)^2\sigma^4(1-\rho^2)\big)
= -\frac{1}{2}\log(1-\rho^2)

(refer to Elements of Information Theory, Example 8.5.1)

2.14 A measure of correlation


In question a:
H(Y |X) H(X) − H(Y |X)
r =1 − =
H(X) H(X)
H(Y ) − H(Y |X)
=
H(X)
I(X; Y )
=
H(X)
We have 0 ≤ r ≤ 1 in question b for I(X; Y ) > 0 and H(X|Y ) <
H(X)(properties of entropy).
r = 0 iff X and Y are independent.
r = 1 iff X is determined(not necassary equal) by Y .
2 PROBABILITY 17

2.15 MLE minimizes KL divergence to the empirical distribution

Expand the KL divergence:

\hat\theta = \arg\min_\theta KL(p_{emp}\,\|\,q(\cdot;\theta))
= \arg\min_\theta E_{p_{emp}}\Big[\log\frac{p_{emp}(x)}{q(x;\theta)}\Big]
= \arg\min_\theta \Big\{-H(p_{emp}) - \frac{1}{N}\sum_{x\in\mathcal{D}}\log q(x;\theta)\Big\}
= \arg\max_\theta \sum_{x\in\mathcal{D}}\log q(x;\theta)

In the third step the expectation under the empirical distribution reduces to an average over the dataset, and in the last step we drop the entropy of the empirical distribution because it does not depend on θ.

2.16 Mean, mode, variance for the beta distribution


Firstly, derive the mode of the beta distribution by differentiating the pdf:

\frac{d}{dx}\,x^{a-1}(1-x)^{b-1} = \big[(1-x)(a-1) - (b-1)x\big]\,x^{a-2}(1-x)^{b-2}

Setting this to zero yields:

mode = \frac{a-1}{a+b-2}

Secondly, derive the moments of the beta distribution:

E[x^N] = \frac{1}{B(a,b)}\int x^{a+N-1}(1-x)^{b-1}dx = \frac{B(a+N,b)}{B(a,b)} = \frac{\Gamma(a+N)\Gamma(b)}{\Gamma(a+N+b)}\cdot\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}

Setting N = 1, 2:

E[x] = \frac{a}{a+b}, \qquad E[x^2] = \frac{a(a+1)}{(a+b)(a+b+1)}

where we have used the property of the Gamma function. Straightforward algebra gives:

mean = E[x] = \frac{a}{a+b}
variance = E[x^2] - E^2[x] = \frac{ab}{(a+b)^2(a+b+1)}
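A quick numerical cross-check of these formulas (a sketch; the values a = 2, b = 5 are arbitrary) against scipy:

```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 5.0
mean = a / (a + b)
mode = (a - 1) / (a + b - 2)
var = a * b / ((a + b) ** 2 * (a + b + 1))

print(mean, beta.mean(a, b))   # both 0.2857...
print(var, beta.var(a, b))     # both 0.02551...
# the mode can be checked by evaluating the pdf on a fine grid
xs = np.linspace(0.0, 1.0, 100001)
print(mode, xs[np.argmax(beta.pdf(xs, a, b))])   # both close to 0.2
```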

2.17 Expected value of the minimum


Let m denote the location of the leftmost point. We have:

p(m > x) = p(X > x \text{ and } Y > x) = p(X > x)p(Y > x) = (1-x)^2

Therefore:

E[m] = \int x\,p(m = x)\,dx = \int_0^1 p(m > x)\,dx = \int_0^1 (1-x)^2\,dx = \frac{1}{3}

3 Generative models for discrete data


3.1 MLE for the Bernoulli/binomial model
Likelihood:

p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}

Log-likelihood:

\ln p(D|\theta) = N_1\ln\theta + N_0\ln(1-\theta)

Setting the derivative to zero:

\frac{\partial}{\partial\theta}\ln p(D|\theta) = \frac{N_1}{\theta} - \frac{N_0}{1-\theta} = 0

This ends in 3.22:

\theta = \frac{N_1}{N_1+N_0} = \frac{N_1}{N}

3.2 Marginal likelihood for the Beta-Bernoulli model


Likelihood:

p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}

Prior distribution:

p(\theta|a, b) = Beta(\theta|a, b) \propto \theta^{a-1}(1-\theta)^{b-1}

Posterior distribution:

p(\theta|D) \propto p(D|\theta)\cdot p(\theta|a, b) = \theta^{N_1+a-1}(1-\theta)^{N_0+b-1} \propto Beta(\theta|N_1+a, N_0+b)

Prediction:

p(x_{new}=1|D) = \int p(x_{new}=1|\theta)\,p(\theta|D)\,d\theta = \int\theta\,p(\theta|D)\,d\theta = E[\theta] = \frac{N_1+a}{N_1+a+N_0+b}

Calculate p(D) where D = 1, 0, 0, 1, 1:

p(D) = p(x_1)p(x_2|x_1)p(x_3|x_2, x_1)\cdots p(x_N|x_{N-1},\ldots,x_1)
= \frac{a}{a+b}\cdot\frac{b}{a+b+1}\cdot\frac{b+1}{a+b+2}\cdot\frac{a+1}{a+b+3}\cdot\frac{a+2}{a+b+4}

Denote \alpha = a+b, \alpha_1 = a, \alpha_0 = b; we have 3.83. To derive 3.80, we make use of:

\alpha_1(\alpha_1+1)\cdots(\alpha_1+N_1-1) = \frac{(\alpha_1+N_1-1)!}{(\alpha_1-1)!} = \frac{\Gamma(\alpha_1+N_1)}{\Gamma(\alpha_1)}
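The chain-rule computation of p(D) above can be checked against the closed-form Beta-function expression; a sketch (the choice a = b = 1 and the helper names are arbitrary):

```python
from math import gamma

def marginal_likelihood_sequential(data, a, b):
    """p(D) via the chain rule p(x1)p(x2|x1)..., updating the Beta counts as we go."""
    prob, n1, n0 = 1.0, 0, 0
    for x in data:
        p1 = (a + n1) / (a + b + n1 + n0)
        prob *= p1 if x == 1 else (1.0 - p1)
        n1 += x
        n0 += 1 - x
    return prob

def marginal_likelihood_closed_form(data, a, b):
    """p(D) = B(a + N1, b + N0) / B(a, b)."""
    B = lambda x, y: gamma(x) * gamma(y) / gamma(x + y)
    n1 = sum(data); n0 = len(data) - n1
    return B(a + n1, b + n0) / B(a, b)

D = [1, 0, 0, 1, 1]
print(marginal_likelihood_sequential(D, 1, 1))   # 1/60 = 0.01666...
print(marginal_likelihood_closed_form(D, 1, 1))  # same value
```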

3.3 Posterior predictive for Beta-Binomial model


Straightforward algebra:

Bb(1|\alpha_1', \alpha_0', 1) = \frac{B(\alpha_1'+1, \alpha_0')}{B(\alpha_1', \alpha_0')} = \frac{\Gamma(\alpha_0'+\alpha_1')}{\Gamma(\alpha_0'+\alpha_1'+1)}\cdot\frac{\Gamma(\alpha_1'+1)}{\Gamma(\alpha_1')} = \frac{\alpha_1'}{\alpha_1'+\alpha_0'}

3.4 Beta updating from censored likelihood


The derivation is straightforward:

p(θ, X < 3) =p(θ)p(X < 3|θ)


=p(θ)(p(X = 1|θ) + p(X = 2|θ))
=Beta(θ|1, 1)(Bin(1|5, θ) + Bin(2|5, θ))

3.5 Uninformative prior for log-odds ratio


Since

\phi = \log\frac{\theta}{1-\theta},

by the change of variables formula:

p(\theta) = p(\phi)\left|\frac{d\phi}{d\theta}\right| \propto \frac{1}{\theta(1-\theta)} \propto Beta(\theta|0, 0)

3.6 MLE for the Poisson distribution


Likelihood:

p(D|\lambda) = \prod_{n=1}^N Poi(x_n|\lambda) = \exp(-\lambda N)\cdot\lambda^{\sum_{n=1}^N x_n}\cdot\frac{1}{\prod_{n=1}^N x_n!}

Setting the derivative of the likelihood to zero:

\frac{\partial}{\partial\lambda}\, p(D|\lambda) \propto \exp(-\lambda N)\,\lambda^{\sum_n x_n - 1}\Big(-N\lambda + \sum_{n=1}^N x_n\Big) = 0

Thus:

\lambda = \frac{\sum_{n=1}^N x_n}{N}

3.7 Bayesian analysis of the Poisson distribution


We have:

p(\lambda|D) \propto p(\lambda)p(D|\lambda) \propto \exp(-\lambda(N+b))\cdot\lambda^{\sum_{n=1}^N x_n + a - 1} = Ga\Big(\lambda\,\Big|\,a+\sum_n x_n,\ N+b\Big)

This prior corresponds to introducing b prior observations with mean a/b.

3.8 MLE for the uniform distribution

The likelihood goes to zero if a < max(x_n), so we must have \hat a \ge \max(x_n). The likelihood looks like:

p(D|a) = \prod_{n=1}^N \frac{1}{2a}

which is decreasing in a, so:

\hat a = \max\{x_i\}_{i=1}^n

This model assigns p(x_{n+1}) = 0 if x_{n+1} > \max\{x_i\}_{i=1}^n, which gives a "hard" boundary to the distribution.

3.9 Bayesian analysis of the uniform distribution


The conjugate prior for uniform distribution if Pareto distribution:

p(θ) = P a(θ|K, b) = KbK θ−(K+1) [θ ≥ b]


n
Let m = max {xi }i=1 , the joint distribution is:

p(θ, D) = p(θ)p(D|θ) = KbK θ−(K+N +1) [θ ≥ b][θ ≥ m]

And the evidence is:


KbK
Z
p(D) = p(D, θ)dθ =
(N + K) max(m, b)N +K
Let µ = max {m, b}, the posterior distribution is again the form of a
Parato distribution:
p(θ, D) (N + K)µN +K [θ ≥ µ]
p(θ|D) = = = P a(θ|N + K, µ)
p(D) θN +K+1

3.10 Taxicab problem*


We skip this straightforward numerical task.

3.11 Bayesian analysis of the exponential distribution


The log-likelihood for an exponential distribution is:

\ln p(D|\theta) = N\ln\theta - \theta\sum_{n=1}^N x_n

The derivative is:

\frac{\partial}{\partial\theta}\ln p(D|\theta) = \frac{N}{\theta} - \sum_{n=1}^N x_n

Thus in question a:

\theta_{ML} = \frac{N}{\sum_{n=1}^N x_n}

We skip the other questions and state that the conjugate prior for the exponential distribution is the Gamma distribution:

p(\theta|D) \propto p(\theta)p(D|\theta) = Ga(\theta|a, b)\,p(D|\theta) \propto Ga\Big(\theta\,\Big|\,N+a,\ b+\sum_n x_n\Big)

A Gamma prior introduces a - 1 prior observations with sum b.

3.12 MAP estimation for the Bernoulli with non-conjugate priors*

We skip this straightforward numerical task.

3.13 Posterior predictive distribution for a batch of data with the Dirichlet-multinomial model

Since we already have 3.51:

p(X=j|D, \alpha) = \frac{\alpha_j + N_j}{\alpha_0 + N}

we can easily derive:

p(\tilde D|D, \alpha) = \prod_{x\in\tilde D} p(x|D, \alpha) = \prod_{j=1}^C\Big(\frac{\alpha_j + N_j^{old}}{\alpha_0 + N^{old}}\Big)^{N_j^{new}}

3.14 Posterior predictive for Dirichlet-multinomial*


We skip this straightforward numerical task.

3.15 Setting the hyper-parameters I*


We skip this straightforward numerical task.

3.16 Setting the beta hyper-parameters II


The parameters \alpha_1 and \alpha_2 of a Beta distribution are connected through:

\alpha_2 = \alpha_1\Big(\frac{1}{m} - 1\Big) = f(\alpha_1)

Calculate this integral:

u(\alpha_1) = \int_l^u \frac{1}{B(\alpha_1, f(\alpha_1))}\,\theta^{\alpha_1-1}(1-\theta)^{f(\alpha_1)-1}\,d\theta

Setting this integral u(\alpha_1) to 0.95 by altering \alpha_1 with a numerical method will do.

3.17 Marginal likelihood for beta-binomial under uniform prior

The marginal likelihood is given by:

p(N_1|N) = \int_0^1 p(N_1, \theta|N)\,d\theta = \int_0^1 p(N_1|\theta, N)p(\theta)\,d\theta

We already have:

p(N_1|\theta, N) = Bin(N_1|\theta, N), \qquad p(\theta) = Beta(\theta|1, 1)

Thus:

p(N_1|N) = \int_0^1\binom{N}{N_1}\theta^{N_1}(1-\theta)^{N-N_1}\,d\theta = \binom{N}{N_1}B(N_1+1, N-N_1+1)
= \frac{N!}{N_1!(N-N_1)!}\cdot\frac{N_1!(N-N_1)!}{(N+1)!} = \frac{1}{N+1}

where B is the normalizer of a Beta distribution:

B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

3.18 Bayes factor for coin tossing*


Straightforward calculation.

3.19 Irrelevant features with naive Bayes


The log-likelihood is given by:

\log p(x_i|c, \theta) = \sum_{w=1}^W x_{iw}\log\frac{\theta_{cw}}{1-\theta_{cw}} + \sum_{w=1}^W\log(1-\theta_{cw})

In a succinct way:

\log p(x_i|c, \theta) = \phi(x_i)^T\beta_c

where:

\phi(x_i) = (x_i, 1)^T
\beta_c = \Big(\log\frac{\theta_{c1}}{1-\theta_{c1}}, \ldots, \log\frac{\theta_{cW}}{1-\theta_{cW}}, \sum_{w=1}^W\log(1-\theta_{cw})\Big)^T

For question a (with equal class priors):

\log\frac{p(c=1|x_i)}{p(c=2|x_i)} = \log\frac{p(c=1)p(x_i|c=1)}{p(c=2)p(x_i|c=2)} = \log\frac{p(x_i|c=1)}{p(x_i|c=2)} = \phi(x_i)^T(\beta_1-\beta_2)

For question b, in a binary context:

p(c=1|x_i) = \frac{p(c=1)p(x_i|c=1)}{p(x_i)}

Thus:

\log\frac{p(c=1|x_i)}{p(c=2|x_i)} = \log\frac{p(c=1)}{p(c=2)} + \phi(x_i)^T(\beta_1-\beta_2)

A word w will not affect this posterior measure as long as:

x_{iw}(\beta_{1,w}-\beta_{2,w}) = 0

Hence:

\theta_{c=1,w} = \theta_{c=2,w}

So the chance that word w appears in documents of the two classes must be equal.
In question c, we have:

\hat\theta_{1,w} = 1 - \frac{1}{2+N_1}, \qquad \hat\theta_{2,w} = 1 - \frac{1}{2+N_2}

These are not equal when N_1 \ne N_2, so the bias effect remains. However, this effect diminishes as N grows large.

3.20 Class conditional densities for binary data


In question a, we have:

p(x|y=c) = \prod_{i=1}^D p(x_i|y=c, x_1, \ldots, x_{i-1})

The number of parameters is:

C\cdot\sum_{i=1}^D 2^i = C\cdot(2^{D+1}-2) = O(C\cdot 2^D)

For questions b and c, we generally expect the naive model to fit better when N is small, because the more delicate full model has problems of overfitting; with a large N the full model catches up.
In questions d, e and f, it is assumed that looking up a value according to a D-dimensional index costs O(D) time. It is easy to work out the fitting complexity: O(ND) for the naive model and O(N\cdot 2^D) for the full model; the prediction complexity is O(CD) and O(C\cdot 2^D) respectively.
For question f:

p(y|x_v) \propto p(x_v|y) = \sum_{x_h} p(x_v, x_h|y)

Thus the complexity is multiplied by an extra constant 2^{|x_h|}.

3.21 Mutual information for naive Bayes classifiers with binary features

By definition:

I(X; Y) = \sum_{x_j}\sum_y p(x_j, y)\log\frac{p(x_j, y)}{p(x_j)p(y)}

For binary features, consider the values of x_j to be zero and one. Given \pi_c = p(y=c), \theta_{jc} = p(x_j=1|y=c), \theta_j = p(x_j=1):

I_j = \sum_c p(x_j=1, c)\log\frac{p(x_j=1, c)}{p(x_j=1)p(c)} + \sum_c p(x_j=0, c)\log\frac{p(x_j=0, c)}{p(x_j=0)p(c)}
= \sum_c\Big[\pi_c\theta_{jc}\log\frac{\theta_{jc}}{\theta_j} + \pi_c(1-\theta_{jc})\log\frac{1-\theta_{jc}}{1-\theta_j}\Big]

which ends in 3.76.
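A small helper that evaluates this expression (a sketch; the inputs `pi` and `theta_j`, class priors and per-class Bernoulli parameters, and the example numbers are made up for illustration):

```python
import numpy as np

def mutual_info_binary_feature(pi, theta_j):
    """I(X_j; Y) in nats for a binary feature of a naive Bayes model.

    pi      : array with pi[c] = p(y = c)
    theta_j : array with theta_j[c] = p(x_j = 1 | y = c)
    """
    pi = np.asarray(pi, dtype=float)
    theta_j = np.asarray(theta_j, dtype=float)
    theta = np.sum(pi * theta_j)                         # p(x_j = 1)
    on = pi * theta_j * np.log(theta_j / theta)
    off = pi * (1 - theta_j) * np.log((1 - theta_j) / (1 - theta))
    return float(np.sum(on + off))

# a feature that is informative about the class
print(mutual_info_binary_feature([0.5, 0.5], [0.9, 0.1]))   # about 0.37 nats
```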

3.22 Fitting a naive Bayesian spam filter by hand*


Straightforward calculation.

4 Gaussian models
4.1 Uncorrelated does not imply independent
We first calculate the covariance of X and Y:

cov(X, Y) = \int\!\!\int (X-E(X))(Y-E(Y))\,p(X, Y)\,dX\,dY = \int_{-1}^{1} X\Big(X^2 - \frac{1}{3}\Big)\,dX = 0

The integral is zero since we are integrating an odd function over the range [-1, 1]; hence:

\rho(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)var(Y)}} = 0

4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian

The pdf of Y is:

p(Y=a) = 0.5\,p(X=a) + 0.5\,p(X=-a) = p(X=a)

The pdf of X is symmetric about 0, so Y follows the normal distribution N(0, 1).
For question b, we have:

cov(X, Y) = E(XY) - E(X)E(Y) = E_W\big(E(XY|W)\big) - 0 = 0.5\,E(X^2) + 0.5\,E(-X^2) = 0

4.3 Correlation coefficient is between -1 and 1


The statement

-1 \le \rho(X, Y) \le 1

is equivalent to |\rho(X, Y)| \le 1. Hence we are to prove:

|cov(X, Y)|^2 \le var(X)\cdot var(Y)

which follows straightforwardly from the Cauchy-Schwarz inequality.

4.4 Correlation coefficient for linearly related variables is 1 or -1

When Y = aX + b:

E(Y) = aE(X) + b, \qquad var(Y) = a^2\,var(X)

Therefore:

cov(X, Y) = E(XY) - E(X)E(Y) = aE(X^2) + bE(X) - aE^2(X) - bE(X) = a\cdot var(X)

We also have:

var(X)var(Y) = a^2\,var(X)^2

These two give:

\rho(X, Y) = \frac{a}{|a|}

4.5 Normalization constant for a multidimensional Gaussian

This can be obtained by applying the method mentioned in the problem straightforwardly, hence omitted.

4.6 Bivariate Gaussian


Straightforward algebra.

4.7 Conditioning a bivariate Gaussian


Answers are obtained by plugging figures in 4.69 straightforwardly.

4.8 Whitening vs standardizing


Practise by yourself.

4.9 Sensor fusion with known variances in 1d


Denote the two observed datasets by Y^{(1)} and Y^{(2)}, with sizes N_1, N_2. The likelihood is:

p(Y^{(1)}, Y^{(2)}|\mu) = \prod_{n_1=1}^{N_1} p(Y^{(1)}_{n_1}|\mu)\ \prod_{n_2=1}^{N_2} p(Y^{(2)}_{n_2}|\mu) \propto \exp\{A\mu^2 + B\mu\}

where we have used:

A = -\frac{N_1}{2v_1} - \frac{N_2}{2v_2}, \qquad B = \frac{1}{v_1}\sum_{n_1=1}^{N_1} Y^{(1)}_{n_1} + \frac{1}{v_2}\sum_{n_2=1}^{N_2} Y^{(2)}_{n_2}

Differentiating the log-likelihood and setting it to zero, we have:

\mu_{ML} = -\frac{B}{2A}

The conjugate prior of this model must have a form proportional to \exp\{a\mu^2 + b\mu\}, namely a normal distribution:

p(\mu|a, b) \propto \exp\{a\mu^2 + b\mu\}

The posterior distribution is:

p(\mu|Y) \propto \exp\{(A+a)\mu^2 + (B+b)\mu\}

Hence we have the MAP estimate:

\mu_{MAP} = -\frac{B+b}{2(A+a)}

It is noticeable that the MAP estimate converges to the ML estimate as the number of observations grows:

\mu_{MAP}\to\mu_{ML}

The posterior distribution is another normal distribution, with:

\sigma^2_{MAP} = -\frac{1}{2(A+a)}
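The resulting precision-weighted fusion is simple to compute; a sketch (the sample data and the prior are invented for illustration, and the function name is arbitrary):

```python
import numpy as np

def fuse(y1, v1, y2, v2, prior_mean=None, prior_var=None):
    """Posterior mean/variance of mu from two sensors with known noise variances."""
    A = -len(y1) / (2 * v1) - len(y2) / (2 * v2)
    B = np.sum(y1) / v1 + np.sum(y2) / v2
    if prior_mean is not None:            # conjugate Gaussian prior N(prior_mean, prior_var)
        A += -1.0 / (2 * prior_var)
        B += prior_mean / prior_var
    return -B / (2 * A), -1.0 / (2 * A)   # (mu_MAP, sigma^2_post); without a prior this is the MLE

y1 = np.array([1.1, 0.9, 1.2])            # sensor 1, noise variance 0.01
y2 = np.array([1.5])                      # sensor 2, noise variance 1.0
print(fuse(y1, 0.01, y2, 1.0))            # dominated by the more precise sensor
```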

4.10 Derivation of information form formulae for marginalizing and conditioning

Please refer to PRML, Chapter 2.

4.11 Derivation of the NIW posterior

The likelihood for an MVN is given by:

p(X|\mu, \Sigma) = (2\pi)^{-\frac{ND}{2}}|\Sigma|^{-\frac{N}{2}}\exp\Big\{-\frac{1}{2}\sum_{n=1}^N (x_n-\mu)^T\Sigma^{-1}(x_n-\mu)\Big\}

According to 4.195:

\sum_{n=1}^N (x_n-\mu)^T\Sigma^{-1}(x_n-\mu) = \sum_{n=1}^N (\bar x-\mu+(x_n-\bar x))^T\Sigma^{-1}(\bar x-\mu+(x_n-\bar x))
= N(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu) + \sum_{n=1}^N (x_n-\bar x)^T\Sigma^{-1}(x_n-\bar x)
= N(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu) + Tr\Big\{\Sigma^{-1}\sum_{n=1}^N (x_n-\bar x)(x_n-\bar x)^T\Big\}
= N(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu) + Tr\{\Sigma^{-1}S_{\bar x}\}

The conjugate prior for the MVN parameters (\mu, \Sigma) is the Normal-inverse-Wishart (NIW) distribution, defined by:

NIW(\mu, \Sigma|m_0, k_0, v_0, S_0) = N\Big(\mu\Big|m_0, \frac{1}{k_0}\Sigma\Big)\cdot IW(\Sigma|S_0, v_0)
= \frac{1}{Z}|\Sigma|^{-\frac{v_0+D+2}{2}}\exp\Big\{-\frac{k_0}{2}(\mu-m_0)^T\Sigma^{-1}(\mu-m_0) - \frac{1}{2}Tr\{\Sigma^{-1}S_0\}\Big\}

Hence the posterior:

p(\mu, \Sigma|X) \propto |\Sigma|^{-\frac{v_X+D+2}{2}}\exp\Big\{-\frac{k_X}{2}(\mu-m_X)^T\Sigma^{-1}(\mu-m_X) - \frac{1}{2}Tr\{\Sigma^{-1}S_X\}\Big\}

where, by comparing the exponents of |\Sigma|, \mu^T\Sigma^{-1}\mu and \mu^T, we have:

k_X = k_0 + N, \qquad v_X = v_0 + N, \qquad m_X = \frac{N\bar x + k_0 m_0}{k_X}

Making use of A^T\Sigma^{-1}A = Tr\{A^T\Sigma^{-1}A\} = Tr\{\Sigma^{-1}AA^T\} and comparing the constant term inside the exponential:

N\bar x\bar x^T + S_{\bar x} + k_0 m_0 m_0^T + S_0 = k_X m_X m_X^T + S_X

Hence:

S_X = N\bar x\bar x^T + S_{\bar x} + k_0 m_0 m_0^T + S_0 - k_X m_X m_X^T

Using the definition of the mean, we end in 4.214 since:

S = \sum_{n=1}^N x_n x_n^T = S_{\bar x} + N\bar x\bar x^T

Hence the posterior distribution for the MVN takes the form NIW(m_X, k_X, v_X, S_X).

4.12
Straightforward calculation.

4.13 Gaussian posterior credible interval

Assume a prior distribution for a 1d normal distribution:

p(\mu) = N(\mu|\mu_0, \sigma_0^2 = 9)

and the observed variable follows:

p(x) = N(x|\mu, \sigma^2 = 4)

Having observed n variables, we require that at least 0.95 of the probability mass of µ's posterior distribution lies in an interval no longer than 1.
The posterior for µ is:

p(\mu|D) \propto p(\mu)p(D|\mu) = N(\mu|\mu_0, \sigma_0^2)\prod_{i=1}^n N(x_i|\mu, \sigma^2)
\propto \exp\Big\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\Big\}\prod_{i=1}^n\exp\Big\{-\frac{1}{2\sigma^2}(x_i-\mu)^2\Big\}
= \exp\Big\{\Big(-\frac{1}{2\sigma_0^2}-\frac{n}{2\sigma^2}\Big)\mu^2 + \ldots\Big\}

Hence the posterior variance is given by:

\sigma^2_{post} = \frac{\sigma_0^2\sigma^2}{\sigma^2 + n\sigma_0^2}

Since 0.95 of the probability mass of a normal distribution lies within ±1.96σ of its mean, we need 2\cdot 1.96\,\sigma_{post}\le 1, which gives:

n \ge 62
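The required sample size can also be found numerically; a sketch that simply scans n until the 95% central interval is short enough:

```python
from scipy.stats import norm

sigma2, sigma0_2 = 4.0, 9.0
z = norm.ppf(0.975)                        # 1.959963...

n = 1
while True:
    post_var = 1.0 / (1.0 / sigma0_2 + n / sigma2)
    if 2 * z * post_var ** 0.5 <= 1.0:     # width of the 95% central interval
        break
    n += 1
print(n)                                   # 62
```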

4.14 MAP estimation for 1d Gaussians


Assume the variance σ² of this distribution is known, and that the mean µ follows a normal distribution with mean m and variance s². Similar to the previous question, the posterior takes the form:

p(\mu|X) \propto p(\mu)p(X|\mu)

The posterior is another normal distribution. The coefficient of µ² is:

-\frac{1}{2s^2} - \frac{N}{2\sigma^2}

and that of µ:

\frac{m}{s^2} + \frac{\sum_{n=1}^N x_n}{\sigma^2}

We obtain the posterior mean and variance by the technique of "completing the square":

\sigma^2_{post} = \frac{s^2\sigma^2}{\sigma^2 + N s^2}, \qquad \mu_{post} = \Big(\frac{m}{s^2} + \frac{\sum_{n=1}^N x_n}{\sigma^2}\Big)\cdot\sigma^2_{post}

We already know the MLE is:

\mu_{ML} = \frac{\sum_{n=1}^N x_n}{N}

As N increases, \mu_{post} converges to \mu_{ML}.
Consider the variance s². When it increases, the MAP estimate approaches the MLE; when it decreases, the MAP estimate approaches the prior mean. The prior variance quantifies our confidence in the prior guess: intuitively, the larger the prior variance, the less we trust the prior mean.

4.15 Sequential(recursive) updating of covariance matrix


Making use of:

m_{n+1} = \frac{n m_n + x_{n+1}}{n+1}

what is left is straightforward algebra.

4.16 Likelihood ratio for Gaussians


Consider a classifier for two classes whose generative distributions are two normal distributions p(x|y=C_i) = N(x|\mu_i, \Sigma_i). By Bayes' formula:

\log\frac{p(y=1|x)}{p(y=0|x)} = \log\frac{p(x|y=1)}{p(x|y=0)} + \log\frac{p(y=1)}{p(y=0)}

The first term on the right-hand side is the likelihood ratio.
With arbitrary covariance matrices:

\frac{p(x|y=1)}{p(x|y=0)} = \sqrt{\frac{|\Sigma_0|}{|\Sigma_1|}}\exp\Big\{-\frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0)\Big\}

This cannot be reduced further. However, it is noticeable that the decision boundary is a quadric surface in D-dimensional space.
When both covariance matrices are given by Σ:

\frac{p(x|y=1)}{p(x|y=0)} = \exp\Big\{x^T\Sigma^{-1}(\mu_1-\mu_0) - \frac{1}{2}Tr\big\{\Sigma^{-1}(\mu_1\mu_1^T - \mu_0\mu_0^T)\big\}\Big\}

The decision boundary becomes a hyperplane.
If we assume the covariance matrix to be diagonal, the closed form of the answer looks similar, with some matrix multiplications changed into inner products or arithmetic multiplications.

4.17 LDA/QDA on height/weight data


Practise by yourself.

4.18 Naive Bayes with mixed features


Straightforward calculation.

4.19 Decision boundary for LDA with semi tied covariances

Omitting the shared parameters ends in:

p(y=1|x) = \frac{p(y=1)p(x|y=1)}{p(y=0)p(x|y=0) + p(y=1)p(x|y=1)}

Considering a uniform prior, this can be reduced to:

\frac{p(x|y=1)}{p(x|y=0) + p(x|y=1)}
= \frac{1}{k^{\frac{D}{2}}\exp\big\{-\frac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0) + \frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\big\} + 1}
= \frac{1}{k^{\frac{D}{2}}\exp\big\{-\frac{1}{2}(1-\frac{1}{k})x^T\Sigma_0^{-1}x + x^Tu + c\big\} + 1}

where we have used:

|\Sigma_1| = |k\Sigma_0| = k^D|\Sigma_0|

The decision boundary is still a quadric surface. It reduces to a hyperplane when k = 1. As k increases, the decision boundary becomes a surface that surrounds µ_0. When k goes to infinity, the decision boundary degenerates, which reflects that everything outside it is assigned to the class whose normal distribution has infinite variance.

4.20 Logistic regression vs LDA/QDA


We give a qualitative answer based on the argument that "overfitting arises from MLE, and is positively correlated with the complexity of the model (namely the number of independent parameters in the model)".
GaussI assumes a covariance matrix proportional to the identity matrix;
GaussX makes no prior assumption on the covariance matrix;
LinLog assumes that different classes share the same covariance matrix;
QuadLog makes no prior assumption on the covariance matrix.
From the perspective of complexity:

QuadLog = GaussX > LinLog > GaussI

The training-set accuracy of the MLE follows the same order.
The argument in e is not true in general: a larger product does not necessarily imply a larger sum.

4.21 Gaussian decision boundaries


Straightforward algebra.

4.22 QDA with 3 classes


Straightforward calculation.

4.23 Scalar QDA


Practice by yourself.

5 Bayesian statistics
5.1 Proof that a mixture of conjugate priors is indeed conjugate

For 5.69 and 5.70, formally:

p(\theta|D) = \sum_k p(\theta, k|D) = \sum_k p(k|D)p(\theta|k, D)

where:

p(k|D) = \frac{p(k, D)}{p(D)} = \frac{p(k)p(D|k)}{\sum_{k'} p(k')p(D|k')}

5.2 Optimal threshold on classification probability


The posterior expected loss is given by:

\rho(\hat y|x) = \sum_y L(\hat y, y)p(y|x) = p_0 L(\hat y, 0) + p_1 L(\hat y, 1) = L(\hat y, 1) + p_0\big(L(\hat y, 0) - L(\hat y, 1)\big)

When the two classification results yield the same loss:

\hat p_0 = \frac{\lambda_{01}}{\lambda_{01} + \lambda_{10}}

Hence when p_0 \ge \hat p_0, we estimate \hat y = 0.

5.3 Reject option in classifiers


The posterior expected loss is given by:

\rho(a|x) = \sum_c L(a, c)p(c|x)

Denote the class with maximal posterior confidence by ĉ:

\hat c = \arg\max_c\{p(c|x)\}

Now we have two applicable actions: a = ĉ or a = reject.
When a = ĉ, the posterior expected loss is:

\rho_{\hat c} = (1 - p(\hat c|x))\cdot\lambda_s

When we reject, the posterior expected loss is:

\rho_{reject} = \lambda_r

Thus the condition for choosing a = ĉ instead of rejecting is:

\rho_{\hat c} \le \rho_{reject}

or:

p(\hat c|x) \ge 1 - \frac{\lambda_r}{\lambda_s}

5.4 More reject options


Straightforward calculation.

5.5 Newsvendor problem


By:

E(\pi|Q) = P\int_0^Q D f(D)\,dD - CQ\int_0^Q f(D)\,dD + (P-C)Q\int_Q^{+\infty} f(D)\,dD

we have:

\frac{\partial}{\partial Q}E(\pi|Q) = PQf(Q) - C\int_0^Q f(D)\,dD - CQf(Q) + (P-C)\int_Q^{+\infty} f(D)\,dD - (P-C)Qf(Q)

Set it to zero, making use of \int_0^Q f(D)\,dD + \int_Q^{+\infty} f(D)\,dD = 1:

F(Q^*) = \int_0^{Q^*} f(D)\,dD = \frac{P-C}{P}

5.6 Bayes factors and ROC curves


Practise by yourself.

5.7 Bayes model averaging helps predictive accuracy


Expand both sides of 5.127 and exchange the order of integration:

E[L(\Delta, p_{BMA})] = H(p_{BMA})

We also have:

E[L(\Delta, p_m)] = E_{p_{BMA}}[-\log p_m]

Subtracting the right side from the left side ends in:

-KL(p_{BMA}\,\|\,p_m) \le 0

Hence the left side is always smaller than (or equal to) the right side.

5.8 MLE and model selection for a 2d discrete distribution

The joint distribution p(x, y|\theta_1, \theta_2) is given by:

p(x=0, y=0) = (1-\theta_1)\theta_2
p(x=0, y=1) = (1-\theta_1)(1-\theta_2)
p(x=1, y=0) = \theta_1(1-\theta_2)
p(x=1, y=1) = \theta_1\theta_2

which can be summarized as:

p(x, y|\theta_1, \theta_2) = \theta_1^x(1-\theta_1)^{1-x}\,\theta_2^{I(x=y)}(1-\theta_2)^{1-I(x=y)}

The MLE is:

\theta_{ML} = \arg\max_\theta\Big(\sum_{n=1}^N \ln p(x_n, y_n|\theta)\Big)

Hence:

\theta_{ML} = \arg\max_\theta\Big(N\ln\big[(1-\theta_1)(1-\theta_2)\big] + N_x\ln\frac{\theta_1}{1-\theta_1} + N_{I(x=y)}\ln\frac{\theta_2}{1-\theta_2}\Big)

The two parameters can be estimated independently given X and Y.
We can further rewrite the joint distribution as:

p(x, y|\theta) = \theta_{x,y}

Then:

\theta_{ML} = \arg\max_\theta\Big(\sum_{x,y} N_{x,y}\ln\theta_{x,y}\Big)

The MLE can be done by using the normalization constraint (via a Lagrange multiplier).
The rest is straightforward algebra.

5.9 Posterior median is optimal estimate under L1 loss


The posterior expected loss is (where we have omitted D w.l.o.g.):

\rho(a) = \int|y-a|p(y)\,dy = \int_{-\infty}^a (a-y)p(y)\,dy + \int_a^{+\infty}(y-a)p(y)\,dy
= a\Big(\int_{-\infty}^a p(y)\,dy - \int_a^{+\infty}p(y)\,dy\Big) - \int_{-\infty}^a yp(y)\,dy + \int_a^{+\infty}yp(y)\,dy

Differentiating, we have:

\frac{\partial}{\partial a}\rho(a) = \int_{-\infty}^a p(y)\,dy - \int_a^{+\infty}p(y)\,dy + a\cdot 2p(a) - 2ap(a)

Set it to zero:

\int_{-\infty}^a p(y)\,dy = \int_a^{+\infty}p(y)\,dy = \frac{1}{2}

5.10 Decision rule for trading off FPs and FNs


Given:

L_{FN} = c\,L_{FP}

picking \hat y = 1 is optimal when

L_{FP}\,p(y=0|x) < L_{FN}\,p(y=1|x), \quad\text{i.e.}\quad \frac{p(y=1|x)}{p(y=0|x)} > \frac{1}{c}

Using:

p(y=1|x) + p(y=0|x) = 1

we get the threshold \frac{1}{1+c} on p(y=1|x).

6 Frequentist statistics
The philosophy behind this chapter is out of the scope of probabilistic ML; you should be able to find solutions to the four listed problems in any decent textbook on mathematical statistics.
GL.

7 Linear regression
7.1 Behavior of training set error with increasing sample size

When the training set is small at the beginning, the trained model is over-fitted to the current data set, so the training accuracy can be relatively high. As the training set grows, the model has to learn more general-purpose parameters, which reduces the overfitting effect and therefore lowers the training accuracy.
As pointed out in Section 7.5.4, increasing the training set is an important way of countering over-fitting besides adding a regularizer.

7.2 Multi-output linear regression


Straightforward calculation.

7.3 Centering and ridge regression


By rewriting x as (x^T, 1)^T to absorb w_0, the NLL is given by:

NLL(w) = (y - Xw)^T(y - Xw) + \lambda w^Tw

So:

\frac{\partial}{\partial w}NLL(w) = 2X^TXw - 2X^Ty + 2\lambda w

Therefore:

w = (X^TX + \lambda I)^{-1}X^Ty
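A direct transcription of this closed form (a sketch; note that, as in the derivation here, the appended bias column is also penalized, and the synthetic data are invented):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve w = (X^T X + lam I)^{-1} X^T y, with a constant column absorbing w0."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])          # append the bias column
    D = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(D), Xb.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=1.0))    # close to [1, -2, 0.5, 0.3]
```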

7.4 MLE for σ² for linear regression

Firstly, we give the likelihood:

p(D|w, \sigma^2) = p(y|w, \sigma^2, X) = \prod_{n=1}^N p(y_n|x_n, w, \sigma^2) = \prod_{n=1}^N N(y_n|w^Tx_n, \sigma^2)
= \frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(y_n - w^Tx_n)^2\Big\}

As for σ²:

\frac{\partial}{\partial\sigma^2}\log p(D|w, \sigma^2) = -\frac{N}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{n=1}^N(y_n - w^Tx_n)^2

We have:

\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^N(y_n - w^Tx_n)^2

7.5 MLE for the offset term in linear regression


NLL:

NLL(w, w_0) \propto \sum_{n=1}^N(y_n - w_0 - w^Tx_n)^2

Differentiating with respect to w_0:

\frac{\partial}{\partial w_0}NLL(w, w_0) \propto -Nw_0 + \sum_{n=1}^N(y_n - w^Tx_n)

w_{0,ML} = \frac{1}{N}\sum_{n=1}^N(y_n - w^Tx_n) = \bar y - w^T\bar x

Centering X and y:

X_c = X - \bar X, \qquad y_c = y - \bar y

The centered datasets have zero mean, thus the regression model has w_0 equal to zero, and at the same time:

w_{ML} = (X_c^TX_c)^{-1}X_c^Ty_c

7.6 MLE for simple linear regression


Using the conclusion from problem 7.5, what is left is straightforward algebra.

7.7 Sufficient statistics for online linear regression


Parts a and b can be solved according to the hints.
For c, substituting the x in the hint by y yields the conclusion.
In d we are to prove:

(n+1)C_{xy}^{(n+1)} = nC_{xy}^{(n)} + x_{n+1}y_{n+1} + n\bar x^{(n)}\bar y^{(n)} - (n+1)\bar x^{(n+1)}\bar y^{(n+1)}

Expand C_{xy} on both sides and use \bar x^{(n+1)} = \bar x^{(n)} + \frac{1}{n+1}(x_{n+1} - \bar x^{(n)}).
Problems e and f: practise by yourself.

7.8 Bayesian linear regression in 1d with known σ 2


Problem a: practise by yourself.
For b, choose the prior distribution:

p(w) \propto N(w_1|0, 1) \propto \exp\Big\{-\frac{1}{2}w_1^2\Big\}

Writing it as:

p(w) = N(w|w_0, V_0) \propto \exp\Big\{-\frac{1}{2}V_{0,11}^{-1}(w_0 - w_{00})^2 - \frac{1}{2}V_{0,22}^{-1}(w_1 - w_{01})^2 - V_{0,12}^{-1}(w_0 - w_{00})(w_1 - w_{01})\Big\}

formally we take:

w_{01} = 0, \qquad V_{0,22}^{-1} = 1, \qquad V_{0,11}^{-1} = V_{0,12}^{-1} = 0, \qquad w_{00}\ \text{arbitrary}

In problem c, we consider the posterior distribution of the parameters:

p(w|D, \sigma^2) \propto N(w|m_0, V_0)\prod_{n=1}^N N(y_n|w_0 + w_1x_n, \sigma^2)

The coefficients of w_1^2 and w_1 in the exponential are:

-\frac{1}{2} - \frac{1}{2\sigma^2}\sum_{n=1}^N x_n^2, \qquad -\frac{1}{\sigma^2}\sum_{n=1}^N x_n(w_0 - y_n)

Hence the posterior mean and variance are given by:

\sigma^2_{post} = \frac{\sigma^2}{\sigma^2 + \sum_{n=1}^N x_n^2}, \qquad E[w_1|D, \sigma^2] = \sigma^2_{post}\Big(-\frac{1}{\sigma^2}\sum_{n=1}^N x_n(w_0 - y_n)\Big)

It can be noticed that the accumulation of samples reduces the posterior variance.

7.9 Generative model for linear regression


For the sake of convenience, we consider a centered dataset (without changing symbols):

w_0 = 0, \qquad \mu_x = \mu_y = 0

By the definition of covariance:

\Sigma_{XX} = X^TX, \qquad \Sigma_{YX} = Y^TX

Using the conclusion from Section 4.3.1:

p(Y|X = x) = N(Y|\mu_{Y|X}, \Sigma_{Y|X})

where:

\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X) = Y^TX(X^TX)^{-1}X = w^TX

7.10 Bayesian linear regression using the g-prior


Recall the ridge regression model, where we have the likelihood:

p(D|w, \sigma^2) = \prod_{n=1}^N N(y_n|w^Tx_n, \sigma^2)

The prior distribution is the Normal-Inverse-Gamma distribution:

p(w, \sigma^2) = NIG(w, \sigma^2|w_0, V_0, a_0, b_0) = N(w|w_0, \sigma^2V_0)\,IG(\sigma^2|a_0, b_0)
= \frac{1}{(2\pi)^{\frac{D}{2}}|\sigma^2V_0|^{\frac{1}{2}}}\exp\Big\{-\frac{1}{2}(w - w_0)^T(\sigma^2V_0)^{-1}(w - w_0)\Big\}\cdot\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-(a_0+1)}\exp\Big\{-\frac{b_0}{\sigma^2}\Big\}
= \frac{b_0^{a_0}}{(2\pi)^{\frac{D}{2}}|V_0|^{\frac{1}{2}}\Gamma(a_0)}(\sigma^2)^{-(a_0+\frac{D}{2}+1)}\exp\Big\{-\frac{(w - w_0)^TV_0^{-1}(w - w_0) + 2b_0}{2\sigma^2}\Big\}

The posterior distribution takes the form:

p(w, \sigma^2|D) \propto p(w, \sigma^2)p(D|w, \sigma^2)
\propto (\sigma^2)^{-(a_0+\frac{D}{2}+1)}\exp\Big\{-\frac{(w - w_0)^TV_0^{-1}(w - w_0) + 2b_0}{2\sigma^2}\Big\}\cdot(\sigma^2)^{-\frac{N}{2}}\exp\Big\{-\frac{\sum_{n=1}^N(y_n - w^Tx_n)^2}{2\sigma^2}\Big\}

Comparing the exponent of σ²:

a_N = a_0 + \frac{N}{2}

Comparing the coefficient of w^Tw:

V_N^{-1} = V_0^{-1} + \sum_{n=1}^N x_nx_n^T = V_0^{-1} + X^TX

Comparing the coefficient of w:

V_N^{-1}w_N = V_0^{-1}w_0 + \sum_{n=1}^N y_nx_n

Thus:

w_N = V_N(V_0^{-1}w_0 + X^Ty)

Finally, comparing the constant term inside the exponential:

b_N = b_0 + \frac{1}{2}\big(w_0^TV_0^{-1}w_0 + y^Ty - w_N^TV_N^{-1}w_N\big)

We have obtained 7.70 to 7.73, which can be summarized as 7.69:

p(w, \sigma^2|D) = NIG(w, \sigma^2|w_N, V_N, a_N, b_N)



8 Logistic regression
8.1 Spam classification using logistic regression
Practice by yourself.

8.2 Spam classification using naive Bayes


Practice by yourself.

8.3 Gradient and Hessian of log-likelihood for logistic regression

\frac{\partial}{\partial a}\sigma(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \frac{1}{1 + e^{-a}}\cdot\frac{e^{-a}}{1 + e^{-a}} = \sigma(a)(1 - \sigma(a))

g(w) = \frac{\partial}{\partial w}NLL(w) = -\sum_{i=1}^N\frac{\partial}{\partial w}\big[y_i\log\mu_i + (1 - y_i)\log(1 - \mu_i)\big]
= -\sum_{i=1}^N\Big[\frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i}\Big]\mu_i(1 - \mu_i)\,x_i
= \sum_{i=1}^N\big(\sigma(w^Tx_i) - y_i\big)x_i

For an arbitrary non-zero vector u (of proper shape):

u^TX^TSXu = (Xu)^TS(Xu)

Since S is positive definite, for an arbitrary non-zero v:

v^TSv > 0

Assuming X is a full-rank matrix, Xu is non-zero, thus:

(Xu)^TS(Xu) = u^T(X^TSX)u > 0

So X^TSX is positive definite.
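These two quantities translate directly into code; a sketch (X is N x D, y is a 0/1 vector; the synthetic data and the single Newton step are just an illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad_hess(w, X, y):
    """Gradient X^T(mu - y) and Hessian X^T S X of the logistic regression NLL."""
    mu = sigmoid(X @ w)
    g = X.T @ (mu - y)
    S = np.diag(mu * (1.0 - mu))
    H = X.T @ S @ X
    return g, H

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (sigmoid(X @ np.array([1.0, -1.0, 0.5])) > rng.random(100)).astype(float)
w = np.zeros(3)
g, H = nll_grad_hess(w, X, y)
w -= np.linalg.solve(H, g)        # one Newton (IRLS) step
print(w)
```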

8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression

By considering one independent component at a time, the complexity in form caused by the tensor product is reduced. For a specific w_*:

\frac{\partial}{\partial w_*}NLL(W) = -\sum_{n=1}^N\frac{\partial}{\partial w_*}\Big[y_{n*}w_*^Tx_n - \log\Big(\sum_{c=1}^C\exp(w_c^Tx_n)\Big)\Big]
= \sum_{n=1}^N\Big(-y_{n*} + \frac{\exp(w_*^Tx_n)}{\sum_{c=1}^C\exp(w_c^Tx_n)}\Big)x_n = \sum_{n=1}^N(\mu_{n*} - y_{n*})x_n

Combining the independent solutions for all classes into one matrix yields 8.38.
For the Hessian matrix, consider taking the gradient w.r.t. w_1 and w_2:

H_{1,2} = \nabla_{w_2}\nabla_{w_1}NLL(W) = \frac{\partial}{\partial w_2}\sum_{n=1}^N(\mu_{n1} - y_{n1})x_n

When w_1 and w_2 are the same:

\frac{\partial}{\partial w_1}\sum_{n=1}^N(\mu_{n1} - y_{n1})x_n^T = \sum_{n=1}^N\frac{\partial\mu_{n1}}{\partial w_1}x_n^T
= \sum_{n=1}^N\frac{\exp(w_1^Tx_n)\big(\sum_c\exp(w_c^Tx_n)\big) - \exp(w_1^Tx_n)^2}{\big(\sum_c\exp(w_c^Tx_n)\big)^2}\,x_nx_n^T
= \sum_{n=1}^N\mu_{n1}(1 - \mu_{n1})x_nx_n^T

When w_1 and w_2 are different:

\frac{\partial}{\partial w_2}\sum_{n=1}^N\mu_{n1}x_n^T = \sum_{n=1}^N\frac{-\exp(w_2^Tx_n)\exp(w_1^Tx_n)}{\big(\sum_c\exp(w_c^Tx_n)\big)^2}\,x_nx_n^T = -\sum_{n=1}^N\mu_{n1}\mu_{n2}x_nx_n^T

This ends in 8.44. The condition \sum_c y_{nc} = 1 is used from 8.34 to 8.35.

8.5 Symmetric version of l2 regularized multinomial logistic regression

Adding a regularizer is equivalent to doing MAP estimation, which in turn is equivalent to introducing a Lagrange multiplier for a new constraint. In this problem a Gaussian prior distribution with a homogeneous diagonal covariance matrix is introduced, and this leads to the constraint on the w_{cj} below.
At the optimum, the gradient in 8.47 goes to zero. Assume that \hat\mu_{cj} = y_{cj}; then g(W) = 0. The extra regularization gives \lambda\sum_{c=1}^C w_c = 0, which equals D independent linear constraints, of the form: for j = 1, \ldots, D, \sum_{c=1}^C\hat w_{cj} = 0.

8.6 Elementary properties of l2 regularized logistic regression

The first term of J(w)'s Hessian is positive definite (8.7); the second term's Hessian is positive definite as well (λ > 0). Therefore this function has a positive definite Hessian and hence a global optimum.
The posterior distribution takes the form:

p(w|D) \propto p(D|w)p(w), \qquad p(w) = N(w|0, \sigma^2I)

NLL(w) = -\log p(w|D) = -\log p(D|w) + \frac{1}{2\sigma^2}w^Tw + c

Therefore:

\lambda = \frac{1}{2\sigma^2}

The number of zeros in the global optimum is related to the value of λ, which is negatively correlated with the prior uncertainty of w. The smaller the uncertainty, the more w shrinks towards zero, which ends in more zeros in the answer.
If λ = 0, the prior uncertainty goes to infinity, and the posterior estimate converges to the MLE. As long as there is no constraint on w, it is possible that some components of w go to infinity.
When λ increases, the prior uncertainty decreases, hence the over-fitting effect is reduced. Generally this implies a decrease in training-set accuracy. At the same time, it may also increase the accuracy of the model on the test set, but this does not always happen.

8.7 Regularizing separate terms in 2d logistic regression


Practice by yourself.

9 Generalized linear models and the exponential family
9.1 Conjugate prior for univariate Gaussian in exponential family form
The 1d Gaussian distribution is:

N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\}

Rewrite it as:

p(x|\mu, \sigma^2) = \exp\Big\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} - \frac{\ln(2\pi\sigma^2)}{2}\Big\}

Denote \theta = (-\frac{\lambda}{2}, \lambda\mu)^T, A(\theta) = \frac{\lambda\mu^2}{2} + \frac{\ln(2\pi)}{2} - \frac{\ln\lambda}{2}, \phi(x) = (x^2, x)^T, where \lambda = 1/\sigma^2.
Consider the likelihood of the dataset D:

p(D|\theta) = \exp\Big\{\theta^T\Big(\sum_{n=1}^N\phi(x_n)\Big) - N\cdot A(\theta)\Big\}

According to the meaning of a prior distribution, we set up an imagined observation background in order to define the prior; by the form of the exponential family, the sufficient statistics are the only thing that matters. Assume that we have M prior observations, whose average squared value and average value are v_1 and v_2 respectively; then the prior distribution takes the form:

p(\theta|M, v_1, v_2) = \exp\{\theta_1\cdot Mv_1 + \theta_2\cdot Mv_2 - M\cdot A(\theta)\}
= \exp\Big\{-\frac{\lambda}{2}Mv_1 + \lambda\mu Mv_2 - \frac{M}{2}\lambda\mu^2 - \frac{M}{2}\ln 2\pi + \frac{M}{2}\ln\lambda\Big\}

It has three independent parameters. We are to prove that this equals p(\mu, \lambda) = N(\mu|\gamma, \frac{1}{\lambda(2\alpha-1)})\,Ga(\lambda|\alpha, \beta). Expand it into exponential form and ignore the terms independent of µ, λ:

p(\mu, \lambda) \propto \exp\Big\{(\alpha - 1)\ln\lambda - \beta\lambda - \frac{\lambda(2\alpha - 1)}{2}\mu^2 - \frac{\lambda(2\alpha - 1)}{2}\gamma^2 + \lambda(2\alpha - 1)\mu\gamma + \frac{1}{2}\ln\lambda\Big\}

Comparing the coefficients of \lambda\mu^2, \lambda\mu, \lambda, \ln\lambda, we obtain:

-\frac{2\alpha - 1}{2} = -\frac{M}{2}
\gamma(2\alpha - 1) = Mv_2
-\frac{2\alpha - 1}{2}\gamma^2 - \beta = -\frac{1}{2}Mv_1
(\alpha - 1) + \frac{1}{2} = \frac{M}{2}

Combining them ends in:

\alpha = \frac{M + 1}{2}, \qquad \beta = \frac{M}{2}(v_1 - v_2^2), \qquad \gamma = v_2

Thus the two distributions are equal up to a simple renaming of variables.

9.2 The MVN is in the exponential family


Here you can find a comprehensive solution:
https://stats.stackexchange.com/questions/231714/sufficient-statistic-for-multivari

10 Directed graphical models(Bayes nets)


...

11 Mixture models and the EM algorithm


11.1 Student T as infinite mixture of Gaussian
The 1d Student-t distribution takes the form:

St(x|\mu, \sigma^2, v) = \frac{\Gamma(\frac{v}{2} + \frac{1}{2})}{\Gamma(\frac{v}{2})}\Big(\frac{1}{\pi v\sigma^2}\Big)^{\frac{1}{2}}\Big(1 + \frac{(x - \mu)^2}{v\sigma^2}\Big)^{-\frac{v+1}{2}}

Consider the left side of 11.61:

\int N\Big(x\Big|\mu, \frac{\sigma^2}{z}\Big)Ga\Big(z\Big|\frac{v}{2}, \frac{v}{2}\Big)dz
= \int\frac{\sqrt z}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{z}{2\sigma^2}(x - \mu)^2\Big\}\frac{(\frac{v}{2})^{\frac{v}{2}}}{\Gamma(\frac{v}{2})}z^{\frac{v}{2} - 1}\exp\Big\{-\frac{v}{2}z\Big\}dz
= \frac{1}{\sqrt{2\pi}\sigma}\frac{(\frac{v}{2})^{\frac{v}{2}}}{\Gamma(\frac{v}{2})}\int z^{\frac{v-1}{2}}\exp\Big\{-\Big(\frac{v}{2} + \frac{(x - \mu)^2}{2\sigma^2}\Big)z\Big\}dz

The integrand consists of the z-dependent terms of the Gamma distribution Ga(z|\frac{v+1}{2}, \frac{(x-\mu)^2}{2\sigma^2} + \frac{v}{2}), so the integral equals the inverse of that distribution's normalization constant:

\int z^{\frac{v-1}{2}}\exp\Big\{-\Big(\frac{v}{2} + \frac{(x - \mu)^2}{2\sigma^2}\Big)z\Big\}dz = \Gamma\Big(\frac{v+1}{2}\Big)\Big(\frac{(x - \mu)^2}{2\sigma^2} + \frac{v}{2}\Big)^{-\frac{v+1}{2}}

Plugging this in derives 11.61.

11.2 EM for mixture of Gaussians


We are to optimize:

Q(\theta, \theta^{old}) = E_{p(z|D, \theta^{old})}\Big[\sum_{n=1}^N\log p(x_n, z_n|\theta)\Big]
= \sum_{n=1}^N E\Big[\log\prod_{k=1}^K\big(\pi_k p(x_n|z_k, \theta)\big)^{z_{nk}}\Big]
= \sum_{n=1}^N\sum_{k=1}^K r_{nk}\log\big(\pi_k p(x_n|z_k, \theta)\big)

where:

r_{nk} = p(z_{nk} = 1|x_n, \theta^{old})

When the emission distribution p(x|z, \theta) is Gaussian, consider first the terms in Q(\theta, \theta^{old}) involving \mu_k:

\sum_{n=1}^N r_{nk}\log p(x_n|z_k, \theta) = \sum_{n=1}^N r_{nk}\Big(-\frac{1}{2}\Big)(x_n - \mu_k)^T\Sigma_k^{-1}(x_n - \mu_k) + C

Setting the derivative to zero results in:

\sum_{n=1}^N r_{nk}(\mu_k - x_n) = 0

and we obtain 11.31:

\mu_k = \frac{\sum_{n=1}^N r_{nk}x_n}{\sum_{n=1}^N r_{nk}}

For the terms involving \Sigma_k in Q(\theta, \theta^{old}):

\sum_{n=1}^N r_{nk}\log p(x_n|z_k, \theta) = \sum_{n=1}^N r_{nk}\Big(-\frac{1}{2}\Big)\Big(\log|\Sigma_k| + (x_n - \mu_k)^T\Sigma_k^{-1}(x_n - \mu_k)\Big) + C

Using the same technique as in 4.1.3.1:

L(\Sigma_k^{-1} = \Lambda) = \Big(\sum_{n=1}^N r_{nk}\Big)\log|\Lambda| - Tr\Big\{\Big(\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T\Big)\Lambda\Big\}

The stationarity condition is:

\Big(\sum_{n=1}^N r_{nk}\Big)\Lambda^{-T} = \sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T

and we obtain 11.32:

\Sigma_k = \frac{\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N r_{nk}}
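The M-step updates 11.31 and 11.32 in code; a sketch (here R is assumed to be the N x K responsibility matrix r_{nk} from the E step, and the mixing-weight update pi_k = sum_n r_nk / N is the standard one, derived in problem 11.5 below):

```python
import numpy as np

def gmm_m_step(X, R):
    """Return (pi, mus, Sigmas) maximizing Q given responsibilities R (N x K)."""
    N, D = X.shape
    Nk = R.sum(axis=0)                                        # effective counts per component
    pi = Nk / N
    mus = (R.T @ X) / Nk[:, None]                             # 11.31
    Sigmas = []
    for k in range(R.shape[1]):
        Xc = X - mus[k]
        Sigmas.append((R[:, k, None] * Xc).T @ Xc / Nk[k])    # 11.32
    return pi, mus, np.stack(Sigmas)
```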

11.3 EM for mixtures of Bernoullis


During the MLE for mixtures of Bernoullis, consider (D = 2 marks the number of potential elements):

\frac{\partial}{\partial\mu_{kj}}\sum_{n=1}^N\sum_{k=1}^K r_{nk}\log p(x_n|\theta, k) = \sum_{n=1}^N r_{nk}\frac{\partial}{\partial\mu_{kj}}\Big(\sum_i^D x_{ni}\log\mu_{ki}\Big) = \sum_{n=1}^N r_{nk}\frac{x_{nj}}{\mu_{kj}}

Introduce a multiplier to enforce the constraint \sum_j\mu_{kj} = 1; then the condition for the derivative to be zero is:

\mu_{kj} = \frac{\sum_{n=1}^N r_{nk}x_{nj}}{\lambda}

Summing over all j:

1 = \sum_{j=1}^D\mu_{kj} = \frac{1}{\lambda}\sum_{j=1}^D\sum_{n=1}^N r_{nk}x_{nj} = \frac{1}{\lambda}\sum_{n=1}^N r_{nk}\sum_{j=1}^D x_{nj} = \frac{\sum_{n=1}^N r_{nk}}{\lambda}

results in:

\lambda = \sum_{n=1}^N r_{nk}

Hence 11.116.
Introduce a prior:

p(\mu_{k0}) \propto \mu_{k0}^{\alpha-1}\mu_{k1}^{\beta-1}

The zero-derivative condition becomes:

\mu_{k0} = \frac{\sum_{n=1}^N r_{nk}x_{n0} + \alpha - 1}{\lambda}, \qquad \mu_{k1} = \frac{\sum_{n=1}^N r_{nk}x_{n1} + \beta - 1}{\lambda}

And:

1 = \mu_{k0} + \mu_{k1} = \frac{1}{\lambda}\Big(\sum_{n=1}^N r_{nk}(x_{n0} + x_{n1}) + \alpha + \beta - 2\Big)

\lambda = \sum_{n=1}^N r_{nk} + \alpha + \beta - 2

Hence 11.117.

11.4 EM for mixture of Student distributions


The log-likelihood for the complete data set is:

l_c(x, z) = \log\Big(N\Big(x\Big|\mu, \frac{\Sigma}{z}\Big)Ga\Big(z\Big|\frac{v}{2}, \frac{v}{2}\Big)\Big)
= -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| + \frac{D}{2}\log z - \frac{z}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) + \frac{v}{2}\log\frac{v}{2} - \log\Gamma\Big(\frac{v}{2}\Big) + \Big(\frac{v}{2} - 1\Big)\log z - \frac{v}{2}z

Collect the terms involving v:

l_v(x, z) = \frac{v}{2}\log\frac{v}{2} - \log\Gamma\Big(\frac{v}{2}\Big) + \frac{v}{2}(\log z - z)

The likelihood w.r.t. v over the complete data set is:

L_v = \frac{vN}{2}\log\frac{v}{2} - N\log\Gamma\Big(\frac{v}{2}\Big) + \frac{v}{2}\sum_{n=1}^N(\log z_n - z_n)

Setting the derivative to zero gives:

\frac{\nabla\Gamma(\frac{v}{2})}{\Gamma(\frac{v}{2})} - 1 - \log\frac{v}{2} = \frac{\sum_{n=1}^N(\log z_n - z_n)}{N}

For µ and Σ:

l_{\mu,\Sigma}(x, z) = -\frac{1}{2}\log|\Sigma| - \frac{z}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)

L_{\mu,\Sigma} = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{n=1}^N z_n(x_n - \mu)^T\Sigma^{-1}(x_n - \mu)

Hence it equals the (weighted) MLE used for an MVN.

11.5 Gradient descent for fitting GMM


From the given information:

p(x|\theta) = \sum_k\pi_kN(x|\mu_k, \Sigma_k), \qquad l(\theta) = \sum_{n=1}^N\log p(x_n|\theta)

Derivative w.r.t. \mu_k:

\frac{\partial}{\partial\mu_k}l(\theta) = \sum_{n=1}^N\frac{\pi_kN(x_n|\mu_k, \Sigma_k)\,\nabla_{\mu_k}\big\{-\frac{1}{2}(x_n - \mu_k)^T\Sigma_k^{-1}(x_n - \mu_k)\big\}}{\sum_{k'=1}^K\pi_{k'}N(x_n|\mu_{k'}, \Sigma_{k'})} = \sum_{n=1}^N r_{nk}\Sigma_k^{-1}(x_n - \mu_k)

w.r.t. \pi_k:

\frac{\partial}{\partial\pi_k}l(\theta) = \sum_{n=1}^N\frac{N(x_n|\mu_k, \Sigma_k)}{\sum_{k'=1}^K\pi_{k'}N(x_n|\mu_{k'}, \Sigma_{k'})} = \frac{1}{\pi_k}\sum_{n=1}^N r_{nk}

Using a Lagrange multiplier ends in:

\pi_k = \frac{\sum_{n=1}^N r_{nk}}{\lambda}

Summing over k and normalizing:

\pi_k = \frac{\sum_{n=1}^N r_{nk}}{N}

For \Sigma_k:

\frac{\partial}{\partial\Sigma_k}l(\theta) = \sum_{n=1}^N\frac{\pi_k\nabla_{\Sigma_k}N(x_n|\mu_k, \Sigma_k)}{\sum_{k'=1}^K\pi_{k'}N(x_n|\mu_{k'}, \Sigma_{k'})}

where:

\nabla_{\Sigma_k}N(x|\mu_k, \Sigma_k) = N(x|\mu_k, \Sigma_k)\Big\{\nabla_{\Sigma_k}\Big(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\Big) - \frac{1}{2}\nabla_{\Sigma_k}\log|\Sigma_k|\Big\} = N(x|\mu_k, \Sigma_k)\,\nabla_{\Sigma_k}\big(\log N(x|\mu_k, \Sigma_k)\big)

Thus we have:

\Sigma_k = \frac{\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N r_{nk}}

11.6 EM for a finite scale mixture of Gaussians


J and K are independent. Using Bayes' rule (we omit θ from the conditioning
w.l.o.g.):

    p(J_n = j, K_n = k|x_n) = p(J_n = j, K_n = k, x_n) / p(x_n)
                            = p(J_n = j) p(K_n = k) p(x_n|J_n = j, K_n = k) / Σ_{J_n, K_n} p(J_n, K_n, x_n)
                            = p_j q_k N(x_n|µ_j, σ_k²) / Σ_{J_n=1}^m Σ_{K_n=1}^l p_{J_n} q_{K_n} N(x_n|µ_{J_n}, σ_{K_n}²)

Derive the form of the auxiliary function Q(θ^new, θ^old):

    Q(θ^new, θ^old) = Σ_{n=1}^N E_{θ^old}[log p(x_n, J_n, K_n|θ^new)]
                    = Σ_{n=1}^N E[log Π_{j=1}^m Π_{k=1}^l p(x_n, J_n, K_n|θ^new)^{I(J_n=j, K_n=k)}]
                    = Σ_{n=1}^N Σ_{j=1}^m Σ_{k=1}^l E[I(J_n = j, K_n = k)](log p_j + log q_k + log N(x_n|µ_j, σ_k²))
                    = Σ_{n,j,k} r_njk log p_j + Σ_{n,j,k} r_njk log q_k + Σ_{n,j,k} r_njk log N(x_n|µ_j, σ_k²)

We are to optimize the parameters p, q, µ and σ². It is noticeable that p and q
can be optimized independently. Now fix σ² and optimize µ:

    ∂/∂µ_j Σ_{n,j',k} r_nj'k log N(x_n|µ_{j'}, σ_k²) = Σ_{n,k} r_njk ∇_{µ_j} log N(x_n|µ_j, σ_k²)
                                                     = Σ_{n,k} r_njk (x_n − µ_j)/σ_k²

and we end up with:

    µ_j = Σ_{n,k} r_njk x_n/σ_k² / Σ_{n,k} r_njk/σ_k²

11.7 Manual calculation of the M step for a GMM


Practise by yourself.

11.8 Moments of a mixture of Gaussians


For the expectation of the mixture distribution:

    E(x) = ∫ x Σ_k π_k N(x|µ_k, Σ_k) dx
         = Σ_k π_k (∫ x N(x|µ_k, Σ_k) dx)
         = Σ_k π_k µ_k

Using cov(x) = E(xx^T) − E(x)E(x)^T, we have:

    E(xx^T) = ∫ xx^T Σ_k π_k N(x|µ_k, Σ_k) dx
            = Σ_k π_k ∫ xx^T N(x|µ_k, Σ_k) dx

where:

    ∫ xx^T N(x|µ_k, Σ_k) dx = E_{N(µ_k,Σ_k)}(xx^T)
                             = cov_{N(µ_k,Σ_k)}(x) + E_{N(µ_k,Σ_k)}(x) E_{N(µ_k,Σ_k)}(x)^T
                             = Σ_k + µ_k µ_k^T

Therefore:

    cov(x) = Σ_k π_k (Σ_k + µ_k µ_k^T) − E(x)E(x)^T
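The two identities above are easy to verify numerically. Below is a small
sketch (component parameters chosen arbitrarily, names mine) that compares the
analytic mixture moments with Monte Carlo estimates.

# Numeric check: E[x] = sum_k pi_k mu_k,
# cov[x] = sum_k pi_k (Sigma_k + mu_k mu_k^T) - E[x]E[x]^T.
import numpy as np

pi = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])

mean = np.einsum('k,kd->d', pi, mus)
second_moment = sum(p * (S + np.outer(m, m)) for p, m, S in zip(pi, mus, Sigmas))
cov = second_moment - np.outer(mean, mean)

# Monte Carlo confirmation
rng = np.random.default_rng(0)
n = 100_000
z = rng.choice(2, size=n, p=pi)
draws = np.stack([rng.multivariate_normal(m, S, size=n) for m, S in zip(mus, Sigmas)])
samples = draws[z, np.arange(n)]
assert np.allclose(samples.mean(axis=0), mean, atol=0.1)
assert np.allclose(np.cov(samples.T), cov, atol=0.1)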

11.9 K-means clustering by hand


Practise by yourself.

11.10 Deriving the K-means cost function


For each cluster k, apply 11.134 to the inner and outer sums:

    Σ_{i:z_i=k} Σ_{i':z_{i'}=k} (x_i − x_{i'})² = Σ_{i:z_i=k} (n_k s² + n_k (x̄_k − x_i)²)
                                                = n_k² s² + n_k (n_k s²)
                                                = 2 n_k² s²

On the other hand, the right side of 11.131 for cluster k is:

    n_k Σ_{i:z_i=k} (x_i − x̄_k)² = n_k (n_k s²)

Hence Σ_{i,i' in cluster k} (x_i − x_{i'})² = 2 n_k Σ_{i:z_i=k} (x_i − x̄_k)², and
summing over k yields 11.131.

11.11 Visible mixtures of Gaussians are in the exponential family

Encode the latent variable with a one-hot code, z_c = I(x is generated from the
c-th component); then (omitting θ from the conditioning w.l.o.g.):

    p(z) = Π_{c=1}^C π_c^{z_c}

    p(x|z) = Π_{c=1}^C ( (1/√(2πσ_c²)) exp{−(x − µ_c)²/(2σ_c²)} )^{z_c}

The log of the joint distribution is:

    log p(x, z) = log Π_{c=1}^C ( (π_c/√(2πσ_c²)) exp{−(x − µ_c)²/(2σ_c²)} )^{z_c}
                = Σ_{c=1}^C z_c (log π_c − (1/2) log(2πσ_c²) − (x − µ_c)²/(2σ_c²))

which is a sum of inner products, hence an exponential family. The sufficient
statistics are linear combinations of z, zx and zx².

11.12 EM for robust linear regression with a Student t likelihood

Using the complete-data likelihood terms involving w derived in 11.4.5:

    L_N(w) = −(1/(2σ²)) Σ_{n=1}^N z_n (y_n − w^T x_n)² + const

Set the derivative to zero:

    w^T Σ_{n=1}^N z_n x_n x_n^T = Σ_{n=1}^N z_n y_n x_n^T

This means:

    w^T = (Σ_{n=1}^N z_n y_n x_n^T)(Σ_{n=1}^N z_n x_n x_n^T)^{-1}

11.13 EM for EB estimation of Gaussian shrinkage model


For every j, 5.90 takes a different form (this corresponds to the E-step):

    p(x̄_j|µ, t², σ²) = N(x̄_j|µ, t² + σ_j²)

Integrating out θ_j, the marginal log-likelihood is given by:

    log Π_{j=1}^D N(x̄_j|µ, t² + σ_j²) = Σ_{j=1}^D (−1/2)( log 2π(t² + σ_j²) + (x̄_j − µ)²/(t² + σ_j²) )

Then we optimize each parameter in turn (this corresponds to the M-step):

    µ = Σ_{j=1}^D x̄_j/(t² + σ_j²) / Σ_{j=1}^D 1/(t² + σ_j²)

and t² satisfies:

    Σ_{j=1}^D [ (t² + σ_j²) − (x̄_j − µ)² ] / (t² + σ_j²)² = 0

11.14 EM for censored linear regression


Unsolved.

11.15 Posterior mean and variance of a truncated Gaussian

We denote A = (c_i − µ_i)/σ. For the mean:

    E[z_i|z_i ≥ c_i] = µ_i + σ E[ε_i|ε_i ≥ A]

And we have:

    E[ε_i|ε_i ≥ A] = (1/p(ε_i ≥ A)) ∫_A^{+∞} ε_i N(ε_i|0, 1) dε_i = φ(A)/(1 − Φ(A)) = H(A)

In the last step we used 11.141 and 11.139. Plugging this in:

    E[z_i|z_i ≥ c_i] = µ_i + σ H(A)

Now calculate the expectation of the squared term:

    E[z_i²|z_i ≥ c_i] = µ_i² + 2µ_i σ E[ε_i|ε_i ≥ A] + σ² E[ε_i²|ε_i ≥ A]

To address E[ε_i²|ε_i ≥ A], expand the hint from the question:

    d/dw (w N(w|0, 1)) = N(w|0, 1) − w² N(w|0, 1)

We have:

    ∫_b^c w² N(w|0, 1) dw = Φ(c) − Φ(b) − c N(c|0, 1) + b N(b|0, 1)

    E[ε_i²|ε_i ≥ A] = (1/p(ε_i ≥ A)) ∫_A^{+∞} w² N(w|0, 1) dw = (1 − Φ(A) + A φ(A))/(1 − Φ(A))

Plug this into the conclusion drawn from part a:

    E[z_i²|z_i ≥ c_i] = µ_i² + 2µ_i σ H(A) + σ² (1 − Φ(A) + A φ(A))/(1 − Φ(A))
                      = µ_i² + σ² + H(A)(σ c_i + σ µ_i)
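These truncated-Gaussian moments can be checked numerically. The following
sketch (parameter values arbitrary, names mine) compares the formulas with
brute-force sampling, using scipy's φ and Φ.

# Sanity check of E[z|z >= c] and E[z^2|z >= c] against samples.
import numpy as np
from scipy.stats import norm

mu, sigma, c = 1.0, 2.0, 1.5
A = (c - mu) / sigma
H = norm.pdf(A) / (1.0 - norm.cdf(A))

mean_trunc = mu + sigma * H
second_trunc = mu**2 + sigma**2 + H * (sigma * c + sigma * mu)

rng = np.random.default_rng(0)
z = rng.normal(mu, sigma, size=2_000_000)
z = z[z >= c]
assert np.isclose(z.mean(), mean_trunc, rtol=1e-2)
assert np.isclose((z**2).mean(), second_trunc, rtol=1e-2)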



12 Latent linear models


12.1 M-step for FA
Review the EM for FA (Factor Analysis) first. Basically we have (centralize X
to cancel µ w.l.o.g.):

    p(z) = N(z|0, I)

    p(x|z) = N(x|Wz, Ψ)

And:

    p(z|x) = N(z|m, Σ)

    Σ = (I + W^T Ψ^{-1} W)^{-1}

    m = Σ W^T Ψ^{-1} x_n

Denote x_n's latent variable by z_n. The log-likelihood for the complete data
set {x, z} is:

    log Π_{n=1}^N p(x_n, z_n) = Σ_{n=1}^N (log p(z_n) + log p(x_n|z_n))

The prior term log p(z) can be omitted since its parameters are fixed at 0 and
I, hence:

    Q(θ, θ^old) = E_{θ^old}[Σ_{n=1}^N log p(x_n|z_n, θ)]
                = E[Σ_{n=1}^N (c − (1/2) log|Ψ| − (1/2)(x_n − Wz_n)^T Ψ^{-1} (x_n − Wz_n))]
                = C − (N/2) log|Ψ| − (1/2) Σ_{n=1}^N E[(x_n − Wz_n)^T Ψ^{-1} (x_n − Wz_n)]
                = C − (N/2) log|Ψ| − (1/2) Σ_{n=1}^N x_n^T Ψ^{-1} x_n
                  − (1/2) Σ_{n=1}^N Tr(W^T Ψ^{-1} W E[z_n z_n^T]) + Σ_{n=1}^N x_n^T Ψ^{-1} W E[z_n]

As long as p(z|x, θ^old) = N(z|m, Σ), we have:

    E[z_n|x_n] = Σ W^T Ψ^{-1} x_n

    E[z_n z_n^T|x_n] = cov(z_n|x_n) + E[z_n|x_n]E[z_n|x_n]^T = Σ + (Σ W^T Ψ^{-1} x_n)(Σ W^T Ψ^{-1} x_n)^T

From now on, x and θ^old are omitted from the conditioning when calculating
expectations.
Optimize w.r.t. W:

    ∂Q/∂W = Σ_{n=1}^N Ψ^{-1} x_n E[z_n]^T − Σ_{n=1}^N Ψ^{-1} W E[z_n z_n^T]

Set it to zero:

    W = (Σ_{n=1}^N x_n E[z_n]^T)(Σ_{n=1}^N E[z_n z_n^T])^{-1}

Optimize w.r.t. Ψ^{-1}:

    ∂Q/∂Ψ^{-1} = (N/2) Ψ − (1/2) Σ_{n=1}^N x_n x_n^T − (1/2) Σ_{n=1}^N W E[z_n z_n^T] W^T + Σ_{n=1}^N W E[z_n] x_n^T

Plugging in the expression for W:

    Ψ = (1/N) Σ_{n=1}^N (x_n x_n^T − W E[z_n] x_n^T)

Since Ψ is assumed to be a diagonal matrix:

    Ψ = (1/N) diag( Σ_{n=1}^N (x_n x_n^T − W E[z_n] x_n^T) )

This solution comes from "The EM Algorithm for Mixtures of Factor Analyzers",
Zoubin Ghahramani and Geoffrey E. Hinton, 1996, where the EM for mixtures of FA
is given as well.
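The E- and M-step expressions above can be put together as follows. This is a
minimal sketch of one EM iteration for FA (names mine; X assumed centered), not
the reference implementation from the cited paper.

# One EM iteration for factor analysis, following the updates above.
import numpy as np

def fa_em_step(X, W, Psi):
    N, D = X.shape
    L = W.shape[1]
    Psi_inv = np.diag(1.0 / np.diag(Psi))
    # E-step: posterior p(z|x) = N(m_n, Sigma), with shared Sigma
    Sigma = np.linalg.inv(np.eye(L) + W.T @ Psi_inv @ W)
    M = X @ Psi_inv @ W @ Sigma            # rows are E[z_n]^T
    Ezz = N * Sigma + M.T @ M              # sum_n E[z_n z_n^T]
    # M-step: W and diagonal Psi
    W_new = (X.T @ M) @ np.linalg.inv(Ezz)
    S = X.T @ X
    Psi_new = np.diag(np.diag(S - W_new @ M.T @ X) / N)
    return W_new, Psi_new

# Example usage on synthetic data
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
W_true = rng.normal(size=(5, 2))
X = Z @ W_true.T + 0.1 * rng.normal(size=(500, 5))
X -= X.mean(axis=0)
W, Psi = rng.normal(size=(5, 2)), np.eye(5)
for _ in range(50):
    W, Psi = fa_em_step(X, W, Psi)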

12.2 MAP estimation for the FA model


Assume priors p(W) and p(Ψ). Compared with the previous question, the M-step
needs to be modified:

    ∂/∂W (Q + log p(W)) = 0

    ∂/∂Ψ (Q + log p(Ψ)) = 0

12.3 Heuristic for assessing applicability of PCA*


This requires figures for illustration, which are omitted here.

12.4 Deriving the second principal component


For:

    J(v_2, z_2) = (1/N) Σ_{n=1}^N (x_n − z_{n1} v_1 − z_{n2} v_2)^T (x_n − z_{n1} v_1 − z_{n2} v_2)

Consider the derivative w.r.t. one component of z_2:

    ∂J/∂z_{m2} = (1/N)(2 z_{m2} v_2^T v_2 − 2 v_2^T (x_m − z_{m1} v_1)) = 0

Using v_2^T v_2 = 1 and v_2^T v_1 = 0 yields:

    z_{m2} = v_2^T x_m

Since C is symmetric, we use the constraints on v_1 and v_2 and apply an
eigendecomposition to C first:

    C = O^T Λ O

where:

    Λ = diag{λ_1, λ_2, ...}

contains C's eigenvalues from the largest to the smallest, and

    O^T = {u_1, u_2, ...}

contains the eigenvectors, which are orthonormal, u_i^T u_j = I(i = j), with
u_1 = v_1.
Under the constraints v_2^T v_2 = 1 and v_2^T v_1 = 0, we are to maximize:

    (O v_2)^T Λ (O v_2)

Notice that O v_2 is a rotation of v_2, so its length is unchanged, and
(O v_2)^T Λ (O v_2) is the sum of the vector's squared components weighted by
the eigenvalues in Λ. Hence, given the orthogonality to u_1, the optimum is
reached by putting all the length on the component associated with the largest
remaining eigenvalue, which means:

    u_i^T v_2 = I(i = 2)

Therefore:

    v_2 = u_2

12.5 Deriving the residual error for PCA

    ||x_n − Σ_{j=1}^K z_nj v_j||² = (x_n − Σ_{j=1}^K z_nj v_j)^T (x_n − Σ_{j=1}^K z_nj v_j)
                                  = x_n^T x_n + Σ_{j=1}^K z_nj² − 2 Σ_{j=1}^K z_nj x_n^T v_j

Using v_i^T v_j = I(i = j) and z_nj = x_n^T v_j, we end up with the conclusion
of part a:

    ||x_n − Σ_{j=1}^K z_nj v_j||² = x_n^T x_n − Σ_{j=1}^K v_j^T x_n x_n^T v_j

Plugging in v_j^T C v_j = λ_j and summing over n draws the conclusion in part b.
Plugging K = d into the conclusion of part b, we have:

    J_{K=d} = (1/N) Σ_{n=1}^N x_n^T x_n − Σ_{j=1}^d λ_j = 0

In general:

    J_K = Σ_{j=1}^d λ_j − Σ_{j=1}^K λ_j = Σ_{j=K+1}^d λ_j

12.6 Derivation of Fisher’s linear discriminant


Straightforward algebra.
(need reference)

12.7 PCA via successive deflation


This problem involves the same technique used in solving 12.4, hence
omitted.

12.8 Latent semantic indexing


Practice by yourself.

12.9 Imputation in a FA model*


Unsolved. (The notation x_v and x_h in the problem is unclear to the author.)

12.10 Efficiently evaluating the PPCA density


With:

    p(z) = N(z|0, I)

    p(x|z) = N(x|Wz, σ² I)

use the conclusion from chapter 4:

    p(x) = N(x|0, σ² I + W W^T)

The derivation of the MLE in 12.2.4 can be found in "Probabilistic Principal
Component Analysis", Michael E. Tipping and Christopher M. Bishop, 1999.
Plugging in the MLE, the inverse of the (D × D) covariance matrix can be
computed via the matrix inversion lemma:

    (σ² I + W W^T)^{-1} = σ^{-2} I − σ^{-2} W (W^T W + σ² I)^{-1} W^T

which involves inverting only an L × L matrix.
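The matrix inversion lemma step can be verified numerically; the sketch below
(arbitrary sizes, names mine) checks that the L × L route gives the same
inverse as direct D × D inversion.

# Woodbury check for the PPCA covariance (sigma^2 I + W W^T)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
D, L, sigma2 = 50, 3, 0.5
W = rng.normal(size=(D, L))

C = sigma2 * np.eye(D) + W @ W.T
C_inv_direct = np.linalg.inv(C)
C_inv_woodbury = (np.eye(D) / sigma2
                  - (W @ np.linalg.inv(W.T @ W + sigma2 * np.eye(L)) @ W.T) / sigma2)

assert np.allclose(C_inv_direct, C_inv_woodbury)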

12.11 PPCA vs FA
Practice by yourself.

13 Sparse linear models


13.1 Partial derivative of the RSS
Define:

    RSS(w) = Σ_{n=1}^N (y_n − w^T x_n)²

Straightforwardly:

    ∂RSS(w)/∂w_j = Σ_{n=1}^N 2(y_n − w^T x_n)(−x_nj)
                 = −Σ_{n=1}^N 2(x_nj y_n − x_nj Σ_{i=1}^D w_i x_ni)
                 = −Σ_{n=1}^N 2(x_nj y_n − x_nj Σ_{i≠j} w_i x_ni − x_nj² w_j)

The coefficient of w_j is:

    a_j = 2 Σ_{n=1}^N x_nj²

and the remaining terms can be absorbed into:

    c_j = 2 Σ_{n=1}^N x_nj (y_n − w_{−j}^T x_{n,−j})

In the end:

    w_j = c_j / a_j
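The quantities a_j and c_j give the coordinate-wise least-squares update
w_j = c_j/a_j. A minimal sketch follows (names mine; the lasso soft-thresholding
of 13.3.2 would be applied on top of this update).

# Coordinate update based on a_j and c_j as defined above.
import numpy as np

def coordinate_update(X, y, w, j):
    a_j = 2.0 * np.sum(X[:, j] ** 2)
    r_j = y - X @ w + X[:, j] * w[j]      # residual excluding feature j
    c_j = 2.0 * np.sum(X[:, j] * r_j)
    return c_j / a_j

# Example: cycling the update converges to the least-squares solution
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=100)
w = np.zeros(5)
for _ in range(100):
    for j in range(5):
        w[j] = coordinate_update(X, y, w, j)
print(w)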

13.2 Derivation of M-step for EB for linear regression


We give the EM for Automatic Relevance Determination (ARD). For the linear
regression setting:

    p(y|x, w, β) = N(y|Xw, β^{-1})

    p(w) = N(w|0, A^{-1})

    A = diag(α)

In the E-step, we estimate the posterior of w. Using the linear-Gaussian
relationship:

    p(w|y, α, β) = N(µ, Σ)

    Σ^{-1} = A + β X^T X

    µ = Σ(β X^T y)

Then:

    E_{α,β}[w] = µ

    E_{α,β}[w w^T] = Σ + µ µ^T

For the auxiliary function:

    Q(α, β, α^old, β^old) = E_{α^old, β^old}[log p(y, w|α, β)]
                          = E[log p(y|w, β) + log p(w|α)]
                          = (1/2) E[N log β − β(y − Xw)^T (y − Xw) + Σ_j log α_j − w^T A w]

The E-step thus only requires E[w] and E[w w^T], which have been computed.
Introduce a prior for each component of α and for β:

    p(α, β) = Π_j Ga(α_j|a + 1, b) · Ga(β|c + 1, d)

Hence the posterior auxiliary function is:

    Q' = Q + log p(α, β) = Q + Σ_j (a log α_j − b α_j) + (c log β − d β)

In the M-step, optimize w.r.t. α_i:

    ∂Q'/∂α_i = 1/(2α_i) − E[w_i²]/2 + a/α_i − b

Setting it to zero gives:

    α_i = (1 + 2a)/(E[w_i²] + 2b)

Optimize w.r.t. β:

    ∂Q'/∂β = N/(2β) − E[||y − Xw||²]/2 + c/β − d

which ends in:

    β = (N + 2c)/(E[||y − Xw||²] + 2d)

Expanding the expectation ends in 13.168.
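The E-step and the two M-step updates above can be iterated as follows. This is
a minimal sketch under my own choice of the hyper-parameters a, b, c, d (and
names), not the book's code.

# EM/ARD updates for Bayesian linear regression.
import numpy as np

def ard_em(X, y, n_iter=50, a=1e-3, b=1e-3, c=1e-3, d=1e-3):
    N, D = X.shape
    alpha, beta = np.ones(D), 1.0
    for _ in range(n_iter):
        # E-step: posterior over w
        Sigma = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)
        mu = beta * Sigma @ X.T @ y
        Eww = Sigma + np.outer(mu, mu)
        # M-step: update alpha_i and beta
        alpha = (1.0 + 2.0 * a) / (np.diag(Eww) + 2.0 * b)
        resid2 = np.sum((y - X @ mu) ** 2) + np.trace(X @ Sigma @ X.T)   # E[||y-Xw||^2]
        beta = (N + 2.0 * c) / (resid2 + 2.0 * d)
    return mu, alpha, beta

# Irrelevant features get large alpha (strong shrinkage towards zero)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)
mu, alpha, beta = ard_em(X, y)
print(mu, alpha)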

13.3 Derivation of fixed point updates for EB for linear regression*

Unsolved.

13.4 Marginal likelihood for linear regression*


Straightforward algebra.

13.5 Reducing elastic net to lasso


Expand both sides of 13.196. The right side:

    J_1(cw) = (y − cXw)^T (y − cXw) + c² λ_2 w^T w + c λ_1 ||w||_1
            = y^T y + c² w^T X^T X w − 2c y^T X w + c² λ_2 w^T w + c λ_1 ||w||_1

The left side:

    J_2(w) = ( y − cXw ; −c√λ_2 w )^T ( y − cXw ; −c√λ_2 w ) + c λ_1 ||w||_1
           = (y − cXw)^T (y − cXw) + c² λ_2 w^T w + c λ_1 ||w||_1
           = y^T y + c² w^T X^T X w − 2c y^T X w + c² λ_2 w^T w + c λ_1 ||w||_1

Hence 13.196 and 13.195 are equal.
This shows that elastic net regularization, whose penalty is a linear
combination of the l1 and l2 norms, reduces to a lasso problem on suitably
modified data.

13.6 Shrinkage in linear regression


For ordinary least squares:

    RSS(w) = (y − Xw)^T (y − Xw)

Using X^T X = I:

    RSS(w) = c + w^T w − 2 y^T X w

Take the derivative:

    ∂RSS(w)/∂w_k = 2 w_k − 2 Σ_{n=1}^N y_n x_nk

We have:

    ŵ_k^{OLS} = Σ_{n=1}^N y_n x_nk

In ridge regression:

    RSS(w) = (y − Xw)^T (y − Xw) + λ w^T w

Taking the derivative and setting it to zero:

    (2 + 2λ) w_k = 2 Σ_{n=1}^N y_n x_nk

Thus:

    ŵ_k^{ridge} = Σ_{n=1}^N y_n x_nk / (1 + λ)

The solution for lasso regression using the subderivative is derived in 13.3.2,
which concludes in 13.63:

    ŵ_k^{lasso} = sign(ŵ_k^{OLS})(|ŵ_k^{OLS}| − λ/2)_+

Observing figure 13.24, it is easy to identify the black line as OLS, the gray
one as ridge and the dotted one as lasso, with λ_1 = λ_2 = 1. It is noticeable
that ridge shrinks the estimate towards the horizontal axis, while lasso
shrinks it sharply to zero below a certain threshold.
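The three estimators can be plotted against ŵ_k^{OLS} to reproduce the
qualitative behaviour of figure 13.24; a small sketch (λ value mine) follows.

# OLS vs ridge vs lasso shrinkage under an orthonormal design.
import numpy as np
import matplotlib.pyplot as plt

lam = 1.0
w_ols = np.linspace(-3, 3, 200)
w_ridge = w_ols / (1.0 + lam)
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

plt.plot(w_ols, w_ols, 'k-', label='OLS')
plt.plot(w_ols, w_ridge, color='gray', label='ridge')
plt.plot(w_ols, w_lasso, 'k:', label='lasso')
plt.xlabel('OLS estimate'); plt.ylabel('shrunken estimate'); plt.legend()
plt.show()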

13.7 Prior for the Bernoulli rate parameter in the spike and slab model

    p(γ|α_1, α_2) = Π_{d=1}^D p(γ_d|α_1, α_2)

Integrate out π_d:

    p(γ_d|α_1, α_2) = (1/B(α_1, α_2)) ∫ p(γ_d|π_d) p(π_d|α_1, α_2) B(α_1, α_2) dπ_d
                    = (1/B(α_1, α_2)) ∫ π_d^{γ_d} (1 − π_d)^{1−γ_d} π_d^{α_1−1} (1 − π_d)^{α_2−1} dπ_d
                    = (1/B(α_1, α_2)) ∫ π_d^{α_1+γ_d−1} (1 − π_d)^{α_2+1−γ_d−1} dπ_d
                    = B(α_1 + γ_d, α_2 + 1 − γ_d)/B(α_1, α_2)
                    = [Γ(α_1 + α_2)/(Γ(α_1)Γ(α_2))] · Γ(α_1 + γ_d)Γ(α_2 + 1 − γ_d)/Γ(α_1 + α_2 + 1)

Using Γ(x + 1) = xΓ(x), this equals α_1/(α_1 + α_2) when γ_d = 1 and
α_2/(α_1 + α_2) when γ_d = 0. Therefore (N_1 marks the number of 1s in γ):

    p(γ|α_1, α_2) = α_1^{N_1} α_2^{N−N_1} / (α_1 + α_2)^N

And:

    log p(γ|α_1, α_2) = N log(α_2/(α_1 + α_2)) + N_1 log(α_1/α_2)

13.8 Deriving E step for GSM prior

    Lap(w_j|0, 1/γ) = ∫ N(w_j|0, τ_j²) Ga(τ_j²|1, γ²/2) dτ_j²

(One could also take a Laplace/moment-generating transform of both sides.) We
are to calculate:

    E[1/τ_j² | w_j] = ∫ (1/τ_j²) p(τ_j²|w_j) dτ_j² = ∫ (1/τ_j²) p(w_j|τ_j²) p(τ_j²)/p(w_j) dτ_j²
                    = (1/p(w_j)) ∫ (1/τ_j²) N(w_j|0, τ_j²) p(τ_j²) dτ_j²

According to 13.200, this reduces to:

    (1/p(w_j)) · (−1/|w_j|) · d/dw_j ∫ N(w_j|0, τ_j²) p(τ_j²) dτ_j²

Because:

    d/dw log p(w) = (1/p(w)) d/dw p(w)

this gives 13.197:

    (1/p(w_j)) · (−1/|w_j|) · d/dw_j p(w_j) = −(1/|w_j|) · d/dw_j log p(w_j)

Note: this solution is questionable; there may be typographical errors in both
Hint 1 and Hint 2 of the exercise.

13.9 EM for sparse probit regression with Laplace prior


Straightforward probit regression involves no latent variable. Introducing a
Laplace prior on the linear coefficients w results in its lasso version. Since
the Laplace distribution is a continuous mixture of Gaussians, a latent
variable τ² with the same dimension as w is introduced. The PGM for probit
regression looks like:

    γ → τ² → w → y ← X

The joint distribution is:

    p(γ, τ², w, y|X) = p(γ) Π_{d=1}^D p(τ_d²|γ) Π_{d=1}^D p(w_d|τ_d²) Π_{n=1}^N Φ(w^T x_n)^{y_n} (1 − Φ(w^T x_n))^{1−y_n}

For conciseness, we treat γ as a constant; according to 13.86:

    p(τ_d²|γ) = Ga(τ_d²|1, γ²/2)

    p(w_d|τ_d²) = N(w_d|0, τ_d²)

Hence:

    p(τ², w, y|X, γ) ∝ exp{−(1/2) Σ_{d=1}^D (γ² τ_d² + w_d²/τ_d²)} · Π_{d=1}^D (1/τ_d)
                       · Π_{n=1}^N Φ(w^T x_n)^{y_n} (1 − Φ(w^T x_n))^{1−y_n}

In Q(θ^new, θ^old), the expectation is taken w.r.t. θ^old. We have assumed that
w is the parameter and τ² is the latent variable, thus:

    Q(w, w^old) = E_{w^old}[log p(y, τ²|w)]

Now extract the terms involving w from log p(τ², w, y):

    log p(y, τ²|w) = c − (1/2) Σ_{d=1}^D w_d²/τ_d² + Σ_{n=1}^N [y_n log Φ(w^T x_n) + (1 − y_n) log(1 − Φ(w^T x_n))]

Thus we only need to calculate one expectation in the E-step:

    E[1/τ_d² | w^old]

which can be done as in 13.4.4.3, because probit and linear regression share
the same PGM up to this stage.
The M-step is the same as for Gaussian-prior probit regression, hence omitted.

13.10 GSM representation of group lasso*


Follow the hints and straightforward algebra.

13.11 Projected gradient descent for l1 regularized least squares

Generally, we take the gradient w.r.t. w and optimize. When there are
constraints on w that could be violated by a gradient step, the increment has
to be modified to stay within the constraints.
We are to calculate:

    min_w { NLL(w) + λ||w||_1 }

Consider the linear regression context:

    NLL(w) = (1/2)||y − Xw||_2²

Since λ||w||_1 cannot be differentiated, we need a non-trivial reformulation.
It is suggested to write:

    w = u − v

    u_i = (w_i)_+ = max{0, w_i}

    v_i = (−w_i)_+ = max{0, −w_i}

With u ≥ 0 and v ≥ 0, we then have:

    ||w||_1 = 1_n^T u + 1_n^T v

The original problem becomes:

    min_{u,v} { (1/2)||y − X(u − v)||_2² + λ 1_n^T u + λ 1_n^T v }
    s.t. u ≥ 0, v ≥ 0

Denote:

    z = (u; v)

Rewrite the original target:

    min_z { f(z) = c^T z + (1/2) z^T A z }
    s.t. z ≥ 0

where:

    c = ( λ1_n − X^T y ; λ1_n + X^T y )

    A = (  X^T X   −X^T X
          −X^T X    X^T X )

The gradient is given by:

    ∇f(z) = c + Az

For ordinary gradient descent:

    z^{k+1} = z^k − α ∇f(z^k)

For the projected case, take g^k with components:

    g_i^k = min{ z_i^k, α (∇f(z^k))_i }

and during the iteration:

    z^{k+1} = z^k − g^k

The original paper suggests a more delicate method to adapt the learning rate;
refer to "Gradient Projection for Sparse Reconstruction: Application to
Compressed Sensing and Other Inverse Problems", Mario A. T. Figueiredo et al.
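Below is a minimal sketch of the projected gradient iteration described above,
with a fixed step size chosen from the largest eigenvalue of A (names mine; the
adaptive step-size rules of the cited paper are not implemented).

# Projected gradient descent for l1-regularized least squares.
import numpy as np

def l1_ls_projected_gradient(X, y, lam, alpha=None, n_iter=500):
    N, D = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    A = np.block([[XtX, -XtX], [-XtX, XtX]])
    c = np.concatenate([lam - Xty, lam + Xty])
    if alpha is None:
        alpha = 1.0 / np.linalg.eigvalsh(A).max()   # conservative step size
    z = np.zeros(2 * D)
    for _ in range(n_iter):
        grad = c + A @ z
        z = np.maximum(z - alpha * grad, 0.0)       # gradient step + projection onto z >= 0
    u, v = z[:D], z[D:]
    return u - v

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10); w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.05 * rng.normal(size=100)
print(l1_ls_projected_gradient(X, y, lam=5.0))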

13.12 Subderivative of the hinge loss function

    ∂f(θ) = {−1}      if θ < 1
    ∂f(θ) = [−1, 0]   if θ = 1
    ∂f(θ) = {0}       if θ > 1

13.13 Lower bounds to convex functions


Refer to ”Rigorous Affine Lower Bound Functions for Multivariate
Polynomials and Their Use in Global Optimisation”.

14 Kernels

15 Gaussian processes
15.1 Reproducing property
We denote κ(x_1, x) by f(x) and κ(x_2, x) by g(x). From the definition:

    f(x) = Σ_{i=1}^∞ f_i φ_i(x)

    κ(x_1, x) = Σ_{i=1}^∞ λ_i φ_i(x_1) φ_i(x)

Since x can be chosen arbitrarily, the following identities hold (the one for g
is obtained similarly):

    f_i = λ_i φ_i(x_1)

    g_i = λ_i φ_i(x_2)

Therefore:

    <κ(x_1, ·), κ(x_2, ·)> = <f, g>
                           = Σ_{i=1}^∞ f_i g_i / λ_i
                           = Σ_{i=1}^∞ λ_i φ_i(x_1) φ_i(x_2)
                           = κ(x_1, x_2)

16 Adaptive basis function models


16.1 Nonlinear regression for inverse dynamics
Practise by yourself.

17 Markov and hidden Markov models


17.1 Derivation of Q function for HMM
First, we estimate the distribution of z_{1:T} w.r.t. θ^old; for the auxiliary
function, we are to calculate the expected log-likelihood w.r.t. θ and z_{1:T}:

    Q(θ, θ^old) = E_{p(z_{1:T}|x_{1:T}, θ^old)}[log p(z_{1:T}, x_{1:T}|θ)]
                = E_p[log Π_{i=1}^N ( p(z_{i,1}|π) Π_{t=2}^{T_i} p(z_{i,t}|z_{i,t−1}, A) Π_{t=1}^{T_i} p(x_{i,t}|z_{i,t}, B) )]
                = E_p[ Σ_{i=1}^N Σ_{k=1}^K I[z_{i,1} = k] log π_k
                     + Σ_{i=1}^N Σ_{t=2}^{T_i} Σ_{j=1}^K Σ_{k=1}^K I[z_{i,t} = k, z_{i,t−1} = j] log A(j, k)
                     + Σ_{i=1}^N Σ_{t=1}^{T_i} Σ_{k=1}^K I[z_{i,t} = k] log p(x_{i,t}|z_{i,t} = k, B) ]

Further, we have 17.98, 17.99 and 17.100; using the definition of expectation
yields 17.97.

17.2 Two filter approach to smoothing in HMMs


For r_t(i) = p(z_t = i|x_{t+1:T}), we have:

    p(z_t = i|x_{t+1:T}) = Σ_j p(z_t = i, z_{t+1} = j|x_{t+1:T})
                         = Σ_j p(z_{t+1} = j|x_{t+1:T}) p(z_t = i|z_{t+1} = j, x_{t+1:T})
                         = Σ_j p(z_{t+1} = j|x_{t+1:T}) p(z_t = i|z_{t+1} = j)
                         = Σ_j p(z_{t+1} = j|x_{t+1:T}) Ψ⁻(j, i)

where Ψ⁻ denotes the transition matrix in the reverse direction. We further
have:

    p(z_{t+1} = j|x_{t+1:T}) = p(z_{t+1} = j|x_{t+1}, x_{t+2:T})
                             ∝ p(z_{t+1} = j, x_{t+1}, x_{t+2:T})
                             = p(x_{t+2:T}) p(z_{t+1} = j|x_{t+2:T}) p(x_{t+1}|z_{t+1} = j, x_{t+2:T})
                             ∝ r_{t+1}(j) φ_{t+1}(j)

Therefore we can calculate r_t(i) recursively:

    r_t(i) ∝ Σ_j r_{t+1}(j) φ_{t+1}(j) Ψ⁻(j, i)

and the recursion is initialized with the prior marginal p(z_T = i).
To rewrite γ_t(i) in terms of the new factors:

    γ_t(i) ∝ p(z_t = i|x_{1:T})
           ∝ p(z_t = i, x_{1:T})
           = p(z_t = i) p(x_{1:T}|z_t = i)
           = p(z_t = i) p(x_{1:t}|z_t = i) p(x_{t+1:T}|z_t = i, x_{1:t})
           = p(z_t = i) p(x_{1:t}|z_t = i) p(x_{t+1:T}|z_t = i)
           = (1/p(z_t = i)) p(x_{1:t}, z_t = i) p(x_{t+1:T}, z_t = i)
           ∝ (1/p(z_t = i)) p(z_t = i|x_{1:t}) p(z_t = i|x_{t+1:T})
           = α_t(i) · r_t(i) / p(z_t = i)

17.3 EM for HMMs with mixture of Gaussian observations

Using a mixture of Gaussians as the emission distribution does not change the
evaluation of γ and ξ, hence the E-step is the same as in exercise 17.1.
As long as A and B are estimated independently, we now focus on estimating
B = (π, µ, Σ) during the M-step; the relevant target function is:

    Σ_{k=1}^K Σ_{i=1}^N Σ_{t=1}^{T_i} γ_{i,t}(k) log p(x_{i,t}|B)

Since the parameters are independent across k, we consider the case where k is
fixed. We also re-index the double iteration over i = 1..N and t = 1..T_i by
n = 1..T with T = Σ_{i=1}^N T_i; the log-likelihood now takes the form:

    Σ_{n=1}^T γ_n(k) log p(x_n|π_k, µ_k, Σ_k)

This can be seen as a weighted log-likelihood of a mixture of Gaussians; assume
the mixture contains C (it should be C_k, but this notation causes no
contradiction once k is fixed) Gaussians. We apply another EM procedure inside
the M-step of this HMM. Denote the latent variable corresponding to x_n by
h_{n,k}. Estimating the distribution p(h_{n,k}|x_n, π_k, µ_k, Σ_k) is
tantamount to the E-step used for an ordinary mixture of Gaussians; denote the
expectations of h_{n,k}'s components by γ'_{c,n}(k).
Now apply the M-step of the mixture of Gaussians; recall that the auxiliary
function takes the form:

    Σ_{n=1}^T γ_n(k) Σ_{c=1}^C γ'_{c,n}(k) {log π_{k,c} + log N(x_n|µ_{k,c}, Σ_{k,c})}

Hence this HMM reweights an ordinary mixture of Gaussians, with the weight
changed from γ'_{c,n}(k) to γ_n(k) · γ'_{c,n}(k). The rest of the estimation is
simply the application of the M-step of a mixture of Gaussians with the new
weights.

17.4 EM for HMMs with tied mixtures


Recall the conclusion from exercise 17.3; the inner M-step now takes the form:

    Σ_{k=1}^K Σ_{n=1}^T Σ_{c=1}^C γ_{c,n}(k) {log π_{k,c} + log N(x_n|µ_c, Σ_c)}

where we update the meaning of γ accordingly, and we remove k from the
subscripts of µ and Σ given the conditions of this exercise (tied mixtures).
It is easy to notice that this target function again takes the form of the
M-step target for an ordinary mixture of Gaussians. Treating each k
independently and updating π_k gives the learning process for the K sets of
mixing weights; summing out k, the C shared Gaussian parameters can be updated.

18 State space models


18.1 Derivation of EM for LG-SSM
We directly work on the auxiliary function:

    Q(θ, θ^old) = E_{p(Z|Y,θ^old)}[log Π_{n=1}^N p(z_{n,1:T_n}, y_{n,1:T_n}|θ)]
                = E[Σ_{n=1}^N (log p(z_{n,1}) + Σ_{i=2}^{T_n} log p(z_{n,i}|z_{n,i−1}) + Σ_{i=1}^{T_n} log p(y_{n,i}|z_{n,i}))]
                = E[Σ_{n=1}^N (log N(z_{n,1}|µ_0, Σ_0) + Σ_{i=2}^{T_n} log N(z_{n,i}|A_i z_{n,i−1} + B_i u_i, Q_i)
                             + Σ_{i=1}^{T_n} log N(y_{n,i}|C_i z_{n,i} + D_i u_i, R_i))]
                = E[ N log(1/|Σ_0|^{1/2}) − (1/2) Σ_{n=1}^N (z_{n,1} − µ_0)^T Σ_0^{-1} (z_{n,1} − µ_0)
                   + Σ_{i=2}^T N_i log(1/|Q_i|^{1/2})
                     − (1/2) Σ_{n=1}^{N_i} (z_{n,i} − A_i z_{n,i−1} − B_i u_i)^T Q_i^{-1} (z_{n,i} − A_i z_{n,i−1} − B_i u_i)
                   + Σ_{i=1}^T N_i log(1/|R_i|^{1/2})
                     − (1/2) Σ_{n=1}^{N_i} (y_{n,i} − C_i z_{n,i} − D_i u_i)^T R_i^{-1} (y_{n,i} − C_i z_{n,i} − D_i u_i) ] + const

When exchanging the order of summation over the data, T = max_n {T_n} and N_i
denotes the number of sequences whose length is at least i.
To estimate µ_0, take the related terms:

    E[−(1/2) Σ_{n=1}^N (z_{n,1} − µ_0)^T Σ_0^{-1} (z_{n,1} − µ_0)]

Take the derivative w.r.t. µ_0 of:

    E[ Σ_{n=1}^N (−(1/2) µ_0^T Σ_0^{-1} µ_0 + z_{n,1}^T Σ_0^{-1} µ_0) ]

Setting it to zero yields:

    µ_0 = (1/N) Σ_{n=1}^N E[z_{n,1}]

It is obvious that this estimation is similar to that for an MVN with x_n
replaced by E[z_{n,1}]. This similarity works for the other parameters as well;
for example, estimating Σ_0 is tantamount to estimating the covariance of an
MVN with the data terms replaced.
The same analysis works for Q_i and R_i as well. To estimate the coefficient
matrices, consider A_i first. The related term is:

    E[ Σ_{n=1}^{N_i} (z_{n,i−1}^T A_i^T Q_i^{-1} A_i z_{n,i−1} − 2 z_{n,i−1}^T A_i^T Q_i^{-1} (z_{n,i} − B_i u_i)) ]

Setting the derivative to zero yields a solution similar to that for µ_0; the
same analysis can be applied to B_i, C_i and D_i as well.

18.2 Seasonal LG-SSM model in standard form


From Fig. 18.6(a), we have:

    A = [ 1        1        0        0^T_{S−1}
          0        1        0        0^T_{S−1}
          0        0        1        0^T_{S−1}
          0_{S−1}  0_{S−1}  I        0_{S−1}  ]

    Q = [ Q_a  0^T_{S+1}
          0    Q_b  0^T_S
          0    0    Q    0^T_{S−1}
          0_{(S−1)×(S+2)}           ]

    C = [ 1  1  1  0^T_{S−1} ]

where 0_n denotes a column vector of zeros of length n, and 0_{m×n} denotes an
m × n matrix of zeros.

19 Undirected graphical models (Markov random fields)
19.1 Derivation of the log partition function
According to the definition:

    Z(θ) = Σ_y Π_{c∈C} ψ_c(y_c|θ_c)

It is straightforward to give:

    ∂ log Z(θ)/∂θ_{c'} = ∂/∂θ_{c'} log Σ_y Π_{c∈C} ψ_c(y_c|θ_c)
                       = (1/Z(θ)) Σ_y ∂/∂θ_{c'} Π_{c∈C} ψ_c(y_c|θ_c)
                       = (1/Z(θ)) Σ_y Π_{c≠c'} ψ_c(y_c|θ_c) · ∂/∂θ_{c'} ψ_{c'}(y_{c'}|θ_{c'})
                       = (1/Z(θ)) Σ_y Π_{c≠c'} ψ_c(y_c|θ_c) · ∂/∂θ_{c'} exp{θ_{c'}^T φ_{c'}(y_{c'})}
                       = (1/Z(θ)) Σ_y Π_{c∈C} ψ_c(y_c|θ_c) φ_{c'}(y_{c'})
                       = Σ_y φ_{c'}(y_{c'}) (1/Z(θ)) Π_{c∈C} ψ_c(y_c|θ)
                       = Σ_y φ_{c'}(y_{c'}) p(y|θ)
                       = E[φ_{c'}(y_{c'})|θ]

19.2 CI properties of Gaussian graphical models


Problem a:
We have:

    Σ = [ 0.75  0.5   0.25
          0.5   1.0   0.5
          0.25  0.5   0.75 ]

and:

    Λ = Σ^{-1} = [  2  −1   0
                   −1   2  −1
                    0  −1   2 ]

Since Λ_{1,3} = 0, we have the conditional independence X1 ⊥ X3 | X2, which
corresponds to the chain MRF:

    X1 — X2 — X3

Problem b: The inverse of Σ contains no zero element, hence there is no
conditional independence. Therefore there have to be edges between every pair
of vertices, i.e. the fully connected MRF over X1, X2 and X3.
This model, however, cannot encode the marginal independence X1 ⊥ X3. It is
possible to model this set of properties with a Bayesian network with the two
directed edges X1 → X2 and X3 → X2.
Problem c: Consider the terms inside the exponential:

    −(1/2)[ x1² + (x2 − x1)² + (x3 − x2)² ]

It is easy to see that the precision and covariance matrices are:

    Λ = [  2  −1   0          Σ = [ 1  1  1
          −1   2  −1                1  2  2
           0  −1   1 ],             1  2  3 ]

Problem d: The only independence is X1 ⊥ X3 | X2, i.e. the chain:

    X1 — X2 — X3

19.3 Independencies in Gaussian graphical models


Problems a and b:
This PGM implies X1 ⊥ X3 | X2, hence we are looking for a precision matrix with
Λ_{1,3} = 0; thus C and D meet the condition. On the other hand,
(A^{-1})_{1,3} = (B^{-1})_{1,3} = 0, so A and B are candidates for the
covariance matrix.
Problems c and d:
This PGM says that X1 ⊥ X3. Hence C and D can be covariance matrices, and A and
B can be precision matrices.
The only possible PGM is the chain:

    X1 — X2 — X3

Problem e:
The answer can be derived directly from the marginalization property of the
Gaussian: A is true while B is not.

19.4 Cost of training MRFs and CRFs


The answers are, respectively:

    O(r(N c + 1))

and

    O(r(N c + N))

19.5 Full conditional in an Ising model


Straightforwardly (we omit θ from the conditioning w.l.o.g.):

    p(x_k = 1|x_{−k}) = p(x_k = 1, x_{−k}) / p(x_{−k})
                      = p(x_k = 1, x_{−k}) / (p(x_k = 0, x_{−k}) + p(x_k = 1, x_{−k}))
                      = 1 / (1 + p(x_k = 0, x_{−k})/p(x_k = 1, x_{−k}))
                      = 1 / (1 + exp(h_k · 0) Π_{<k,i>} exp(J_{k,i} · 0 · x_i) / [exp(h_k · 1) Π_{<k,i>} exp(J_{k,i} · 1 · x_i)])
                      = σ(h_k + Σ_{i≠k} J_{k,i} x_i)

When the states are instead encoded as x ∈ {−1, 1}, the full conditional
becomes:

    p(x_k = 1|x_{−k}) = σ(2 (h_k + Σ_{i≠k} J_{k,i} x_i))
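The full conditional derived above is exactly what a Gibbs sampler for the
Ising model uses. Below is a minimal sketch for a 2D grid with x in {0, 1}
(grid size, coupling J and field h are arbitrary choices of mine).

# Gibbs sampling for a grid Ising model using the full conditional above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_ising(J=1.0, h=0.0, size=20, n_sweeps=200, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.integers(0, 2, size=(size, size))
    for _ in range(n_sweeps):
        for i in range(size):
            for j in range(size):
                # sum over the 4 grid neighbours (free boundary)
                s = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                        if 0 <= a < size and 0 <= b < size)
                x[i, j] = rng.random() < sigmoid(h + J * s)
    return x

print(gibbs_ising().mean())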

20 Exact inference for graphical models


20.1 Variable elimination
The figure required by this exercise is missing here.

20.2 Gaussian times Gaussian is Gaussian


We have:

    N(x|µ_1, λ_1^{-1}) × N(x|µ_2, λ_2^{-1})
    = (√(λ_1 λ_2)/(2π)) exp{ −(λ_1/2)(x − µ_1)² − (λ_2/2)(x − µ_2)² }
    = (√(λ_1 λ_2)/(2π)) exp{ −((λ_1 + λ_2)/2) x² + (λ_1 µ_1 + λ_2 µ_2) x − (λ_1 µ_1² + λ_2 µ_2²)/2 }

By completing the square:

    exp{ −((λ_1 + λ_2)/2) x² + (λ_1 µ_1 + λ_2 µ_2) x − (λ_1 µ_1² + λ_2 µ_2²)/2 }
    = c · exp{ −(λ/2)(x − µ)² }

where:

    λ = λ_1 + λ_2

    µ = λ^{-1}(λ_1 µ_1 + λ_2 µ_2)

The constant factor c can be obtained by collecting the constant terms inside
the exponential.
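The precision-weighted product rule can be checked numerically; the sketch
below (arbitrary parameter values, names mine) verifies that the pointwise
product of the two densities is proportional to N(x|µ, λ^{-1}).

# Product of two 1D Gaussians has precision lambda1 + lambda2.
import numpy as np
from scipy.stats import norm

mu1, lam1, mu2, lam2 = 0.0, 2.0, 3.0, 0.5
lam = lam1 + lam2
mu = (lam1 * mu1 + lam2 * mu2) / lam

x = np.linspace(-5, 8, 400)
prod = norm.pdf(x, mu1, lam1 ** -0.5) * norm.pdf(x, mu2, lam2 ** -0.5)
ratio = prod / norm.pdf(x, mu, lam ** -0.5)
assert np.allclose(ratio, ratio[0])    # constant ratio => same Gaussian shape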

20.3 Message passing on a tree


Problem a:
It is easy to see after variable elimination:

    p(X2 = 50) = Σ_{G1} Σ_{G2} p(G1) p(G2|G1) p(X2 = 50|G2)

    p(G1 = 1, X2 = 50) = p(G1 = 1) Σ_{G2} p(G2|G1 = 1) p(X2 = 50|G2)

Thus:

    p(G1 = 1|X2 = 50) = (0.45 + 0.05 · exp(−5)) / (0.5 + 0.5 · exp(−5)) ≈ 0.9

Problem b (here X denotes X2 or X3):

    p(G1 = 1|X2 = 50, X3 = 50)
    = p(G1 = 1, X2 = 50, X3 = 50) / p(X2 = 50, X3 = 50)
    = p(G1 = 1) p(X2|G1 = 1) p(X3|G1 = 1)
      / [p(G1 = 0) p(X2|G1 = 0) p(X3|G1 = 0) + p(G1 = 1) p(X2|G1 = 1) p(X3|G1 = 1)]
    = p(X = 50|G1 = 1)² / [p(X = 50|G1 = 0)² + p(X = 50|G1 = 1)²]
    ≈ 0.9² / (0.1² + 0.9²) ≈ 0.99

The extra evidence makes the belief in G1 = 1 firmer.
Problem c:
The answer to problem c is symmetric to that of problem b:
p(G1 = 0|X2 = 60, X3 = 60) ≈ 0.99.
Problem d:
Using the same pattern of analysis as in problem b, we have:

    p(G1 = 1|X2 = 50, X3 = 60)
    = p(X = 50|G1 = 1) p(X = 60|G1 = 1)
      / [p(X = 50|G1 = 0) p(X = 60|G1 = 0) + p(X = 50|G1 = 1) p(X = 60|G1 = 1)]

Notice that:

    p(X = 50|G1 = 1) = p(X = 60|G1 = 0)

    p(X = 50|G1 = 0) = p(X = 60|G1 = 1)

Hence:

    p(G1 = 1|X2 = 50, X3 = 60) = 0.5

In this case, X2 and X3 are equally strong pieces of evidence and their effects
balance out, so they do not provide enough information to distort the prior
knowledge.

20.4 Inference in 2D lattice MRFs


Please refer to "Probabilistic Graphical Models: Principles and Techniques",
section 11.4.1.

21 Variational inference
21.1 Laplace approximation to p(µ, log σ|D) for a univariate Gaussian

The Laplace approximation amounts to representing f(µ, l) = log p(µ, l = log σ|D)
by a second-order Taylor expansion. We have:

    log p(µ, l|D) = log p(µ, l, D) − log p(D)
                  = log p(µ, l) + log p(D|µ, l) + c
                  = log p(D|µ, l) + c
                  = Σ_{n=1}^N log[ (1/√(2πσ²)) exp{−(y_n − µ)²/(2σ²)} ] + c
                  = −N log σ − Σ_{n=1}^N (y_n − µ)²/(2σ²) + c
                  = −N · l − (1/(2 exp{2l})) Σ_{n=1}^N (y_n − µ)² + c

Thus we derive:

    ∂ log p(µ, l|D)/∂µ = (1/exp{2l}) Σ_{n=1}^N (y_n − µ) = (N/σ²)(ȳ − µ)

    ∂ log p(µ, l|D)/∂l = −N + (1/σ²) Σ_{n=1}^N (y_n − µ)²

    ∂² log p(µ, l|D)/∂µ² = −N/σ²

    ∂² log p(µ, l|D)/∂l² = −(2/σ²) Σ_{n=1}^N (y_n − µ)²

    ∂² log p(µ, l|D)/∂µ∂l = −(2N/σ²)(ȳ − µ)

The approximation is p(µ, l|D) ≈ N((µ, l)|m, Σ), where m = (µ̂, l̂) is the mode
obtained by setting the first derivatives to zero, and

    Σ = −H^{-1},   H = [ ∂² log p/∂µ²    ∂² log p/∂µ∂l
                         ∂² log p/∂µ∂l   ∂² log p/∂l²  ]

evaluated at the mode.
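Putting the derivatives above together, the Laplace approximation can be
computed in closed form at the mode. A minimal numeric sketch (synthetic data,
names mine) follows.

# Laplace approximation for (mu, l = log sigma) of a univariate Gaussian.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100)
N, ybar = len(y), y.mean()

mu_hat = ybar                                  # from d/dmu = 0
sigma2_hat = np.mean((y - mu_hat) ** 2)        # from d/dl = 0
l_hat = 0.5 * np.log(sigma2_hat)

S = np.sum((y - mu_hat) ** 2)
H = np.array([[-N / sigma2_hat, -2 * N * (ybar - mu_hat) / sigma2_hat],
              [-2 * N * (ybar - mu_hat) / sigma2_hat, -2 * S / sigma2_hat]])
Sigma = -np.linalg.inv(H)                      # posterior covariance of (mu, l)
print((mu_hat, l_hat), Sigma)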

21.2 Laplace approximation to normal-gamma


This is the same as exercise 21.1 when the prior is uninformative. We formally
substitute:

    Σ_{n=1}^N (y_n − µ)² = Σ_{n=1}^N ((y_n − ȳ) − (µ − ȳ))²
                         = Σ_{n=1}^N (y_n − ȳ)² + N(µ − ȳ)² − 2(µ − ȳ) Σ_{n=1}^N (y_n − ȳ)
                         = N s² + N(µ − ȳ)²

where s² = (1/N) Σ_{n=1}^N (y_n − ȳ)².
The conclusions for problems a, b and c are all contained in the previous
solution.

21.3 Variational lower bound for VB for univariate Gaussian

What is left in section 21.5.1.6 is the derivation of 21.86 to 21.91. We omit
the derivation of the entropy and moments of a Gaussian, which can be found in
any information theory textbook. We now derive E[ln x | x ~ Ga(a, b)], which in
turn yields the entropy of a Gamma distribution.
The Gamma distribution is an exponential family distribution:

    Ga(x|a, b) = (b^a/Γ(a)) x^{a−1} exp{−b x}
               ∝ exp{−b x + (a − 1) ln x}
               = exp{φ(x)^T θ}

The sufficient statistics are φ(x) = (x, ln x)^T and the natural parameters are
θ = (−b, a − 1)^T. Thus the Gamma distribution can be seen as the maximum
entropy distribution under constraints on x and ln x.
The cumulant function is given by:

    A(θ) = log Z(θ) = log(Γ(a)/b^a) = log Γ(a) − a log b

The expectation of a sufficient statistic is given by the derivative of the
cumulant function with respect to the corresponding natural parameter,
therefore:

    E[ln x] = ∂A/∂(a − 1) = Γ'(a)/Γ(a) − log b

According to the definition ψ(a) = Γ'(a)/Γ(a):

    E[ln x] = ψ(a) − log b

The remaining derivations are either already complete or trivial.

21.4 Variational lower bound for VB for GMMs


The lower bound is given by:

    E_q[log(p(θ, D)/q(θ))] = E_q[log p(θ, D)] − E_q[log q(θ)]
                           = E_q[log p(D|θ)] + E_q[log p(θ)] − E_q[log q(θ)]
                           = E[log p(x|z, µ, Λ, π)] + E[log p(z, µ, Λ, π)] − E[log q(z, µ, Λ, π)]
                           = E[log p(x|z, µ, Λ, π)] + E[log p(z|π)] + E[log p(π)] + E[log p(µ, Λ)]
                             − E[log q(z)] − E[log q(π)] − E[log q(µ, Λ)]

We now show 21.209 to 21.215.
For 21.209:

    E[log p(x|z, µ, Λ)] = E_{q(z)q(µ,Λ)}[log p(x|z, µ, Λ)]
                        = Σ_n Σ_k E_{q(z)q(µ,Λ)}[ z_nk (−(D/2) log 2π + (1/2) log|Λ_k| − (1/2)(x_n − µ_k)^T Λ_k (x_n − µ_k)) ]

Using 21.132 and rewriting the sums in terms of the cluster averages x̄_k yields
the stated result.
For 21.210:

    E[log p(z|π)] = E_{q(z)q(π)}[log p(z|π)]
                  = E_{q(z)q(π)}[log Π_{n=1}^N Π_{k=1}^K π_k^{z_nk}]
                  = Σ_{n=1}^N Σ_{k=1}^K E_{q(z)q(π)}[z_nk log π_k]
                  = Σ_{n=1}^N Σ_{k=1}^K E_{q(z)}[z_nk] E_{q(π)}[log π_k]
                  = Σ_{n=1}^N Σ_{k=1}^K r_nk log π̄_k

For 21.211:

    E[log p(π)] = E_{q(π)}[log p(π)]
                = E_{q(π)}[log(C · Π_{k=1}^K π_k^{α_0−1})]
                = ln C + (α_0 − 1) Σ_{k=1}^K log π̄_k

For 21.212:

    E[log p(µ, Λ)] = E_{q(µ,Λ)}[log p(µ, Λ)]
                   = E_{q(µ,Λ)}[log Π_{k=1}^K Wi(Λ_k|L_0, v_0) · N(µ_k|m_0, (β_0 Λ_k)^{-1})]
                   = Σ_{k=1}^K E_{q(µ,Λ)}[ log C + ((v_0 − D − 1)/2) log|Λ_k| − (1/2) tr(Λ_k L_0^{-1})
                       − (D/2) log 2π + (1/2) log|β_0 Λ_k| − (1/2)(µ_k − m_0)^T (β_0 Λ_k)(µ_k − m_0) ]

Using 21.131 to expand the expected value of the quadratic form, and using the
fact that the mean of a Wishart distribution is v_k L_k, we are done.
For 21.213:

    E[log q(z)] = E_{q(z)}[log q(z)]
                = E_{q(z)}[Σ_i Σ_k z_ik log r_ik]
                = Σ_i Σ_k E_{q(z)}[z_ik] log r_ik
                = Σ_i Σ_k r_ik log r_ik

For 21.214:

    E[log q(π)] = E_{q(π)}[log q(π)]
                = E_{q(π)}[log C + Σ_{k=1}^K (α_k − 1) log π_k]
                = log C + Σ_k (α_k − 1) log π̄_k

For 21.215:

    E[log q(µ, Λ)] = E_{q(µ,Λ)}[log q(µ, Λ)]
                   = Σ_k E_{q(µ,Λ)}[ log q(Λ_k) − (D/2) log 2π + (1/2) log|β_k Λ_k|
                       − (1/2)(µ_k − m_k)^T (β_k Λ_k)(µ_k − m_k) ]

Using 21.132 to expand the quadratic form gives
E[(µ_k − m_k)^T (β_k Λ_k)(µ_k − m_k)] = D.

21.5 Derivation of E[log πk] under a Dirichlet distribution

The Dirichlet distribution is an exponential family distribution; we have:

    φ(π) = (log π_1, log π_2, ..., log π_K)

    θ = α

The cumulant function is:

    A(α) = log B(α) = Σ_{i=1}^K log Γ(α_i) − log Γ(Σ_{i=1}^K α_i)

And:

    E[log π_k] = ∂A(α)/∂α_k = Γ'(α_k)/Γ(α_k) − Γ'(Σ_{i=1}^K α_i)/Γ(Σ_{i=1}^K α_i) = ψ(α_k) − ψ(Σ_{i=1}^K α_i)

Taking the exponential of both sides:

    exp(E[log π_k]) = exp(ψ(α_k) − ψ(Σ_{i=1}^K α_i)) = exp(ψ(α_k)) / exp(ψ(Σ_{i=1}^K α_i))
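The identity E[log π_k] = ψ(α_k) − ψ(Σ_i α_i) is easy to confirm by Monte
Carlo; a small sketch (α chosen arbitrarily) follows.

# Monte Carlo check of E[log pi_k] under a Dirichlet distribution.
import numpy as np
from scipy.special import digamma

alpha = np.array([2.0, 0.5, 3.0])
analytic = digamma(alpha) - digamma(alpha.sum())

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=500_000)
empirical = np.log(samples).mean(axis=0)
assert np.allclose(analytic, empirical, atol=0.01)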

21.6 Alternative derivation of the mean field updates for the Ising model

This is no different from applying the procedure in section 21.3.1 before
deriving the updates, hence omitted.

21.7 Forwards vs reverse KL divergence


We have:

    KL(p(x, y)||q(x, y)) = E_{p(x,y)}[log(p(x, y)/q(x, y))]
                         = Σ_{x,y} p(x, y) log p(x, y) − Σ_{x,y} p(x, y) log q(x) − Σ_{x,y} p(x, y) log q(y)
                         = −H(p(x, y)) − Σ_x p(x) log q(x) − Σ_y p(y) log q(y)
                         = −H(p(x, y)) + H(p(x)) + H(p(y)) + KL(p(x)||q(x)) + KL(p(y)||q(y))
                         = constant + KL(p(x)||q(x)) + KL(p(y)||q(y))

Thus the optimal approximation is q(x) = p(x) and q(y) = p(y).
We skip the practical part.

21.8 Derivation of the structured mean field updates for FHMM

According to the general result for mean-field variational methods, we have:

    E(x_m) = E_{q/m}[E(p̄(x_m))]

Thus:

    −Σ_{t=1}^T Σ_{k=1}^K x_{t,m,k} ε̃_{t,m,k}
        = E[ (1/2) Σ_{t=1}^T (y_t − Σ_{l≠m} W_l x_{t,l} − W_m x_{t,m})^T Σ^{-1} (y_t − Σ_{l≠m} W_l x_{t,l} − W_m x_{t,m}) ] + C

Comparing the coefficient of x_{t,m,k} (i.e. setting x_{t,m,k} to 1) ends in:

    ε̃_{t,m,k} = ( W_m^T Σ^{-1} (y_t − Σ_{l≠m} W_l E[x_{t,l}]) )_k − (1/2)(W_m^T Σ^{-1} W_m)_{k,k}

Writing this in matrix form yields 21.62.

21.9 Variational EM for binary FA with sigmoid link


Refer to ”Probabilistic Visualisation of High-Dimensional Binary Data,
Tipping, 1998”.

21.10 VB for binary FA with probit link


The major difference when using a probit link is the discontinuous likelihood
caused by p(y_i = 1|z_i) = I(z_i > 0). In the setting where X is hidden, we
assume Gaussian priors on X, W and Z. The approximation takes the form:

    q(X, Z, W) = Π_{l=1}^L q(w_l) Π_{i=1}^N q(x_i) q(z_i)

This is a mean-field approximation, hence in an algorithm similar to EM we
update the distributions of X, Z and W stepwise.
For the variable X, we have:

    log q(x_i) = E_{q(z_i)q(w)}[log p(x_i, w, z_i, y_i)]
               = E_{q(z_i)q(w)}[log p(x_i) + log p(w) + log p(z_i|x_i, w) + log p(y_i|z_i)]

Given the form of the likelihood, for i corresponding to y_i = 1, q(z_i) has to
be a truncated Gaussian, i.e. we only consider expectations of the form
E[z|z > µ] and E[z²|z > µ].

    log q(x_i) = −(1/2) x_i^T Λ_1 x_i − (1/2) E[z²] − (1/2) x_i^T E[w w^T] x_i + E[z] E[w]^T x_i + c

where Λ_1 is the precision of x_i's prior distribution, E[w w^T] can be
calculated given the Gaussian form of q(w), and the truncated expectations E[z]
and E[z²] can be obtained from the solution to exercise 11.15. It is obvious
that q(x_i) is a Gaussian.
The update for w is similar to that for x_i, as they play symmetric roles in
the likelihood; the only difference is that we have to sum over i when updating
w.
At last we update z_i:

    log q(z_i) = E_{q(x_i)q(w)}[log p(z_i|x_i, w) + log p(y_i|z_i)]

Inside the expectation we have:

    −(1/2) z_i² + E[w]^T E[x] z_i + c

Therefore q(z_i) again takes a (truncated) Gaussian form.
