Machine Learning: A Probabilistic Perspective
Solution Manual, Version 1.1

Fangqi Li, SJTU

Contents
1 Introduction
1.1 Constitution of this document
1.2 On Machine Learning: A Probabilistic Perspective
1.3 What is this document?
1.4 Updating log

2 Probability
2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
2.2 Legal reasoning
2.3 Variance of a sum
2.4 Bayes rule for medical diagnosis
2.5 The Monty Hall problem (the dilemma of three doors)
2.6 Conditional independence
2.7 Pairwise independence does not imply mutual independence
2.8 Conditional independence iff joint factorizes
2.9 Conditional independence*
2.10 Deriving the inverse gamma density
2.11 Normalization constant for a 1D Gaussian
2.12 Expressing mutual information in terms of entropies
2.13 Mutual information for correlated normals
2.14 A measure of correlation
2.15 MLE minimizes KL divergence to the empirical distribution
2.16 Mean, mode, variance for the beta distribution
2.17 Expected value of the minimum

3 Generative models for discrete data
3.1 MLE for the Bernoulli/binomial model
3.2 Marginal likelihood for the Beta-Bernoulli model
3.3 Posterior predictive for Beta-Binomial model
3.4 Beta updating from censored likelihood
3.5 Uninformative prior for log-odds ratio
3.6 MLE for the Poisson distribution
3.7 Bayesian analysis of the Poisson distribution
3.8 MLE for the uniform distribution
3.9 Bayesian analysis of the uniform distribution
3.10 Taxicab problem*
3.11 Bayesian analysis of the exponential distribution
3.12 MAP estimation for the Bernoulli with non-conjugate priors*
3.13 Posterior predictive distribution for a batch of data with the Dirichlet-multinomial model
3.14 Posterior predictive for Dirichlet-multinomial*
3.15 Setting the hyper-parameters I*
3.16 Setting the beta hyper-parameters II
3.17 Marginal likelihood for beta-binomial under uniform prior
3.18 Bayes factor for coin tossing*
3.19 Irrelevant features with naive Bayes
3.20 Class conditional densities for binary data
3.21 Mutual information for naive Bayes classifiers with binary features
3.22 Fitting a naive Bayesian spam filter by hand*

4 Gaussian models
4.1 Uncorrelated does not imply independent
4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian
4.3 Correlation coefficient is between -1 and 1
4.4 Correlation coefficient for linearly related variables is 1 or -1
4.5 Normalization constant for a multidimensional Gaussian
4.6 Bivariate Gaussian
4.7 Conditioning a bivariate Gaussian
4.8 Whitening vs standardizing
4.9 Sensor fusion with known variances in 1d
4.10 Derivation of information form formulae for marginalizing and conditioning
4.11 Derivation of the NIW posterior
4.12
4.13 Gaussian posterior credible interval
4.14 MAP estimation for 1d Gaussians
4.15 Sequential (recursive) updating of covariance matrix
4.16 Likelihood ratio for Gaussians
4.17 LDA/QDA on height/weight data
4.18 Naive Bayes with mixed features
4.19 Decision boundary for LDA with semi tied covariances
4.20 Logistic regression vs LDA/QDA
4.21 Gaussian decision boundaries
4.22 QDA with 3 classes
4.23 Scalar QDA

5 Bayesian statistics
5.1 Proof that a mixture of conjugate priors is indeed conjugate
5.2 Optimal threshold on classification probability
5.3 Reject option in classifiers
5.4 More reject options
5.5 Newsvendor problem
5.6 Bayes factors and ROC curves
5.7 Bayes model averaging helps predictive accuracy
5.8 MLE and model selection for a 2d discrete distribution
5.9 Posterior median is optimal estimate under L1 loss
5.10 Decision rule for trading off FPs and FNs

6 Frequentist statistics

7 Linear regression
7.1 Behavior of training set error with increasing sample size
7.2 Multi-output linear regression
7.3 Centering and ridge regression
7.4 MLE for σ² for linear regression
7.5 MLE for the offset term in linear regression
7.6 MLE for simple linear regression
7.7 Sufficient statistics for online linear regression
7.8 Bayesian linear regression in 1d with known σ²
7.9 Generative model for linear regression
7.10 Bayesian linear regression using the g-prior

8 Logistic regression
8.1 Spam classification using logistic regression
8.2 Spam classification using naive Bayes
8.3 Gradient and Hessian of log-likelihood for logistic regression
8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression
8.5 Symmetric version of l2 regularized multinomial logistic regression
8.6 Elementary properties of l2 regularized logistic regression
8.7 Regularizing separate terms in 2d logistic regression

9 Generalized linear models and the exponential family
9.1 Conjugate prior for univariate Gaussian in exponential family form
9.2 The MVN is in the exponential family

10 Directed graphical models (Bayes nets)

11 Mixture models and the EM algorithm
11.1 Student T as infinite mixture of Gaussian
11.2 EM for mixture of Gaussians
11.3 EM for mixtures of Bernoullis
11.4 EM for mixture of Student distributions
11.5 Gradient descent for fitting GMM
11.6 EM for a finite scale mixture of Gaussians
11.7 Manual calculation of the M step for a GMM
11.8 Moments of a mixture of Gaussians
11.9 K-means clustering by hand
11.10 Deriving the K-means cost function
11.11 Visible mixtures of Gaussians are in exponential family
11.12 EM for robust linear regression with a Student t likelihood
11.13 EM for EB estimation of Gaussian shrinkage model
11.14 EM for censored linear regression
11.15 Posterior mean and variance of a truncated Gaussian

12 Latent linear models
12.1 M-step for FA
12.2 MAP estimation for the FA model
12.3 Heuristic for assessing applicability of PCA*
12.4 Deriving the second principal component
12.5 Deriving the residual error for PCA
12.6 Derivation of Fisher's linear discriminant
12.7 PCA via successive deflation
12.8 Latent semantic indexing
12.9 Imputation in a FA model*
12.10 Efficiently evaluating the PPCA density
12.11 PPCA vs FA

13 Sparse linear models
13.1 Partial derivative of the RSS
13.2 Derivation of M-step for EB for linear regression
13.3 Derivation of fixed point updates for EB for linear regression*
13.4 Marginal likelihood for linear regression*
13.5 Reducing elastic net to lasso
13.6 Shrinkage in linear regression
13.7 Prior for the Bernoulli rate parameter in the spike and slab model
13.8 Deriving E step for GSM prior
13.9 EM for sparse probit regression with Laplace prior
13.10 GSM representation of group lasso*
13.11 Projected gradient descent for l1 regularized least squares
13.12 Subderivative of the hinge loss function
13.13 Lower bounds to convex functions

14 Kernels

15 Gaussian processes
15.1 Reproducing property

16 Adaptive basis function models
16.1 Nonlinear regression for inverse dynamics

17 Markov and hidden Markov models
17.1 Derivation of Q function for HMM
17.2 Two filter approach to smoothing in HMMs
17.3 EM for HMMs with mixture of Gaussian observations
17.4 EM for HMMs with tied mixtures

18 State space models
18.1 Derivation of EM for LG-SSM
18.2 Seasonal LG-SSM model in standard form

19 Undirected graphical models (Markov random fields)
19.1 Derivation of the log partition function
19.2 CI properties of Gaussian graphical models
19.3 Independencies in Gaussian graphical models
19.4 Cost of training MRFs and CRFs
19.5 Full conditional in an Ising model

20 Exact inference for graphical models
20.1 Variable elimination
20.2 Gaussian times Gaussian is Gaussian
20.3 Message passing on a tree
20.4 Inference in 2D lattice MRFs

21 Variational inference
21.1 Laplace approximation to p(µ, log σ|D) for a univariate Gaussian
21.2 Laplace approximation to normal-gamma
21.3 Variational lower bound for VB for univariate Gaussian
21.4 Variational lower bound for VB for GMMs
21.5 Derivation of E[log πk]
21.6 Alternative derivation of the mean field updates for the Ising model
21.7 Forwards vs reverse KL divergence
21.8 Derivation of the structured mean field updates for FHMM
21.9 Variational EM for binary FA with sigmoid link
21.10 VB for binary FA with probit link

1 Introduction
1.1 Constitution of this document
Here we should have presented the solutions to the problems in Chapter One of Machine Learning: A Probabilistic Perspective (MLAPP). Since the number of problems in that chapter is zero, we use this section as an introduction to this document, i.e. a solution manual.
This document provides detailed solutions to almost all problems of the textbook MLAPP, from Chapter One to Chapter Fourteen (Chinese version) / Twenty-one (English version). We generally leave the restatement of the problems to the readers themselves.
There are two classes of problems in MLAPP: theoretical derivations and practical projects. We provide solutions to most derivation problems, apart from those which are nothing but straightforward algebra (and a few which we failed to solve). Practical problems, which are based on a MATLAB toolbox, are beyond the scope of this document.

1.2 On Machine Learning: A Probabilistic Perspective


Booming studies and literature have made the boundary of "machine learning" vague.
On the one hand, the rapid development of AI technology has kept society amazed, which has also resulted in a sharp increase in the number of students who try to take related courses in college. On the other hand, some scholars remain uncertain about learning-related theories, especially deep learning.
The extraordinary achievements of machine learning in recent years often make one forget that this discipline has undergone a long evolution, whose establishment dates back at least to the studies of "electronic brains" in the 1940s. Be that as it may, machine learning has not been formulated as a "closed" theory. Even in some research communities, machine learning is crowned metaphysics or alchemy. Personally, I believe that being called metaphysics is a common experience shared by many branches of theory that are undergoing rapid development.
To become a complete theory, machine learning is still looking for a way to settle itself into a closed system. The most successful attempt so far has been the one based on probability. As commented by David Blei from Princeton on the back of MLAPP: "In Machine Learning, the language of probability and statistics reveals important connections between seemingly disparate algorithms and strategies. Thus, its readers will become articulate in a holistic view of the state of the art and poised to build the next generation of machine learning algorithms."
The crucial idea in MLAPP is that machine learning is tantamount to Bayesian statistics, which draws connections between numerous "independent" algorithms. But the history of Bayesian statistics (which can be traced back to the days of Laplace) is much longer than that of machine learning. MLAPP is not novel in holding such an idea; C. M. Bishop's Pattern Recognition and Machine Learning is another typical example. Both are considered classical textbooks on elementary machine learning.
In general, MLAPP reduces the difficulty of the book at the expense of some deductive completeness (for the first seven chapters). It covers a wider range of models and is suitable for those with a background in mathematical tools. The chapters concerning classical probabilistic models (e.g. chapters 2, 3, 4, 5, 7, 8, 11, 12) are comparable to PRML, but due to the reordering and additional details, they are worth a read even for one who has finished reading PRML.

1.3 What is this document?


The motivation for writing this document is that I needed to read the textbook MLAPP after enrolling in a machine learning course, but I failed to find any freely available compiled solution manual. Although several GitHub projects have started working on one, progress has been slow. I also wanted to focus more on the theoretical part of the text rather than the implementation code.
Hence I began working on this document. It was completed (first version, Chapter One to Chapter Fourteen) within the two weeks before the official semester. Because of the hurried process, it is suggested that readers read from a critical perspective and do not believe everything I have written down without checking. In the end, I hope that readers can provide comments and suggest revisions. Apart from correcting wrong answers, those who are good at MATLAB or LaTeX typesetting, or who are willing to participate in improving the document, are always welcome to contact me.

22/10/2017
Fangqi Li
Munich, Germany
[email protected]
[email protected]

1.4 Updating log


22/10/2017(First Chinese compilation)
02/03/2018(English compilation)

2 Probability
2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
Denote the two children by A and B, and define the events

E_1: A = \text{boy}, B = \text{girl}
E_2: B = \text{boy}, A = \text{girl}
E_3: A = \text{boy}, B = \text{boy}

In question a:

P(E_1) = P(E_2) = P(E_3) = \frac{1}{4}

P(\text{one girl} \mid \text{one boy}) = \frac{P(E_1) + P(E_2)}{P(E_1) + P(E_2) + P(E_3)} = \frac{2}{3}

For question b, w.l.o.g. assume child A is the one that is seen:

P(B = \text{girl} \mid A = \text{boy}) = \frac{1}{2}

2.2 Legal reasoning


Let E_1 and E_2 denote the events "the defendant committed the crime" and "the defendant has the special blood type" respectively. Thus:

p(E_1|E_2) = \frac{p(E_1, E_2)}{p(E_2)} = \frac{p(E_2|E_1)p(E_1)}{p(E_2)} = \frac{1 \cdot \frac{1}{800000}}{\frac{1}{100}} = \frac{1}{8000}

2.3 Variance of a sum

Calculate this straightforwardly:

var[X + Y] = E[(X + Y)^2] - E^2[X + Y]
= E[X^2] - E^2[X] + E[Y^2] - E^2[Y] + 2E[XY] - 2E[X]E[Y]
= var[X] + var[Y] + 2\,cov[X, Y]

2.4 Bayes rule for medical diagnosis


Applying Bayes' rule:

P(\text{ill}|\text{positive}) = \frac{P(\text{ill})P(\text{positive}|\text{ill})}{P(\text{ill})P(\text{positive}|\text{ill}) + P(\text{healthy})P(\text{positive}|\text{healthy})} = 0.0098

2.5 The Monty Hall problem (the dilemma of three doors)

The answer is b. Applying Bayes' rule:

P(\text{prize}_1|\text{choose}_1, \text{open}_3) = \frac{P(\text{choose}_1)P(\text{prize}_1)P(\text{open}_3|\text{prize}_1, \text{choose}_1)}{P(\text{choose}_1)P(\text{open}_3|\text{choose}_1)}
= \frac{P(\text{prize}_1)P(\text{open}_3|\text{prize}_1, \text{choose}_1)}{P(\text{open}_3|\text{choose}_1)}
= \frac{\frac{1}{3}\cdot\frac{1}{2}}{\frac{1}{3}\cdot\frac{1}{2} + \frac{1}{3}\cdot 0 + \frac{1}{3}\cdot 1} = \frac{1}{3}

In the last step we sum over the possible locations of the prize, so switching doors wins with probability 2/3.
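As a sanity check on this result, here is a small Monte Carlo simulation (a sketch, not part of the original solution; the function name and trial count are arbitrary choices):

```python
import random

def simulate(trials=100_000):
    """Estimate P(win | stick) and P(win | switch) for the Monty Hall game."""
    stick_wins = switch_wins = 0
    for _ in range(trials):
        prize = random.randrange(3)           # door hiding the prize
        choice = random.randrange(3)          # contestant's first pick
        # the host opens a door that is neither the chosen one nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        switched = next(d for d in range(3) if d != choice and d != opened)
        stick_wins += (choice == prize)
        switch_wins += (switched == prize)
    return stick_wins / trials, switch_wins / trials

print(simulate())   # roughly (0.333, 0.667), matching the 1/3 vs 2/3 analysis
```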

2.6 Conditional Independence


In question a, we have:

p(H|e_1, e_2) = \frac{p(H)p(e_1, e_2|H)}{p(e_1, e_2)}

Thus the answer is (ii).

For question b, we have the further decomposition:

p(H|e_1, e_2) = \frac{p(H)p(e_1|H)p(e_2|H)}{p(e_1, e_2)}

so both (i) and (ii) are obviously sufficient. Since

p(e_1, e_2) = \sum_H p(e_1, e_2, H) = \sum_H p(H)p(e_1|H)p(e_2|H),

(iii) is sufficient as well, because p(e_1, e_2) can be computed from it.

2.7 Pairwise independence does not imply mutual independence

Assume three boolean variables x_1, x_2, x_3, where x_1 and x_2 independently take the values 0 or 1 with equal probability and x_3 = XOR(x_1, x_2).
It is easy to prove that x_3 is independent of x_1 and of x_2, but given both x_1 and x_2 the value of x_3 is determined, and thereby mutual independence fails.

2.8 Conditional independence iff joint factorizes

We prove that 2.129 is equivalent to 2.130.
By denoting

g(x, z) = p(x|z), \qquad h(y, z) = p(y|z)

we have the first half of the proof.

Secondly, we have:

p(x|z) = \sum_y p(x, y|z) = \sum_y g(x, z)h(y, z) = g(x, z)\sum_y h(y, z)

p(y|z) = h(y, z)\sum_x g(x, z)

And:

1 = \sum_{x,y} p(x, y|z) = \Big(\sum_x g(x, z)\Big)\Big(\sum_y h(y, z)\Big)

Thus:

p(x|z)p(y|z) = g(x, z)h(y, z)\Big(\sum_x g(x, z)\Big)\Big(\sum_y h(y, z)\Big) = g(x, z)h(y, z) = p(x, y|z)

2.9 Conditional independence*


From a graphical-model point of view, both arguments appear correct. But in the general case neither of them admits the required decomposition form, thus both are false.

2.10 Deriving the inverse gamma density


According to the change of variables formula (with x = 1/y):

p(y) = p(x)\left|\frac{dx}{dy}\right|

we easily have:

IG(y) = Ga(x)\cdot y^{-2} = \frac{b^a}{\Gamma(a)}\Big(\frac{1}{y}\Big)^{(a-1)+2} e^{-b/y} = \frac{b^a}{\Gamma(a)}\, y^{-(a+1)} e^{-b/y}

2.11 Normalization constant for a 1D Gaussian


This proof should be found around any textbook about multivariable
calculus.Omitted here.
2 PROBABILITY 16

2.12 Expressing mutual information in terms of entropies

I(X; Y) = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)p(y)}
= \sum_{x,y} p(x, y)\log\frac{p(x|y)}{p(x)}
= \sum_{x,y} p(x, y)\log p(x|y) - \sum_x\Big(\sum_y p(x, y)\Big)\log p(x)
= -H(X|Y) + H(X)

Swapping X and Y yields the other identity.

2.13 Mutual information for correlated normals


We have:

I(X_1; X_2) = H(X_1) - H(X_1|X_2) = H(X_1) + H(X_2) - H(X_1, X_2)
= \frac{1}{2}\log(2\pi e\sigma^2) + \frac{1}{2}\log(2\pi e\sigma^2) - \frac{1}{2}\log\big((2\pi e)^2\sigma^4(1-\rho^2)\big)
= -\frac{1}{2}\log(1-\rho^2)

(refer to Elements of Information Theory, Example 8.5.1)

2.14 A measure of correlation


In question a:
H(Y |X) H(X) − H(Y |X)
r =1 − =
H(X) H(X)
H(Y ) − H(Y |X)
=
H(X)
I(X; Y )
=
H(X)
We have 0 ≤ r ≤ 1 in question b for I(X; Y ) > 0 and H(X|Y ) <
H(X)(properties of entropy).
r = 0 iff X and Y are independent.
r = 1 iff X is determined(not necassary equal) by Y .
2 PROBABILITY 17

2.15 MLE minimizes KL divergence to the empirical distribution

Expand the KL divergence:

\hat\theta = \arg\min_\theta KL(p_{emp}\,\|\,q(\cdot;\theta))
= \arg\min_\theta E_{p_{emp}}\Big[\log\frac{p_{emp}(x)}{q(x;\theta)}\Big]
= \arg\min_\theta \Big\{-H(p_{emp}) - \frac{1}{N}\sum_{x\in\mathcal{D}}\log q(x;\theta)\Big\}
= \arg\max_\theta \sum_{x\in\mathcal{D}}\log q(x;\theta)

In the third step the expectation under the empirical distribution reduces to an average over the dataset, and in the last step we drop the entropy of the empirical distribution because it does not depend on θ.

2.16 Mean, mode, variance for the beta distribution


Firstly, derive the mode of the beta distribution by differentiating the pdf:

\frac{d}{dx}\,x^{a-1}(1-x)^{b-1} = \big[(1-x)(a-1) - (b-1)x\big]\,x^{a-2}(1-x)^{b-2}

Setting this to zero yields:

mode = \frac{a-1}{a+b-2}

Secondly, derive the moments of the beta distribution:

E[x^N] = \frac{1}{B(a,b)}\int x^{a+N-1}(1-x)^{b-1}dx = \frac{B(a+N,b)}{B(a,b)} = \frac{\Gamma(a+N)\Gamma(b)}{\Gamma(a+N+b)}\cdot\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}

Setting N = 1, 2:

E[x] = \frac{a}{a+b}, \qquad E[x^2] = \frac{a(a+1)}{(a+b)(a+b+1)}

where we have used the property of the Gamma function. Straightforward algebra gives:

mean = E[x] = \frac{a}{a+b}
variance = E[x^2] - E^2[x] = \frac{ab}{(a+b)^2(a+b+1)}
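A quick numerical cross-check of these formulas (a sketch; the values a = 2, b = 5 are arbitrary) against scipy:

```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 5.0
mean = a / (a + b)
mode = (a - 1) / (a + b - 2)
var = a * b / ((a + b) ** 2 * (a + b + 1))

print(mean, beta.mean(a, b))   # both 0.2857...
print(var, beta.var(a, b))     # both 0.02551...
# the mode can be checked by evaluating the pdf on a fine grid
xs = np.linspace(0.0, 1.0, 100001)
print(mode, xs[np.argmax(beta.pdf(xs, a, b))])   # both close to 0.2
```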

2.17 Expected value of the minimum


Let m denote the location of the leftmost point. We have:

p(m > x) = p(X > x \text{ and } Y > x) = p(X > x)p(Y > x) = (1-x)^2

Therefore:

E[m] = \int x\,p(m = x)\,dx = \int_0^1 p(m > x)\,dx = \int_0^1 (1-x)^2\,dx = \frac{1}{3}

3 Generative models for discrete data


3.1 MLE for the Bernoulli/binomial model
Likelihood:

p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}

Log-likelihood:

\ln p(D|\theta) = N_1\ln\theta + N_0\ln(1-\theta)

Setting the derivative to zero:

\frac{\partial}{\partial\theta}\ln p(D|\theta) = \frac{N_1}{\theta} - \frac{N_0}{1-\theta} = 0

This ends in 3.22:

\theta = \frac{N_1}{N_1+N_0} = \frac{N_1}{N}

3.2 Marginal likelihood for the Beta-Bernoulli model


Likelihood:

p(D|\theta) = \theta^{N_1}(1-\theta)^{N_0}

Prior distribution:

p(\theta|a, b) = Beta(\theta|a, b) \propto \theta^{a-1}(1-\theta)^{b-1}

Posterior distribution:

p(\theta|D) \propto p(D|\theta)\cdot p(\theta|a, b) = \theta^{N_1+a-1}(1-\theta)^{N_0+b-1} \propto Beta(\theta|N_1+a, N_0+b)

Prediction:

p(x_{new}=1|D) = \int p(x_{new}=1|\theta)\,p(\theta|D)\,d\theta = \int\theta\,p(\theta|D)\,d\theta = E[\theta] = \frac{N_1+a}{N_1+a+N_0+b}

Calculate p(D) where D = 1, 0, 0, 1, 1:

p(D) = p(x_1)p(x_2|x_1)p(x_3|x_2, x_1)\cdots p(x_N|x_{N-1},\ldots,x_1)
= \frac{a}{a+b}\cdot\frac{b}{a+b+1}\cdot\frac{b+1}{a+b+2}\cdot\frac{a+1}{a+b+3}\cdot\frac{a+2}{a+b+4}

Denote \alpha = a+b, \alpha_1 = a, \alpha_0 = b; we have 3.83. To derive 3.80, we make use of:

\alpha_1(\alpha_1+1)\cdots(\alpha_1+N_1-1) = \frac{(\alpha_1+N_1-1)!}{(\alpha_1-1)!} = \frac{\Gamma(\alpha_1+N_1)}{\Gamma(\alpha_1)}
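The chain-rule computation of p(D) above can be checked against the closed-form Beta-function expression; a sketch (the choice a = b = 1 and the helper names are arbitrary):

```python
from math import gamma

def marginal_likelihood_sequential(data, a, b):
    """p(D) via the chain rule p(x1)p(x2|x1)..., updating the Beta counts as we go."""
    prob, n1, n0 = 1.0, 0, 0
    for x in data:
        p1 = (a + n1) / (a + b + n1 + n0)
        prob *= p1 if x == 1 else (1.0 - p1)
        n1 += x
        n0 += 1 - x
    return prob

def marginal_likelihood_closed_form(data, a, b):
    """p(D) = B(a + N1, b + N0) / B(a, b)."""
    B = lambda x, y: gamma(x) * gamma(y) / gamma(x + y)
    n1 = sum(data); n0 = len(data) - n1
    return B(a + n1, b + n0) / B(a, b)

D = [1, 0, 0, 1, 1]
print(marginal_likelihood_sequential(D, 1, 1))   # 1/60 = 0.01666...
print(marginal_likelihood_closed_form(D, 1, 1))  # same value
```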

3.3 Posterior predictive for Beta-Binomial model


Straightforward algebra:

Bb(1|\alpha_1', \alpha_0', 1) = \frac{B(\alpha_1'+1, \alpha_0')}{B(\alpha_1', \alpha_0')} = \frac{\Gamma(\alpha_0'+\alpha_1')}{\Gamma(\alpha_0'+\alpha_1'+1)}\cdot\frac{\Gamma(\alpha_1'+1)}{\Gamma(\alpha_1')} = \frac{\alpha_1'}{\alpha_1'+\alpha_0'}

3.4 Beta updating from censored likelihood


The derivation is straightforward:

p(θ, X < 3) =p(θ)p(X < 3|θ)


=p(θ)(p(X = 1|θ) + p(X = 2|θ))
=Beta(θ|1, 1)(Bin(1|5, θ) + Bin(2|5, θ))

3.5 Uninformative prior for log-odds ratio


Since

\phi = \log\frac{\theta}{1-\theta},

by the change of variables formula:

p(\theta) = p(\phi)\left|\frac{d\phi}{d\theta}\right| \propto \frac{1}{\theta(1-\theta)} \propto Beta(\theta|0, 0)

3.6 MLE for the Poisson distribution


Likelihood:

p(D|\lambda) = \prod_{n=1}^N Poi(x_n|\lambda) = \exp(-\lambda N)\cdot\lambda^{\sum_{n=1}^N x_n}\cdot\frac{1}{\prod_{n=1}^N x_n!}

Setting the derivative of the likelihood to zero:

\frac{\partial}{\partial\lambda}\, p(D|\lambda) \propto \exp(-\lambda N)\,\lambda^{\sum_n x_n - 1}\Big(-N\lambda + \sum_{n=1}^N x_n\Big) = 0

Thus:

\lambda = \frac{\sum_{n=1}^N x_n}{N}

3.7 Bayesian analysis of the Poisson distribution


We have:

p(\lambda|D) \propto p(\lambda)p(D|\lambda) \propto \exp(-\lambda(N+b))\cdot\lambda^{\sum_{n=1}^N x_n + a - 1} = Ga\Big(\lambda\,\Big|\,a+\sum_n x_n,\ N+b\Big)

This prior corresponds to introducing b prior observations with mean a/b.

3.8 MLE for the uniform distribution

The likelihood goes to zero if a < max(x_n), so we must have \hat a \ge \max(x_n). The likelihood looks like:

p(D|a) = \prod_{n=1}^N \frac{1}{2a}

which is decreasing in a, so:

\hat a = \max\{x_i\}_{i=1}^n

This model assigns p(x_{n+1}) = 0 if x_{n+1} > \max\{x_i\}_{i=1}^n, which gives a "hard" boundary to the distribution.

3.9 Bayesian analysis of the uniform distribution


The conjugate prior for uniform distribution if Pareto distribution:

p(θ) = P a(θ|K, b) = KbK θ−(K+1) [θ ≥ b]


n
Let m = max {xi }i=1 , the joint distribution is:

p(θ, D) = p(θ)p(D|θ) = KbK θ−(K+N +1) [θ ≥ b][θ ≥ m]

And the evidence is:


KbK
Z
p(D) = p(D, θ)dθ =
(N + K) max(m, b)N +K
Let µ = max {m, b}, the posterior distribution is again the form of a
Parato distribution:
p(θ, D) (N + K)µN +K [θ ≥ µ]
p(θ|D) = = = P a(θ|N + K, µ)
p(D) θN +K+1

3.10 Taxicab problem*


We skip this straightforward numerical task.

3.11 Bayesian analysis of the exponential distribution


The log-likelihood for an exponential distribution is:

\ln p(D|\theta) = N\ln\theta - \theta\sum_{n=1}^N x_n

The derivative is:

\frac{\partial}{\partial\theta}\ln p(D|\theta) = \frac{N}{\theta} - \sum_{n=1}^N x_n

Thus in question a:

\theta_{ML} = \frac{N}{\sum_{n=1}^N x_n}

We skip the other questions and state that the conjugate prior for the exponential distribution is the Gamma distribution:

p(\theta|D) \propto p(\theta)p(D|\theta) = Ga(\theta|a, b)\,p(D|\theta) \propto Ga\Big(\theta\,\Big|\,N+a,\ b+\sum_n x_n\Big)

A Gamma prior introduces a - 1 prior observations with sum b.

3.12 MAP estimation for the Bernoulli with non-conjugate priors*

We skip this straightforward numerical task.

3.13 Posterior predictive distribution for a batch of data with the Dirichlet-multinomial model

Since we already have 3.51:

p(X=j|D, \alpha) = \frac{\alpha_j + N_j}{\alpha_0 + N}

we can easily derive:

p(\tilde D|D, \alpha) = \prod_{x\in\tilde D} p(x|D, \alpha) = \prod_{j=1}^C\Big(\frac{\alpha_j + N_j^{old}}{\alpha_0 + N^{old}}\Big)^{N_j^{new}}

3.14 Posterior predictive for Dirichlet-multinomial*


We skip this straightforward numerical task.

3.15 Setting the hyper-parameters I*


We skip this straightforward numerical task.

3.16 Setting the beta hyper-parameters II


The parameters \alpha_1 and \alpha_2 of a Beta distribution are connected through:

\alpha_2 = \alpha_1\Big(\frac{1}{m} - 1\Big) = f(\alpha_1)

Calculate this integral:

u(\alpha_1) = \int_l^u \frac{1}{B(\alpha_1, f(\alpha_1))}\,\theta^{\alpha_1-1}(1-\theta)^{f(\alpha_1)-1}\,d\theta

Setting this integral u(\alpha_1) to 0.95 by altering \alpha_1 with a numerical method will do.

3.17 Marginal likelihood for beta-binomial under uniform prior

The marginal likelihood is given by:

p(N_1|N) = \int_0^1 p(N_1, \theta|N)\,d\theta = \int_0^1 p(N_1|\theta, N)p(\theta)\,d\theta

We already have:

p(N_1|\theta, N) = Bin(N_1|\theta, N), \qquad p(\theta) = Beta(\theta|1, 1)

Thus:

p(N_1|N) = \int_0^1\binom{N}{N_1}\theta^{N_1}(1-\theta)^{N-N_1}\,d\theta = \binom{N}{N_1}B(N_1+1, N-N_1+1)
= \frac{N!}{N_1!(N-N_1)!}\cdot\frac{N_1!(N-N_1)!}{(N+1)!} = \frac{1}{N+1}

where B is the normalizer of a Beta distribution:

B(a, b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

3.18 Bayes factor for coin tossing*


Straightforward calculation.

3.19 Irrelevant features with naive Bayes


The log-likelihood is given by:

\log p(x_i|c, \theta) = \sum_{w=1}^W x_{iw}\log\frac{\theta_{cw}}{1-\theta_{cw}} + \sum_{w=1}^W\log(1-\theta_{cw})

In a succinct way:

\log p(x_i|c, \theta) = \phi(x_i)^T\beta_c

where:

\phi(x_i) = (x_i, 1)^T
\beta_c = \Big(\log\frac{\theta_{c1}}{1-\theta_{c1}}, \ldots, \log\frac{\theta_{cW}}{1-\theta_{cW}}, \sum_{w=1}^W\log(1-\theta_{cw})\Big)^T

For question a (with equal class priors):

\log\frac{p(c=1|x_i)}{p(c=2|x_i)} = \log\frac{p(c=1)p(x_i|c=1)}{p(c=2)p(x_i|c=2)} = \log\frac{p(x_i|c=1)}{p(x_i|c=2)} = \phi(x_i)^T(\beta_1-\beta_2)

For question b, in a binary context:

p(c=1|x_i) = \frac{p(c=1)p(x_i|c=1)}{p(x_i)}

Thus:

\log\frac{p(c=1|x_i)}{p(c=2|x_i)} = \log\frac{p(c=1)}{p(c=2)} + \phi(x_i)^T(\beta_1-\beta_2)

A word w will not affect this posterior measure as long as:

x_{iw}(\beta_{1,w}-\beta_{2,w}) = 0

Hence:

\theta_{c=1,w} = \theta_{c=2,w}

So the chance that word w appears in documents of the two classes must be equal.
In question c, we have:

\hat\theta_{1,w} = 1 - \frac{1}{2+N_1}, \qquad \hat\theta_{2,w} = 1 - \frac{1}{2+N_2}

These are not equal when N_1 \ne N_2, so the bias effect remains. However, this effect diminishes as N grows large.

3.20 Class conditional densities for binary data


In question a, we have:

p(x|y=c) = \prod_{i=1}^D p(x_i|y=c, x_1, \ldots, x_{i-1})

The number of parameters is:

C\cdot\sum_{i=1}^D 2^i = C\cdot(2^{D+1}-2) = O(C\cdot 2^D)

For questions b and c, we generally expect the naive model to fit better when N is small, because the more delicate full model has problems of overfitting; with a large N the full model catches up.
In questions d, e and f, it is assumed that looking up a value according to a D-dimensional index costs O(D) time. It is easy to work out the fitting complexity: O(ND) for the naive model and O(N\cdot 2^D) for the full model; the prediction complexity is O(CD) and O(C\cdot 2^D) respectively.
For question f:

p(y|x_v) \propto p(x_v|y) = \sum_{x_h} p(x_v, x_h|y)

Thus the complexity is multiplied by an extra constant 2^{|x_h|}.

3.21 Mutual information for naive Bayes classifiers with binary features

By definition:

I(X; Y) = \sum_{x_j}\sum_y p(x_j, y)\log\frac{p(x_j, y)}{p(x_j)p(y)}

For binary features, consider the values of x_j to be zero and one. Given \pi_c = p(y=c), \theta_{jc} = p(x_j=1|y=c), \theta_j = p(x_j=1):

I_j = \sum_c p(x_j=1, c)\log\frac{p(x_j=1, c)}{p(x_j=1)p(c)} + \sum_c p(x_j=0, c)\log\frac{p(x_j=0, c)}{p(x_j=0)p(c)}
= \sum_c\Big[\pi_c\theta_{jc}\log\frac{\theta_{jc}}{\theta_j} + \pi_c(1-\theta_{jc})\log\frac{1-\theta_{jc}}{1-\theta_j}\Big]

which ends in 3.76.
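A small helper that evaluates this expression (a sketch; the inputs `pi` and `theta_j`, class priors and per-class Bernoulli parameters, and the example numbers are made up for illustration):

```python
import numpy as np

def mutual_info_binary_feature(pi, theta_j):
    """I(X_j; Y) in nats for a binary feature of a naive Bayes model.

    pi      : array with pi[c] = p(y = c)
    theta_j : array with theta_j[c] = p(x_j = 1 | y = c)
    """
    pi = np.asarray(pi, dtype=float)
    theta_j = np.asarray(theta_j, dtype=float)
    theta = np.sum(pi * theta_j)                         # p(x_j = 1)
    on = pi * theta_j * np.log(theta_j / theta)
    off = pi * (1 - theta_j) * np.log((1 - theta_j) / (1 - theta))
    return float(np.sum(on + off))

# a feature that is informative about the class
print(mutual_info_binary_feature([0.5, 0.5], [0.9, 0.1]))   # about 0.37 nats
```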

3.22 Fitting a naive Bayesian spam filter by hand*


Straightforward calculation.

4 Gaussian models
4.1 Uncorrelated does not imply independent
We first calculate the covariance of X and Y:

cov(X, Y) = \int\!\!\int (X-E(X))(Y-E(Y))\,p(X, Y)\,dX\,dY = \int_{-1}^{1} X\Big(X^2 - \frac{1}{3}\Big)\,dX = 0

The integral is zero since we are integrating an odd function over the range [-1, 1]; hence:

\rho(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X)var(Y)}} = 0

4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian

The pdf of Y is:

p(Y=a) = 0.5\,p(X=a) + 0.5\,p(X=-a) = p(X=a)

The pdf of X is symmetric about 0, so Y follows the normal distribution N(0, 1).
For question b, we have:

cov(X, Y) = E(XY) - E(X)E(Y) = E_W\big(E(XY|W)\big) - 0 = 0.5\,E(X^2) + 0.5\,E(-X^2) = 0

4.3 Correlation coefficient is between -1 and 1


The statement

-1 \le \rho(X, Y) \le 1

is equivalent to |\rho(X, Y)| \le 1. Hence we are to prove:

|cov(X, Y)|^2 \le var(X)\cdot var(Y)

which follows straightforwardly from the Cauchy-Schwarz inequality.

4.4 Correlation coefficient for linearly related variables is 1 or -1

When Y = aX + b:

E(Y) = aE(X) + b, \qquad var(Y) = a^2\,var(X)

Therefore:

cov(X, Y) = E(XY) - E(X)E(Y) = aE(X^2) + bE(X) - aE^2(X) - bE(X) = a\cdot var(X)

We also have:

var(X)var(Y) = a^2\,var(X)^2

These two give:

\rho(X, Y) = \frac{a}{|a|}

4.5 Normalization constant for a multidimensional Gaussian

This can be obtained by applying the method mentioned in the problem straightforwardly, hence omitted.

4.6 Bivariate Gaussian


Straightforward algebra.

4.7 Conditioning a bivariate Gaussian


Answers are obtained by plugging figures in 4.69 straightforwardly.

4.8 Whitening vs standardizing


Practise by yourself.

4.9 Sensor fusion with known variances in 1d


Denote the two observed datasets by Y^{(1)} and Y^{(2)}, with sizes N_1, N_2. The likelihood is:

p(Y^{(1)}, Y^{(2)}|\mu) = \prod_{n_1=1}^{N_1} p(Y^{(1)}_{n_1}|\mu)\ \prod_{n_2=1}^{N_2} p(Y^{(2)}_{n_2}|\mu) \propto \exp\{A\mu^2 + B\mu\}

where we have used:

A = -\frac{N_1}{2v_1} - \frac{N_2}{2v_2}, \qquad B = \frac{1}{v_1}\sum_{n_1=1}^{N_1} Y^{(1)}_{n_1} + \frac{1}{v_2}\sum_{n_2=1}^{N_2} Y^{(2)}_{n_2}

Differentiating the log-likelihood and setting it to zero, we have:

\mu_{ML} = -\frac{B}{2A}

The conjugate prior of this model must have a form proportional to \exp\{a\mu^2 + b\mu\}, namely a normal distribution:

p(\mu|a, b) \propto \exp\{a\mu^2 + b\mu\}

The posterior distribution is:

p(\mu|Y) \propto \exp\{(A+a)\mu^2 + (B+b)\mu\}

Hence we have the MAP estimate:

\mu_{MAP} = -\frac{B+b}{2(A+a)}

It is noticeable that the MAP estimate converges to the ML estimate as the number of observations grows:

\mu_{MAP}\to\mu_{ML}

The posterior distribution is another normal distribution, with:

\sigma^2_{MAP} = -\frac{1}{2(A+a)}
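The resulting precision-weighted fusion is simple to compute; a sketch (the sample data and the prior are invented for illustration, and the function name is arbitrary):

```python
import numpy as np

def fuse(y1, v1, y2, v2, prior_mean=None, prior_var=None):
    """Posterior mean/variance of mu from two sensors with known noise variances."""
    A = -len(y1) / (2 * v1) - len(y2) / (2 * v2)
    B = np.sum(y1) / v1 + np.sum(y2) / v2
    if prior_mean is not None:            # conjugate Gaussian prior N(prior_mean, prior_var)
        A += -1.0 / (2 * prior_var)
        B += prior_mean / prior_var
    return -B / (2 * A), -1.0 / (2 * A)   # (mu_MAP, sigma^2_post); without a prior this is the MLE

y1 = np.array([1.1, 0.9, 1.2])            # sensor 1, noise variance 0.01
y2 = np.array([1.5])                      # sensor 2, noise variance 1.0
print(fuse(y1, 0.01, y2, 1.0))            # dominated by the more precise sensor
```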

4.10 Derivation of information form formulae for marginalizing and conditioning

Please refer to PRML, Chapter 2.

4.11 Derivation of the NIW posterior

The likelihood for an MVN is given by:

p(X|\mu, \Sigma) = (2\pi)^{-\frac{ND}{2}}|\Sigma|^{-\frac{N}{2}}\exp\Big\{-\frac{1}{2}\sum_{n=1}^N (x_n-\mu)^T\Sigma^{-1}(x_n-\mu)\Big\}

According to 4.195:

\sum_{n=1}^N (x_n-\mu)^T\Sigma^{-1}(x_n-\mu) = \sum_{n=1}^N (\bar x-\mu+(x_n-\bar x))^T\Sigma^{-1}(\bar x-\mu+(x_n-\bar x))
= N(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu) + \sum_{n=1}^N (x_n-\bar x)^T\Sigma^{-1}(x_n-\bar x)
= N(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu) + Tr\Big\{\Sigma^{-1}\sum_{n=1}^N (x_n-\bar x)(x_n-\bar x)^T\Big\}
= N(\bar x-\mu)^T\Sigma^{-1}(\bar x-\mu) + Tr\{\Sigma^{-1}S_{\bar x}\}

The conjugate prior for the MVN parameters (\mu, \Sigma) is the Normal-inverse-Wishart (NIW) distribution, defined by:

NIW(\mu, \Sigma|m_0, k_0, v_0, S_0) = N\Big(\mu\Big|m_0, \frac{1}{k_0}\Sigma\Big)\cdot IW(\Sigma|S_0, v_0)
= \frac{1}{Z}|\Sigma|^{-\frac{v_0+D+2}{2}}\exp\Big\{-\frac{k_0}{2}(\mu-m_0)^T\Sigma^{-1}(\mu-m_0) - \frac{1}{2}Tr\{\Sigma^{-1}S_0\}\Big\}

Hence the posterior:

p(\mu, \Sigma|X) \propto |\Sigma|^{-\frac{v_X+D+2}{2}}\exp\Big\{-\frac{k_X}{2}(\mu-m_X)^T\Sigma^{-1}(\mu-m_X) - \frac{1}{2}Tr\{\Sigma^{-1}S_X\}\Big\}

where, by comparing the exponents of |\Sigma|, \mu^T\Sigma^{-1}\mu and \mu^T, we have:

k_X = k_0 + N, \qquad v_X = v_0 + N, \qquad m_X = \frac{N\bar x + k_0 m_0}{k_X}

Making use of A^T\Sigma^{-1}A = Tr\{A^T\Sigma^{-1}A\} = Tr\{\Sigma^{-1}AA^T\} and comparing the constant term inside the exponential:

N\bar x\bar x^T + S_{\bar x} + k_0 m_0 m_0^T + S_0 = k_X m_X m_X^T + S_X

Hence:

S_X = N\bar x\bar x^T + S_{\bar x} + k_0 m_0 m_0^T + S_0 - k_X m_X m_X^T

Using the definition of the mean, we end in 4.214 since:

S = \sum_{n=1}^N x_n x_n^T = S_{\bar x} + N\bar x\bar x^T

Hence the posterior distribution for the MVN takes the form NIW(m_X, k_X, v_X, S_X).

4.12
Straightforward calculation.

4.13 Gaussian posterior credible interval

Assume a prior distribution for a 1d normal distribution:

p(\mu) = N(\mu|\mu_0, \sigma_0^2 = 9)

and the observed variable follows:

p(x) = N(x|\mu, \sigma^2 = 4)

Having observed n variables, we require that at least 0.95 of the probability mass of µ's posterior distribution lies in an interval no longer than 1.
The posterior for µ is:

p(\mu|D) \propto p(\mu)p(D|\mu) = N(\mu|\mu_0, \sigma_0^2)\prod_{i=1}^n N(x_i|\mu, \sigma^2)
\propto \exp\Big\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\Big\}\prod_{i=1}^n\exp\Big\{-\frac{1}{2\sigma^2}(x_i-\mu)^2\Big\}
= \exp\Big\{\Big(-\frac{1}{2\sigma_0^2}-\frac{n}{2\sigma^2}\Big)\mu^2 + \ldots\Big\}

Hence the posterior variance is given by:

\sigma^2_{post} = \frac{\sigma_0^2\sigma^2}{\sigma^2 + n\sigma_0^2}

Since 0.95 of the probability mass of a normal distribution lies within ±1.96σ of its mean, we need 2\cdot 1.96\,\sigma_{post}\le 1, which gives:

n \ge 62
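The required sample size can also be found numerically; a sketch that simply scans n until the 95% central interval is short enough:

```python
from scipy.stats import norm

sigma2, sigma0_2 = 4.0, 9.0
z = norm.ppf(0.975)                        # 1.959963...

n = 1
while True:
    post_var = 1.0 / (1.0 / sigma0_2 + n / sigma2)
    if 2 * z * post_var ** 0.5 <= 1.0:     # width of the 95% central interval
        break
    n += 1
print(n)                                   # 62
```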

4.14 MAP estimation for 1d Gaussians


Assume the variance σ² of this distribution is known, and that the mean µ follows a normal distribution with mean m and variance s². Similar to the previous question, the posterior takes the form:

p(\mu|X) \propto p(\mu)p(X|\mu)

The posterior is another normal distribution. The coefficient of µ² is:

-\frac{1}{2s^2} - \frac{N}{2\sigma^2}

and that of µ:

\frac{m}{s^2} + \frac{\sum_{n=1}^N x_n}{\sigma^2}

We obtain the posterior mean and variance by the technique of "completing the square":

\sigma^2_{post} = \frac{s^2\sigma^2}{\sigma^2 + N s^2}, \qquad \mu_{post} = \Big(\frac{m}{s^2} + \frac{\sum_{n=1}^N x_n}{\sigma^2}\Big)\cdot\sigma^2_{post}

We already know the MLE is:

\mu_{ML} = \frac{\sum_{n=1}^N x_n}{N}

As N increases, \mu_{post} converges to \mu_{ML}.
Consider the variance s². When it increases, the MAP estimate approaches the MLE; when it decreases, the MAP estimate approaches the prior mean. The prior variance quantifies our confidence in the prior guess: intuitively, the larger the prior variance, the less we trust the prior mean.

4.15 Sequential(recursive) updating of covariance matrix


Making use of:

m_{n+1} = \frac{n m_n + x_{n+1}}{n+1}

what is left is straightforward algebra.

4.16 Likelihood ratio for Gaussians


Consider a classifier for two classes whose generative distributions are two normal distributions p(x|y=C_i) = N(x|\mu_i, \Sigma_i). By Bayes' formula:

\log\frac{p(y=1|x)}{p(y=0|x)} = \log\frac{p(x|y=1)}{p(x|y=0)} + \log\frac{p(y=1)}{p(y=0)}

The first term on the right-hand side is the likelihood ratio.
With arbitrary covariance matrices:

\frac{p(x|y=1)}{p(x|y=0)} = \sqrt{\frac{|\Sigma_0|}{|\Sigma_1|}}\exp\Big\{-\frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + \frac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0)\Big\}

This cannot be reduced further. However, it is noticeable that the decision boundary is a quadric surface in D-dimensional space.
When both covariance matrices are given by Σ:

\frac{p(x|y=1)}{p(x|y=0)} = \exp\Big\{x^T\Sigma^{-1}(\mu_1-\mu_0) - \frac{1}{2}Tr\big\{\Sigma^{-1}(\mu_1\mu_1^T - \mu_0\mu_0^T)\big\}\Big\}

The decision boundary becomes a hyperplane.
If we assume the covariance matrix to be diagonal, the closed form of the answer looks similar, with some matrix multiplications changed into inner products or arithmetic multiplications.

4.17 LDA/QDA on height/weight data


Practise by yourself.

4.18 Naive Bayes with mixed features


Straightforward calculation.

4.19 Decision boundary for LDA with semi tied covariances

Omitting the shared parameters ends in:

p(y=1|x) = \frac{p(y=1)p(x|y=1)}{p(y=0)p(x|y=0) + p(y=1)p(x|y=1)}

Considering a uniform prior, this can be reduced to:

\frac{p(x|y=1)}{p(x|y=0) + p(x|y=1)}
= \frac{1}{k^{\frac{D}{2}}\exp\big\{-\frac{1}{2}(x-\mu_0)^T\Sigma_0^{-1}(x-\mu_0) + \frac{1}{2}(x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1)\big\} + 1}
= \frac{1}{k^{\frac{D}{2}}\exp\big\{-\frac{1}{2}(1-\frac{1}{k})x^T\Sigma_0^{-1}x + x^Tu + c\big\} + 1}

where we have used:

|\Sigma_1| = |k\Sigma_0| = k^D|\Sigma_0|

The decision boundary is still a quadric surface. It reduces to a hyperplane when k = 1. As k increases, the decision boundary becomes a surface that surrounds µ_0. When k goes to infinity, the decision boundary degenerates, which reflects that everything outside it is assigned to the class whose normal distribution has infinite variance.

4.20 Logistic regression vs LDA/QDA


We give a qualitative answer based on the argument that "overfitting arises from MLE, and is positively correlated with the complexity of the model (namely the number of independent parameters in the model)".
GaussI assumes a covariance matrix proportional to the identity matrix;
GaussX makes no prior assumption on the covariance matrix;
LinLog assumes that different classes share the same covariance matrix;
QuadLog makes no prior assumption on the covariance matrix.
From the perspective of complexity:

QuadLog = GaussX > LinLog > GaussI

The training-set accuracy of the MLE follows the same order.
The argument in e is not true in general: a larger product does not necessarily imply a larger sum.

4.21 Gaussian decision boundaries


Straightforward algebra.

4.22 QDA with 3 classes


Straightforward calculation.

4.23 Scalar QDA


Practice by yourself.

5 Bayesian statistics
5.1 Proof that a mixture of conjugate priors is indeed conjugate

For 5.69 and 5.70, formally:

p(\theta|D) = \sum_k p(\theta, k|D) = \sum_k p(k|D)p(\theta|k, D)

where:

p(k|D) = \frac{p(k, D)}{p(D)} = \frac{p(k)p(D|k)}{\sum_{k'} p(k')p(D|k')}

5.2 Optimal threshold on classification probability


The posterior expected loss is given by:

\rho(\hat y|x) = \sum_y L(\hat y, y)p(y|x) = p_0 L(\hat y, 0) + p_1 L(\hat y, 1) = L(\hat y, 1) + p_0\big(L(\hat y, 0) - L(\hat y, 1)\big)

When the two classification results yield the same loss:

\hat p_0 = \frac{\lambda_{01}}{\lambda_{01} + \lambda_{10}}

Hence when p_0 \ge \hat p_0, we estimate \hat y = 0.

5.3 Reject option in classifiers


The posterior expected loss is given by:

\rho(a|x) = \sum_c L(a, c)p(c|x)

Denote the class with maximal posterior confidence by ĉ:

\hat c = \arg\max_c\{p(c|x)\}

Now we have two applicable actions: a = ĉ or a = reject.
When a = ĉ, the posterior expected loss is:

\rho_{\hat c} = (1 - p(\hat c|x))\cdot\lambda_s

When we reject, the posterior expected loss is:

\rho_{reject} = \lambda_r

Thus the condition for choosing a = ĉ instead of rejecting is:

\rho_{\hat c} \le \rho_{reject}

or:

p(\hat c|x) \ge 1 - \frac{\lambda_r}{\lambda_s}

5.4 More reject options


Straightforward calculation.

5.5 Newsvendor problem


By:

E(\pi|Q) = P\int_0^Q D f(D)\,dD - CQ\int_0^Q f(D)\,dD + (P-C)Q\int_Q^{+\infty} f(D)\,dD

we have:

\frac{\partial}{\partial Q}E(\pi|Q) = PQf(Q) - C\int_0^Q f(D)\,dD - CQf(Q) + (P-C)\int_Q^{+\infty} f(D)\,dD - (P-C)Qf(Q)

Set it to zero, making use of \int_0^Q f(D)\,dD + \int_Q^{+\infty} f(D)\,dD = 1:

F(Q^*) = \int_0^{Q^*} f(D)\,dD = \frac{P-C}{P}

5.6 Bayes factors and ROC curves


Practise by yourself.

5.7 Bayes model averaging helps predictive accuracy


Expand both sides of 5.127 and exchange the order of integration:

E[L(\Delta, p_{BMA})] = H(p_{BMA})

We also have:

E[L(\Delta, p_m)] = E_{p_{BMA}}[-\log p_m]

Subtracting the right side from the left side ends in:

-KL(p_{BMA}\,\|\,p_m) \le 0

Hence the left side is always smaller than (or equal to) the right side.

5.8 MLE and model selection for a 2d discrete distribution

The joint distribution p(x, y|\theta_1, \theta_2) is given by:

p(x=0, y=0) = (1-\theta_1)\theta_2
p(x=0, y=1) = (1-\theta_1)(1-\theta_2)
p(x=1, y=0) = \theta_1(1-\theta_2)
p(x=1, y=1) = \theta_1\theta_2

which can be summarized as:

p(x, y|\theta_1, \theta_2) = \theta_1^x(1-\theta_1)^{1-x}\,\theta_2^{I(x=y)}(1-\theta_2)^{1-I(x=y)}

The MLE is:

\theta_{ML} = \arg\max_\theta\Big(\sum_{n=1}^N \ln p(x_n, y_n|\theta)\Big)

Hence:

\theta_{ML} = \arg\max_\theta\Big(N\ln\big[(1-\theta_1)(1-\theta_2)\big] + N_x\ln\frac{\theta_1}{1-\theta_1} + N_{I(x=y)}\ln\frac{\theta_2}{1-\theta_2}\Big)

The two parameters can be estimated independently given X and Y.
We can further rewrite the joint distribution as:

p(x, y|\theta) = \theta_{x,y}

Then:

\theta_{ML} = \arg\max_\theta\Big(\sum_{x,y} N_{x,y}\ln\theta_{x,y}\Big)

The MLE can be done by using the normalization constraint (via a Lagrange multiplier).
The rest is straightforward algebra.

5.9 Posterior median is optimal estimate under L1 loss


The posterior expected loss is (where we have omitted D w.l.o.g.):

\rho(a) = \int|y-a|p(y)\,dy = \int_{-\infty}^a (a-y)p(y)\,dy + \int_a^{+\infty}(y-a)p(y)\,dy
= a\Big(\int_{-\infty}^a p(y)\,dy - \int_a^{+\infty}p(y)\,dy\Big) - \int_{-\infty}^a yp(y)\,dy + \int_a^{+\infty}yp(y)\,dy

Differentiating, we have:

\frac{\partial}{\partial a}\rho(a) = \int_{-\infty}^a p(y)\,dy - \int_a^{+\infty}p(y)\,dy + a\cdot 2p(a) - 2ap(a)

Set it to zero:

\int_{-\infty}^a p(y)\,dy = \int_a^{+\infty}p(y)\,dy = \frac{1}{2}

5.10 Decision rule for trading off FPs and FNs


Given:

L_{FN} = c\,L_{FP}

picking \hat y = 1 is optimal when

L_{FP}\,p(y=0|x) < L_{FN}\,p(y=1|x), \quad\text{i.e.}\quad \frac{p(y=1|x)}{p(y=0|x)} > \frac{1}{c}

Using:

p(y=1|x) + p(y=0|x) = 1

we get the threshold \frac{1}{1+c} on p(y=1|x).

6 Frequentist statistics
The philosophy behind this chapter is out of the scope of probabilistic ML; you should be able to find solutions to the four listed problems in any decent textbook on mathematical statistics.
GL.

7 Linear regression
7.1 Behavior of training set error with increasing sample size

When the training set is small at the beginning, the trained model is over-fitted to the current data set, so the training accuracy can be relatively high. As the training set grows, the model has to learn more general-purpose parameters, which reduces the overfitting effect and therefore lowers the training accuracy.
As pointed out in Section 7.5.4, increasing the training set is an important way of countering over-fitting besides adding a regularizer.

7.2 Multi-output linear regression


Straightforward calculation.

7.3 Centering and ridge regression


By rewriting x as (x^T, 1)^T to absorb w_0, the NLL is given by:

NLL(w) = (y - Xw)^T(y - Xw) + \lambda w^Tw

So:

\frac{\partial}{\partial w}NLL(w) = 2X^TXw - 2X^Ty + 2\lambda w

Therefore:

w = (X^TX + \lambda I)^{-1}X^Ty
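A direct transcription of this closed form (a sketch; note that, as in the derivation here, the appended bias column is also penalized, and the synthetic data are invented):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve w = (X^T X + lam I)^{-1} X^T y, with a constant column absorbing w0."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])          # append the bias column
    D = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(D), Xb.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=1.0))    # close to [1, -2, 0.5, 0.3]
```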

7.4 MLE for σ² for linear regression

Firstly, we give the likelihood:

p(D|w, \sigma^2) = p(y|w, \sigma^2, X) = \prod_{n=1}^N p(y_n|x_n, w, \sigma^2) = \prod_{n=1}^N N(y_n|w^Tx_n, \sigma^2)
= \frac{1}{(2\pi\sigma^2)^{\frac{N}{2}}}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{n=1}^N(y_n - w^Tx_n)^2\Big\}

As for σ²:

\frac{\partial}{\partial\sigma^2}\log p(D|w, \sigma^2) = -\frac{N}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{n=1}^N(y_n - w^Tx_n)^2

We have:

\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^N(y_n - w^Tx_n)^2

7.5 MLE for the offset term in linear regression


NLL:

NLL(w, w_0) \propto \sum_{n=1}^N(y_n - w_0 - w^Tx_n)^2

Differentiating with respect to w_0:

\frac{\partial}{\partial w_0}NLL(w, w_0) \propto -Nw_0 + \sum_{n=1}^N(y_n - w^Tx_n)

w_{0,ML} = \frac{1}{N}\sum_{n=1}^N(y_n - w^Tx_n) = \bar y - w^T\bar x

Centering X and y:

X_c = X - \bar X, \qquad y_c = y - \bar y

The centered datasets have zero mean, thus the regression model has w_0 equal to zero, and at the same time:

w_{ML} = (X_c^TX_c)^{-1}X_c^Ty_c

7.6 MLE for simple linear regression


Using the conclusion from problem 7.5, what is left is straightforward algebra.

7.7 Sufficient statistics for online linear regression


Parts a and b can be solved according to the hints.
For c, substituting the x in the hint by y yields the conclusion.
In d we are to prove:

(n+1)C_{xy}^{(n+1)} = nC_{xy}^{(n)} + x_{n+1}y_{n+1} + n\bar x^{(n)}\bar y^{(n)} - (n+1)\bar x^{(n+1)}\bar y^{(n+1)}

Expand C_{xy} on both sides and use \bar x^{(n+1)} = \bar x^{(n)} + \frac{1}{n+1}(x_{n+1} - \bar x^{(n)}).
Problems e and f: practise by yourself.

7.8 Bayesian linear regression in 1d with known σ 2


Problem a: practise by yourself.
For b, choose the prior distribution:

p(w) \propto N(w_1|0, 1) \propto \exp\Big\{-\frac{1}{2}w_1^2\Big\}

Writing it as:

p(w) = N(w|w_0, V_0) \propto \exp\Big\{-\frac{1}{2}V_{0,11}^{-1}(w_0 - w_{00})^2 - \frac{1}{2}V_{0,22}^{-1}(w_1 - w_{01})^2 - V_{0,12}^{-1}(w_0 - w_{00})(w_1 - w_{01})\Big\}

formally we take:

w_{01} = 0, \qquad V_{0,22}^{-1} = 1, \qquad V_{0,11}^{-1} = V_{0,12}^{-1} = 0, \qquad w_{00}\ \text{arbitrary}

In problem c, we consider the posterior distribution of the parameters:

p(w|D, \sigma^2) \propto N(w|m_0, V_0)\prod_{n=1}^N N(y_n|w_0 + w_1x_n, \sigma^2)

The coefficients of w_1^2 and w_1 in the exponential are:

-\frac{1}{2} - \frac{1}{2\sigma^2}\sum_{n=1}^N x_n^2, \qquad -\frac{1}{\sigma^2}\sum_{n=1}^N x_n(w_0 - y_n)

Hence the posterior mean and variance are given by:

\sigma^2_{post} = \frac{\sigma^2}{\sigma^2 + \sum_{n=1}^N x_n^2}, \qquad E[w_1|D, \sigma^2] = \sigma^2_{post}\Big(-\frac{1}{\sigma^2}\sum_{n=1}^N x_n(w_0 - y_n)\Big)

It can be noticed that the accumulation of samples reduces the posterior variance.

7.9 Generative model for linear regression


For the sake of convenience, we consider a centered dataset (without changing symbols):

w_0 = 0, \qquad \mu_x = \mu_y = 0

By the definition of covariance:

\Sigma_{XX} = X^TX, \qquad \Sigma_{YX} = Y^TX

Using the conclusion from Section 4.3.1:

p(Y|X = x) = N(Y|\mu_{Y|X}, \Sigma_{Y|X})

where:

\mu_{Y|X} = \mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(X - \mu_X) = Y^TX(X^TX)^{-1}X = w^TX

7.10 Bayesian linear regression using the g-prior


Recall the ridge regression model, where we have the likelihood:

p(D|w, \sigma^2) = \prod_{n=1}^N N(y_n|w^Tx_n, \sigma^2)

The prior distribution is the Normal-Inverse-Gamma distribution:

p(w, \sigma^2) = NIG(w, \sigma^2|w_0, V_0, a_0, b_0) = N(w|w_0, \sigma^2V_0)\,IG(\sigma^2|a_0, b_0)
= \frac{1}{(2\pi)^{\frac{D}{2}}|\sigma^2V_0|^{\frac{1}{2}}}\exp\Big\{-\frac{1}{2}(w - w_0)^T(\sigma^2V_0)^{-1}(w - w_0)\Big\}\cdot\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-(a_0+1)}\exp\Big\{-\frac{b_0}{\sigma^2}\Big\}
= \frac{b_0^{a_0}}{(2\pi)^{\frac{D}{2}}|V_0|^{\frac{1}{2}}\Gamma(a_0)}(\sigma^2)^{-(a_0+\frac{D}{2}+1)}\exp\Big\{-\frac{(w - w_0)^TV_0^{-1}(w - w_0) + 2b_0}{2\sigma^2}\Big\}

The posterior distribution takes the form:

p(w, \sigma^2|D) \propto p(w, \sigma^2)p(D|w, \sigma^2)
\propto (\sigma^2)^{-(a_0+\frac{D}{2}+1)}\exp\Big\{-\frac{(w - w_0)^TV_0^{-1}(w - w_0) + 2b_0}{2\sigma^2}\Big\}\cdot(\sigma^2)^{-\frac{N}{2}}\exp\Big\{-\frac{\sum_{n=1}^N(y_n - w^Tx_n)^2}{2\sigma^2}\Big\}

Comparing the exponent of σ²:

a_N = a_0 + \frac{N}{2}

Comparing the coefficient of w^Tw:

V_N^{-1} = V_0^{-1} + \sum_{n=1}^N x_nx_n^T = V_0^{-1} + X^TX

Comparing the coefficient of w:

V_N^{-1}w_N = V_0^{-1}w_0 + \sum_{n=1}^N y_nx_n

Thus:

w_N = V_N(V_0^{-1}w_0 + X^Ty)

Finally, comparing the constant term inside the exponential:

b_N = b_0 + \frac{1}{2}\big(w_0^TV_0^{-1}w_0 + y^Ty - w_N^TV_N^{-1}w_N\big)

We have obtained 7.70 to 7.73, which can be summarized as 7.69:

p(w, \sigma^2|D) = NIG(w, \sigma^2|w_N, V_N, a_N, b_N)



8 Logistic regression
8.1 Spam classification using logistic regression
Practice by yourself.

8.2 Spam classification using naive Bayes


Practice by yourself.

8.3 Gradient and Hessian of log-likelihood for logistic regression

\frac{\partial}{\partial a}\sigma(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \frac{1}{1 + e^{-a}}\cdot\frac{e^{-a}}{1 + e^{-a}} = \sigma(a)(1 - \sigma(a))

g(w) = \frac{\partial}{\partial w}NLL(w) = -\sum_{i=1}^N\frac{\partial}{\partial w}\big[y_i\log\mu_i + (1 - y_i)\log(1 - \mu_i)\big]
= -\sum_{i=1}^N\Big[\frac{y_i}{\mu_i} - \frac{1 - y_i}{1 - \mu_i}\Big]\mu_i(1 - \mu_i)\,x_i
= \sum_{i=1}^N\big(\sigma(w^Tx_i) - y_i\big)x_i

For an arbitrary non-zero vector u (of proper shape):

u^TX^TSXu = (Xu)^TS(Xu)

Since S is positive definite, for an arbitrary non-zero v:

v^TSv > 0

Assuming X is a full-rank matrix, Xu is non-zero, thus:

(Xu)^TS(Xu) = u^T(X^TSX)u > 0

So X^TSX is positive definite.
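These two quantities translate directly into code; a sketch (X is N x D, y is a 0/1 vector; the synthetic data and the single Newton step are just an illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad_hess(w, X, y):
    """Gradient X^T(mu - y) and Hessian X^T S X of the logistic regression NLL."""
    mu = sigmoid(X @ w)
    g = X.T @ (mu - y)
    S = np.diag(mu * (1.0 - mu))
    H = X.T @ S @ X
    return g, H

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = (sigmoid(X @ np.array([1.0, -1.0, 0.5])) > rng.random(100)).astype(float)
w = np.zeros(3)
g, H = nll_grad_hess(w, X, y)
w -= np.linalg.solve(H, g)        # one Newton (IRLS) step
print(w)
```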

8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression

By considering one independent component at a time, the complexity in form caused by the tensor product is reduced. For a specific w_*:

\frac{\partial}{\partial w_*}NLL(W) = -\sum_{n=1}^N\frac{\partial}{\partial w_*}\Big[y_{n*}w_*^Tx_n - \log\Big(\sum_{c=1}^C\exp(w_c^Tx_n)\Big)\Big]
= \sum_{n=1}^N\Big(-y_{n*} + \frac{\exp(w_*^Tx_n)}{\sum_{c=1}^C\exp(w_c^Tx_n)}\Big)x_n = \sum_{n=1}^N(\mu_{n*} - y_{n*})x_n

Combining the independent solutions for all classes into one matrix yields 8.38.
For the Hessian matrix, consider taking the gradient w.r.t. w_1 and w_2:

H_{1,2} = \nabla_{w_2}\nabla_{w_1}NLL(W) = \frac{\partial}{\partial w_2}\sum_{n=1}^N(\mu_{n1} - y_{n1})x_n

When w_1 and w_2 are the same:

\frac{\partial}{\partial w_1}\sum_{n=1}^N(\mu_{n1} - y_{n1})x_n^T = \sum_{n=1}^N\frac{\partial\mu_{n1}}{\partial w_1}x_n^T
= \sum_{n=1}^N\frac{\exp(w_1^Tx_n)\big(\sum_c\exp(w_c^Tx_n)\big) - \exp(w_1^Tx_n)^2}{\big(\sum_c\exp(w_c^Tx_n)\big)^2}\,x_nx_n^T
= \sum_{n=1}^N\mu_{n1}(1 - \mu_{n1})x_nx_n^T

When w_1 and w_2 are different:

\frac{\partial}{\partial w_2}\sum_{n=1}^N\mu_{n1}x_n^T = \sum_{n=1}^N\frac{-\exp(w_2^Tx_n)\exp(w_1^Tx_n)}{\big(\sum_c\exp(w_c^Tx_n)\big)^2}\,x_nx_n^T = -\sum_{n=1}^N\mu_{n1}\mu_{n2}x_nx_n^T

This ends in 8.44. The condition \sum_c y_{nc} = 1 is used from 8.34 to 8.35.

8.5 Symmetric version of l2 regularized multinomial logistic regression

Adding a regularizer is equivalent to doing MAP estimation, which in turn is equivalent to introducing a Lagrange multiplier for a new constraint. In this problem a Gaussian prior distribution with a homogeneous diagonal covariance matrix is introduced, and this leads to the constraint on the w_{cj} below.
At the optimum, the gradient in 8.47 goes to zero. Assume that \hat\mu_{cj} = y_{cj}; then g(W) = 0. The extra regularization gives \lambda\sum_{c=1}^C w_c = 0, which equals D independent linear constraints, of the form: for j = 1, \ldots, D, \sum_{c=1}^C\hat w_{cj} = 0.

8.6 Elementary properties of l2 regularized logistic regression

The first term of J(w)'s Hessian is positive definite (8.7); the second term's Hessian is positive definite as well (λ > 0). Therefore this function has a positive definite Hessian and hence a global optimum.
The posterior distribution takes the form:

p(w|D) \propto p(D|w)p(w), \qquad p(w) = N(w|0, \sigma^2I)

NLL(w) = -\log p(w|D) = -\log p(D|w) + \frac{1}{2\sigma^2}w^Tw + c

Therefore:

\lambda = \frac{1}{2\sigma^2}

The number of zeros in the global optimum is related to the value of λ, which is negatively correlated with the prior uncertainty of w. The smaller the uncertainty, the more w shrinks towards zero, which ends in more zeros in the answer.
If λ = 0, the prior uncertainty goes to infinity, and the posterior estimate converges to the MLE. As long as there is no constraint on w, it is possible that some components of w go to infinity.
When λ increases, the prior uncertainty decreases, hence the over-fitting effect is reduced. Generally this implies a decrease in training-set accuracy. At the same time, it may also increase the accuracy of the model on the test set, but this does not always happen.

8.7 Regularizing separate terms in 2d logistic regression


Practice by yourself.

9 Generalized linear models and the exponential family
9.1 Conjugate prior for univariate Gaussian in exponential family form
The 1d Gaussian distribution is:

N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\}

Rewrite it as:

p(x|\mu, \sigma^2) = \exp\Big\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} - \frac{\ln(2\pi\sigma^2)}{2}\Big\}

Denote \theta = (-\frac{\lambda}{2}, \lambda\mu)^T, A(\theta) = \frac{\lambda\mu^2}{2} + \frac{\ln(2\pi)}{2} - \frac{\ln\lambda}{2}, \phi(x) = (x^2, x)^T, where \lambda = 1/\sigma^2.
Consider the likelihood of the dataset D:

p(D|\theta) = \exp\Big\{\theta^T\Big(\sum_{n=1}^N\phi(x_n)\Big) - N\cdot A(\theta)\Big\}

According to the meaning of a prior distribution, we set up an imagined observation background in order to define the prior; by the form of the exponential family, the sufficient statistics are the only thing that matters. Assume that we have M prior observations, whose average squared value and average value are v_1 and v_2 respectively; then the prior distribution takes the form:

p(\theta|M, v_1, v_2) = \exp\{\theta_1\cdot Mv_1 + \theta_2\cdot Mv_2 - M\cdot A(\theta)\}
= \exp\Big\{-\frac{\lambda}{2}Mv_1 + \lambda\mu Mv_2 - \frac{M}{2}\lambda\mu^2 - \frac{M}{2}\ln 2\pi + \frac{M}{2}\ln\lambda\Big\}

It has three independent parameters. We are to prove that this equals p(\mu, \lambda) = N(\mu|\gamma, \frac{1}{\lambda(2\alpha-1)})\,Ga(\lambda|\alpha, \beta). Expand it into exponential form and ignore the terms independent of µ, λ:

p(\mu, \lambda) \propto \exp\Big\{(\alpha - 1)\ln\lambda - \beta\lambda - \frac{\lambda(2\alpha - 1)}{2}\mu^2 - \frac{\lambda(2\alpha - 1)}{2}\gamma^2 + \lambda(2\alpha - 1)\mu\gamma + \frac{1}{2}\ln\lambda\Big\}

Comparing the coefficients of \lambda\mu^2, \lambda\mu, \lambda, \ln\lambda, we obtain:

-\frac{2\alpha - 1}{2} = -\frac{M}{2}
\gamma(2\alpha - 1) = Mv_2
-\frac{2\alpha - 1}{2}\gamma^2 - \beta = -\frac{1}{2}Mv_1
(\alpha - 1) + \frac{1}{2} = \frac{M}{2}

Combining them ends in:

\alpha = \frac{M + 1}{2}, \qquad \beta = \frac{M}{2}(v_1 - v_2^2), \qquad \gamma = v_2

Thus the two distributions are equal up to a simple renaming of variables.

9.2 The MVN is in the exponential family


Here you can find a comprehensive solution:
https://stats.stackexchange.com/questions/231714/sufficient-statistic-for-multivari

10 Directed graphical models(Bayes nets)


...

11 Mixture models and the EM algorithm


11.1 Student T as infinite mixture of Gaussian
The 1d Student-t distribution takes the form:

St(x|\mu, \sigma^2, v) = \frac{\Gamma(\frac{v}{2} + \frac{1}{2})}{\Gamma(\frac{v}{2})}\Big(\frac{1}{\pi v\sigma^2}\Big)^{\frac{1}{2}}\Big(1 + \frac{(x - \mu)^2}{v\sigma^2}\Big)^{-\frac{v+1}{2}}

Consider the left side of 11.61:

\int N\Big(x\Big|\mu, \frac{\sigma^2}{z}\Big)Ga\Big(z\Big|\frac{v}{2}, \frac{v}{2}\Big)dz
= \int\frac{\sqrt z}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{z}{2\sigma^2}(x - \mu)^2\Big\}\frac{(\frac{v}{2})^{\frac{v}{2}}}{\Gamma(\frac{v}{2})}z^{\frac{v}{2} - 1}\exp\Big\{-\frac{v}{2}z\Big\}dz
= \frac{1}{\sqrt{2\pi}\sigma}\frac{(\frac{v}{2})^{\frac{v}{2}}}{\Gamma(\frac{v}{2})}\int z^{\frac{v-1}{2}}\exp\Big\{-\Big(\frac{v}{2} + \frac{(x - \mu)^2}{2\sigma^2}\Big)z\Big\}dz

The integrand consists of the z-dependent terms of the Gamma distribution Ga(z|\frac{v+1}{2}, \frac{(x-\mu)^2}{2\sigma^2} + \frac{v}{2}), so the integral equals the inverse of that distribution's normalization constant:

\int z^{\frac{v-1}{2}}\exp\Big\{-\Big(\frac{v}{2} + \frac{(x - \mu)^2}{2\sigma^2}\Big)z\Big\}dz = \Gamma\Big(\frac{v+1}{2}\Big)\Big(\frac{(x - \mu)^2}{2\sigma^2} + \frac{v}{2}\Big)^{-\frac{v+1}{2}}

Plugging this in derives 11.61.

11.2 EM for mixture of Gaussians


We are to optimize:

Q(\theta, \theta^{old}) = E_{p(z|D, \theta^{old})}\Big[\sum_{n=1}^N\log p(x_n, z_n|\theta)\Big]
= \sum_{n=1}^N E\Big[\log\prod_{k=1}^K\big(\pi_k p(x_n|z_k, \theta)\big)^{z_{nk}}\Big]
= \sum_{n=1}^N\sum_{k=1}^K r_{nk}\log\big(\pi_k p(x_n|z_k, \theta)\big)

where:

r_{nk} = p(z_{nk} = 1|x_n, \theta^{old})

When the emission distribution p(x|z, \theta) is Gaussian, consider first the terms in Q(\theta, \theta^{old}) involving \mu_k:

\sum_{n=1}^N r_{nk}\log p(x_n|z_k, \theta) = \sum_{n=1}^N r_{nk}\Big(-\frac{1}{2}\Big)(x_n - \mu_k)^T\Sigma_k^{-1}(x_n - \mu_k) + C

Setting the derivative to zero results in:

\sum_{n=1}^N r_{nk}(\mu_k - x_n) = 0

and we obtain 11.31:

\mu_k = \frac{\sum_{n=1}^N r_{nk}x_n}{\sum_{n=1}^N r_{nk}}

For the terms involving \Sigma_k in Q(\theta, \theta^{old}):

\sum_{n=1}^N r_{nk}\log p(x_n|z_k, \theta) = \sum_{n=1}^N r_{nk}\Big(-\frac{1}{2}\Big)\Big(\log|\Sigma_k| + (x_n - \mu_k)^T\Sigma_k^{-1}(x_n - \mu_k)\Big) + C

Using the same technique as in 4.1.3.1:

L(\Sigma_k^{-1} = \Lambda) = \Big(\sum_{n=1}^N r_{nk}\Big)\log|\Lambda| - Tr\Big\{\Big(\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T\Big)\Lambda\Big\}

The stationarity condition is:

\Big(\sum_{n=1}^N r_{nk}\Big)\Lambda^{-T} = \sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T

and we obtain 11.32:

\Sigma_k = \frac{\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N r_{nk}}
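The M-step updates 11.31 and 11.32 in code; a sketch (here R is assumed to be the N x K responsibility matrix r_{nk} from the E step, and the mixing-weight update pi_k = sum_n r_nk / N is the standard one, derived in problem 11.5 below):

```python
import numpy as np

def gmm_m_step(X, R):
    """Return (pi, mus, Sigmas) maximizing Q given responsibilities R (N x K)."""
    N, D = X.shape
    Nk = R.sum(axis=0)                                        # effective counts per component
    pi = Nk / N
    mus = (R.T @ X) / Nk[:, None]                             # 11.31
    Sigmas = []
    for k in range(R.shape[1]):
        Xc = X - mus[k]
        Sigmas.append((R[:, k, None] * Xc).T @ Xc / Nk[k])    # 11.32
    return pi, mus, np.stack(Sigmas)
```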

11.3 EM for mixtures of Bernoullis


During the MLE for mixtures of Bernoullis, consider (D = 2 marks the number of potential elements):

\frac{\partial}{\partial\mu_{kj}}\sum_{n=1}^N\sum_{k=1}^K r_{nk}\log p(x_n|\theta, k) = \sum_{n=1}^N r_{nk}\frac{\partial}{\partial\mu_{kj}}\Big(\sum_i^D x_{ni}\log\mu_{ki}\Big) = \sum_{n=1}^N r_{nk}\frac{x_{nj}}{\mu_{kj}}

Introduce a multiplier to enforce the constraint \sum_j\mu_{kj} = 1; then the condition for the derivative to be zero is:

\mu_{kj} = \frac{\sum_{n=1}^N r_{nk}x_{nj}}{\lambda}

Summing over all j:

1 = \sum_{j=1}^D\mu_{kj} = \frac{1}{\lambda}\sum_{j=1}^D\sum_{n=1}^N r_{nk}x_{nj} = \frac{1}{\lambda}\sum_{n=1}^N r_{nk}\sum_{j=1}^D x_{nj} = \frac{\sum_{n=1}^N r_{nk}}{\lambda}

results in:

\lambda = \sum_{n=1}^N r_{nk}

Hence 11.116.
Introduce a prior:

p(\mu_{k0}) \propto \mu_{k0}^{\alpha-1}\mu_{k1}^{\beta-1}

The zero-derivative condition becomes:

\mu_{k0} = \frac{\sum_{n=1}^N r_{nk}x_{n0} + \alpha - 1}{\lambda}, \qquad \mu_{k1} = \frac{\sum_{n=1}^N r_{nk}x_{n1} + \beta - 1}{\lambda}

And:

1 = \mu_{k0} + \mu_{k1} = \frac{1}{\lambda}\Big(\sum_{n=1}^N r_{nk}(x_{n0} + x_{n1}) + \alpha + \beta - 2\Big)

\lambda = \sum_{n=1}^N r_{nk} + \alpha + \beta - 2

Hence 11.117.

11.4 EM for mixture of Student distributions


The log-likelihood for the complete data set is:

l_c(x, z) = \log\Big(N\Big(x\Big|\mu, \frac{\Sigma}{z}\Big)Ga\Big(z\Big|\frac{v}{2}, \frac{v}{2}\Big)\Big)
= -\frac{D}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| + \frac{D}{2}\log z - \frac{z}{2}(x - \mu)^T\Sigma^{-1}(x - \mu) + \frac{v}{2}\log\frac{v}{2} - \log\Gamma\Big(\frac{v}{2}\Big) + \Big(\frac{v}{2} - 1\Big)\log z - \frac{v}{2}z

Collect the terms involving v:

l_v(x, z) = \frac{v}{2}\log\frac{v}{2} - \log\Gamma\Big(\frac{v}{2}\Big) + \frac{v}{2}(\log z - z)

The likelihood w.r.t. v over the complete data set is:

L_v = \frac{vN}{2}\log\frac{v}{2} - N\log\Gamma\Big(\frac{v}{2}\Big) + \frac{v}{2}\sum_{n=1}^N(\log z_n - z_n)

Setting the derivative to zero gives:

\frac{\nabla\Gamma(\frac{v}{2})}{\Gamma(\frac{v}{2})} - 1 - \log\frac{v}{2} = \frac{\sum_{n=1}^N(\log z_n - z_n)}{N}

For µ and Σ:

l_{\mu,\Sigma}(x, z) = -\frac{1}{2}\log|\Sigma| - \frac{z}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)

L_{\mu,\Sigma} = -\frac{N}{2}\log|\Sigma| - \frac{1}{2}\sum_{n=1}^N z_n(x_n - \mu)^T\Sigma^{-1}(x_n - \mu)

Hence it equals the (weighted) MLE used for an MVN.

11.5 Gradient descent for fitting GMM


From the given information:

p(x|\theta) = \sum_k\pi_kN(x|\mu_k, \Sigma_k), \qquad l(\theta) = \sum_{n=1}^N\log p(x_n|\theta)

Derivative w.r.t. \mu_k:

\frac{\partial}{\partial\mu_k}l(\theta) = \sum_{n=1}^N\frac{\pi_kN(x_n|\mu_k, \Sigma_k)\,\nabla_{\mu_k}\big\{-\frac{1}{2}(x_n - \mu_k)^T\Sigma_k^{-1}(x_n - \mu_k)\big\}}{\sum_{k'=1}^K\pi_{k'}N(x_n|\mu_{k'}, \Sigma_{k'})} = \sum_{n=1}^N r_{nk}\Sigma_k^{-1}(x_n - \mu_k)

w.r.t. \pi_k:

\frac{\partial}{\partial\pi_k}l(\theta) = \sum_{n=1}^N\frac{N(x_n|\mu_k, \Sigma_k)}{\sum_{k'=1}^K\pi_{k'}N(x_n|\mu_{k'}, \Sigma_{k'})} = \frac{1}{\pi_k}\sum_{n=1}^N r_{nk}

Using a Lagrange multiplier ends in:

\pi_k = \frac{\sum_{n=1}^N r_{nk}}{\lambda}

Summing over k and normalizing:

\pi_k = \frac{\sum_{n=1}^N r_{nk}}{N}

For \Sigma_k:

\frac{\partial}{\partial\Sigma_k}l(\theta) = \sum_{n=1}^N\frac{\pi_k\nabla_{\Sigma_k}N(x_n|\mu_k, \Sigma_k)}{\sum_{k'=1}^K\pi_{k'}N(x_n|\mu_{k'}, \Sigma_{k'})}

where:

\nabla_{\Sigma_k}N(x|\mu_k, \Sigma_k) = N(x|\mu_k, \Sigma_k)\Big\{\nabla_{\Sigma_k}\Big(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\Big) - \frac{1}{2}\nabla_{\Sigma_k}\log|\Sigma_k|\Big\} = N(x|\mu_k, \Sigma_k)\,\nabla_{\Sigma_k}\big(\log N(x|\mu_k, \Sigma_k)\big)

Thus we have:

\Sigma_k = \frac{\sum_{n=1}^N r_{nk}(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N r_{nk}}

11.6 EM for a finite scale mixture of Gaussians


J and K are independent. Using Bayes' rule (we omit θ from the conditioning
w.l.o.g.):

    p(J_n = j, K_n = k|x_n) = p(J_n = j, K_n = k, x_n) / p(x_n)
                            = p(J_n = j) p(K_n = k) p(x_n|J_n = j, K_n = k) / Σ_{J_n, K_n} p(J_n, K_n, x_n)
                            = p_j q_k N(x_n|µ_j, σ_k²) / Σ_{J_n=1}^m Σ_{K_n=1}^l p_{J_n} q_{K_n} N(x_n|µ_{J_n}, σ_{K_n}²)

Derive the form of the auxiliary function Q(θ^new, θ^old):

    Q(θ^new, θ^old) = Σ_{n=1}^N E_{θ^old}[log p(x_n, J_n, K_n|θ^new)]
                    = Σ_{n=1}^N E[log Π_{j=1}^m Π_{k=1}^l p(x_n, J_n, K_n|θ^new)^{I(J_n=j, K_n=k)}]
                    = Σ_{n=1}^N Σ_{j=1}^m Σ_{k=1}^l E[I(J_n = j, K_n = k)](log p_j + log q_k + log N(x_n|µ_j, σ_k²))
                    = Σ_{n,j,k} r_njk log p_j + Σ_{n,j,k} r_njk log q_k + Σ_{n,j,k} r_njk log N(x_n|µ_j, σ_k²)

We are to optimize the parameters p, q, µ and σ². It is noticeable that p and q
can be optimized independently. Now fix σ² and optimize µ:

    ∂/∂µ_j Σ_{n,j',k} r_nj'k log N(x_n|µ_{j'}, σ_k²) = Σ_{n,k} r_njk ∇_{µ_j} log N(x_n|µ_j, σ_k²)
                                                     = Σ_{n,k} r_njk (x_n − µ_j)/σ_k²

and we end up with:

    µ_j = Σ_{n,k} r_njk x_n/σ_k² / Σ_{n,k} r_njk/σ_k²

11.7 Manual calculation of the M step for a GMM


Practise by yourself.

11.8 Moments of a mixture of Gaussians


For the expectation of the mixture distribution:

    E(x) = ∫ x Σ_k π_k N(x|µ_k, Σ_k) dx
         = Σ_k π_k (∫ x N(x|µ_k, Σ_k) dx)
         = Σ_k π_k µ_k

Using cov(x) = E(xx^T) − E(x)E(x)^T, we have:

    E(xx^T) = ∫ xx^T Σ_k π_k N(x|µ_k, Σ_k) dx
            = Σ_k π_k ∫ xx^T N(x|µ_k, Σ_k) dx

where:

    ∫ xx^T N(x|µ_k, Σ_k) dx = E_{N(µ_k,Σ_k)}(xx^T)
                             = cov_{N(µ_k,Σ_k)}(x) + E_{N(µ_k,Σ_k)}(x) E_{N(µ_k,Σ_k)}(x)^T
                             = Σ_k + µ_k µ_k^T

Therefore:

    cov(x) = Σ_k π_k (Σ_k + µ_k µ_k^T) − E(x)E(x)^T
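The two identities above are easy to verify numerically. Below is a small
sketch (component parameters chosen arbitrarily, names mine) that compares the
analytic mixture moments with Monte Carlo estimates.

# Numeric check: E[x] = sum_k pi_k mu_k,
# cov[x] = sum_k pi_k (Sigma_k + mu_k mu_k^T) - E[x]E[x]^T.
import numpy as np

pi = np.array([0.3, 0.7])
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigmas = np.array([np.eye(2), 2.0 * np.eye(2)])

mean = np.einsum('k,kd->d', pi, mus)
second_moment = sum(p * (S + np.outer(m, m)) for p, m, S in zip(pi, mus, Sigmas))
cov = second_moment - np.outer(mean, mean)

# Monte Carlo confirmation
rng = np.random.default_rng(0)
n = 100_000
z = rng.choice(2, size=n, p=pi)
draws = np.stack([rng.multivariate_normal(m, S, size=n) for m, S in zip(mus, Sigmas)])
samples = draws[z, np.arange(n)]
assert np.allclose(samples.mean(axis=0), mean, atol=0.1)
assert np.allclose(np.cov(samples.T), cov, atol=0.1)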

11.9 K-means clustering by hand


Practise by yourself.

11.10 Deriving the K-means cost function


For each cluster k, apply 11.134 to the inner and outer sums:

    Σ_{i:z_i=k} Σ_{i':z_{i'}=k} (x_i − x_{i'})² = Σ_{i:z_i=k} (n_k s² + n_k (x̄_k − x_i)²)
                                                = n_k² s² + n_k (n_k s²)
                                                = 2 n_k² s²

On the other hand, the right side of 11.131 for cluster k is:

    n_k Σ_{i:z_i=k} (x_i − x̄_k)² = n_k (n_k s²)

Hence Σ_{i,i' in cluster k} (x_i − x_{i'})² = 2 n_k Σ_{i:z_i=k} (x_i − x̄_k)², and
summing over k yields 11.131.

11.11 Visible mixtures of Gaussians are in the exponential family

Encode the latent variable with a one-hot code, z_c = I(x is generated from the
c-th component); then (omitting θ from the conditioning w.l.o.g.):

    p(z) = Π_{c=1}^C π_c^{z_c}

    p(x|z) = Π_{c=1}^C ( (1/√(2πσ_c²)) exp{−(x − µ_c)²/(2σ_c²)} )^{z_c}

The log of the joint distribution is:

    log p(x, z) = log Π_{c=1}^C ( (π_c/√(2πσ_c²)) exp{−(x − µ_c)²/(2σ_c²)} )^{z_c}
                = Σ_{c=1}^C z_c (log π_c − (1/2) log(2πσ_c²) − (x − µ_c)²/(2σ_c²))

which is a sum of inner products, hence an exponential family. The sufficient
statistics are linear combinations of z, zx and zx².

11.12 EM for robust linear regression with a Student t likelihood

Using the complete-data likelihood terms involving w derived in 11.4.5:

    L_N(w) = −(1/(2σ²)) Σ_{n=1}^N z_n (y_n − w^T x_n)² + const

Set the derivative to zero:

    w^T Σ_{n=1}^N z_n x_n x_n^T = Σ_{n=1}^N z_n y_n x_n^T

This means:

    w^T = (Σ_{n=1}^N z_n y_n x_n^T)(Σ_{n=1}^N z_n x_n x_n^T)^{-1}

11.13 EM for EB estimation of Gaussian shrinkage model


For every j, 5.90 takes a different form (this corresponds to the E-step):

    p(x̄_j|µ, t², σ²) = N(x̄_j|µ, t² + σ_j²)

Integrating out θ_j, the marginal log-likelihood is given by:

    log Π_{j=1}^D N(x̄_j|µ, t² + σ_j²) = Σ_{j=1}^D (−1/2)( log 2π(t² + σ_j²) + (x̄_j − µ)²/(t² + σ_j²) )

Then we optimize each parameter in turn (this corresponds to the M-step):

    µ = Σ_{j=1}^D x̄_j/(t² + σ_j²) / Σ_{j=1}^D 1/(t² + σ_j²)

and t² satisfies:

    Σ_{j=1}^D [ (t² + σ_j²) − (x̄_j − µ)² ] / (t² + σ_j²)² = 0

11.14 EM for censored linear regression


Unsolved.

11.15 Posterior mean and variance of a truncated Gaussian

We denote A = (c_i − µ_i)/σ. For the mean:

    E[z_i|z_i ≥ c_i] = µ_i + σ E[ε_i|ε_i ≥ A]

And we have:

    E[ε_i|ε_i ≥ A] = (1/p(ε_i ≥ A)) ∫_A^{+∞} ε_i N(ε_i|0, 1) dε_i = φ(A)/(1 − Φ(A)) = H(A)

In the last step we used 11.141 and 11.139. Plugging this in:

    E[z_i|z_i ≥ c_i] = µ_i + σ H(A)

Now calculate the expectation of the squared term:

    E[z_i²|z_i ≥ c_i] = µ_i² + 2µ_i σ E[ε_i|ε_i ≥ A] + σ² E[ε_i²|ε_i ≥ A]

To address E[ε_i²|ε_i ≥ A], expand the hint from the question:

    d/dw (w N(w|0, 1)) = N(w|0, 1) − w² N(w|0, 1)

We have:

    ∫_b^c w² N(w|0, 1) dw = Φ(c) − Φ(b) − c N(c|0, 1) + b N(b|0, 1)

    E[ε_i²|ε_i ≥ A] = (1/p(ε_i ≥ A)) ∫_A^{+∞} w² N(w|0, 1) dw = (1 − Φ(A) + A φ(A))/(1 − Φ(A))

Plug this into the conclusion drawn from part a:

    E[z_i²|z_i ≥ c_i] = µ_i² + 2µ_i σ H(A) + σ² (1 − Φ(A) + A φ(A))/(1 − Φ(A))
                      = µ_i² + σ² + H(A)(σ c_i + σ µ_i)
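These truncated-Gaussian moments can be checked numerically. The following
sketch (parameter values arbitrary, names mine) compares the formulas with
brute-force sampling, using scipy's φ and Φ.

# Sanity check of E[z|z >= c] and E[z^2|z >= c] against samples.
import numpy as np
from scipy.stats import norm

mu, sigma, c = 1.0, 2.0, 1.5
A = (c - mu) / sigma
H = norm.pdf(A) / (1.0 - norm.cdf(A))

mean_trunc = mu + sigma * H
second_trunc = mu**2 + sigma**2 + H * (sigma * c + sigma * mu)

rng = np.random.default_rng(0)
z = rng.normal(mu, sigma, size=2_000_000)
z = z[z >= c]
assert np.isclose(z.mean(), mean_trunc, rtol=1e-2)
assert np.isclose((z**2).mean(), second_trunc, rtol=1e-2)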



12 Latent linear models


12.1 M-step for FA
Review the EM for FA (Factor Analysis) first. Basically we have (centralize X
to cancel µ w.l.o.g.):

    p(z) = N(z|0, I)

    p(x|z) = N(x|Wz, Ψ)

And:

    p(z|x) = N(z|m, Σ)

    Σ = (I + W^T Ψ^{-1} W)^{-1}

    m = Σ W^T Ψ^{-1} x_n

Denote x_n's latent variable by z_n. The log-likelihood for the complete data
set {x, z} is:

    log Π_{n=1}^N p(x_n, z_n) = Σ_{n=1}^N (log p(z_n) + log p(x_n|z_n))

The prior term log p(z) can be omitted since its parameters are fixed at 0 and
I, hence:

    Q(θ, θ^old) = E_{θ^old}[Σ_{n=1}^N log p(x_n|z_n, θ)]
                = E[Σ_{n=1}^N (c − (1/2) log|Ψ| − (1/2)(x_n − Wz_n)^T Ψ^{-1} (x_n − Wz_n))]
                = C − (N/2) log|Ψ| − (1/2) Σ_{n=1}^N E[(x_n − Wz_n)^T Ψ^{-1} (x_n − Wz_n)]
                = C − (N/2) log|Ψ| − (1/2) Σ_{n=1}^N x_n^T Ψ^{-1} x_n
                  − (1/2) Σ_{n=1}^N Tr(W^T Ψ^{-1} W E[z_n z_n^T]) + Σ_{n=1}^N x_n^T Ψ^{-1} W E[z_n]

As long as p(z|x, θ^old) = N(z|m, Σ), we have:

    E[z_n|x_n] = Σ W^T Ψ^{-1} x_n

    E[z_n z_n^T|x_n] = cov(z_n|x_n) + E[z_n|x_n]E[z_n|x_n]^T = Σ + (Σ W^T Ψ^{-1} x_n)(Σ W^T Ψ^{-1} x_n)^T

From now on, x and θ^old are omitted from the conditioning when calculating
expectations.
Optimize w.r.t. W:

    ∂Q/∂W = Σ_{n=1}^N Ψ^{-1} x_n E[z_n]^T − Σ_{n=1}^N Ψ^{-1} W E[z_n z_n^T]

Set it to zero:

    W = (Σ_{n=1}^N x_n E[z_n]^T)(Σ_{n=1}^N E[z_n z_n^T])^{-1}

Optimize w.r.t. Ψ^{-1}:

    ∂Q/∂Ψ^{-1} = (N/2) Ψ − (1/2) Σ_{n=1}^N x_n x_n^T − (1/2) Σ_{n=1}^N W E[z_n z_n^T] W^T + Σ_{n=1}^N W E[z_n] x_n^T

Plugging in the expression for W:

    Ψ = (1/N) Σ_{n=1}^N (x_n x_n^T − W E[z_n] x_n^T)

Since Ψ is assumed to be a diagonal matrix:

    Ψ = (1/N) diag( Σ_{n=1}^N (x_n x_n^T − W E[z_n] x_n^T) )

This solution comes from "The EM Algorithm for Mixtures of Factor Analyzers",
Zoubin Ghahramani and Geoffrey E. Hinton, 1996, where the EM for mixtures of FA
is given as well.
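The E- and M-step expressions above can be put together as follows. This is a
minimal sketch of one EM iteration for FA (names mine; X assumed centered), not
the reference implementation from the cited paper.

# One EM iteration for factor analysis, following the updates above.
import numpy as np

def fa_em_step(X, W, Psi):
    N, D = X.shape
    L = W.shape[1]
    Psi_inv = np.diag(1.0 / np.diag(Psi))
    # E-step: posterior p(z|x) = N(m_n, Sigma), with shared Sigma
    Sigma = np.linalg.inv(np.eye(L) + W.T @ Psi_inv @ W)
    M = X @ Psi_inv @ W @ Sigma            # rows are E[z_n]^T
    Ezz = N * Sigma + M.T @ M              # sum_n E[z_n z_n^T]
    # M-step: W and diagonal Psi
    W_new = (X.T @ M) @ np.linalg.inv(Ezz)
    S = X.T @ X
    Psi_new = np.diag(np.diag(S - W_new @ M.T @ X) / N)
    return W_new, Psi_new

# Example usage on synthetic data
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
W_true = rng.normal(size=(5, 2))
X = Z @ W_true.T + 0.1 * rng.normal(size=(500, 5))
X -= X.mean(axis=0)
W, Psi = rng.normal(size=(5, 2)), np.eye(5)
for _ in range(50):
    W, Psi = fa_em_step(X, W, Psi)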

12.2 MAP estimation for the FA model


Assume priors p(W) and p(Ψ). Compared with the previous question, the M-step
needs to be modified:

    ∂/∂W (Q + log p(W)) = 0

    ∂/∂Ψ (Q + log p(Ψ)) = 0

12.3 Heuristic for assessing applicability of PCA*


This requires figures for illustration, which are omitted here.

12.4 Deriving the second principal component


For:

    J(v_2, z_2) = (1/N) Σ_{n=1}^N (x_n − z_{n1} v_1 − z_{n2} v_2)^T (x_n − z_{n1} v_1 − z_{n2} v_2)

Consider the derivative w.r.t. one component of z_2:

    ∂J/∂z_{m2} = (1/N)(2 z_{m2} v_2^T v_2 − 2 v_2^T (x_m − z_{m1} v_1)) = 0

Using v_2^T v_2 = 1 and v_2^T v_1 = 0 yields:

    z_{m2} = v_2^T x_m

Since C is symmetric, we use the constraints on v_1 and v_2 and apply an
eigendecomposition to C first:

    C = O^T Λ O

where:

    Λ = diag{λ_1, λ_2, ...}

contains C's eigenvalues from the largest to the smallest, and

    O^T = {u_1, u_2, ...}

contains the eigenvectors, which are orthonormal, u_i^T u_j = I(i = j), with
u_1 = v_1.
Under the constraints v_2^T v_2 = 1 and v_2^T v_1 = 0, we are to maximize:

    (O v_2)^T Λ (O v_2)

Notice that O v_2 is a rotation of v_2, so its length is unchanged, and
(O v_2)^T Λ (O v_2) is the sum of the vector's squared components weighted by
the eigenvalues in Λ. Hence, given the orthogonality to u_1, the optimum is
reached by putting all the length on the component associated with the largest
remaining eigenvalue, which means:

    u_i^T v_2 = I(i = 2)

Therefore:

    v_2 = u_2

12.5 Deriving the residual error for PCA

    ||x_n − Σ_{j=1}^K z_nj v_j||² = (x_n − Σ_{j=1}^K z_nj v_j)^T (x_n − Σ_{j=1}^K z_nj v_j)
                                  = x_n^T x_n + Σ_{j=1}^K z_nj² − 2 Σ_{j=1}^K z_nj x_n^T v_j

Using v_i^T v_j = I(i = j) and z_nj = x_n^T v_j, we end up with the conclusion
of part a:

    ||x_n − Σ_{j=1}^K z_nj v_j||² = x_n^T x_n − Σ_{j=1}^K v_j^T x_n x_n^T v_j

Plugging in v_j^T C v_j = λ_j and summing over n draws the conclusion in part b.
Plugging K = d into the conclusion of part b, we have:

    J_{K=d} = (1/N) Σ_{n=1}^N x_n^T x_n − Σ_{j=1}^d λ_j = 0

In general:

    J_K = Σ_{j=1}^d λ_j − Σ_{j=1}^K λ_j = Σ_{j=K+1}^d λ_j

12.6 Derivation of Fisher’s linear discriminant


Straightforward algebra.
(need reference)

12.7 PCA via successive deflation


This problem involves the same technique used in solving 12.4, hence
omitted.

12.8 Latent semantic indexing


Practice by yourself.

12.9 Imputation in a FA model*


Unsolved. (The notation x_v and x_h in the problem is unclear to the author.)

12.10 Efficiently evaluating the PPCA density


With:

    p(z) = N(z|0, I)

    p(x|z) = N(x|Wz, σ² I)

use the conclusion from chapter 4:

    p(x) = N(x|0, σ² I + W W^T)

The derivation of the MLE in 12.2.4 can be found in "Probabilistic Principal
Component Analysis", Michael E. Tipping and Christopher M. Bishop, 1999.
Plugging in the MLE, the inverse of the (D × D) covariance matrix can be
computed via the matrix inversion lemma:

    (σ² I + W W^T)^{-1} = σ^{-2} I − σ^{-2} W (W^T W + σ² I)^{-1} W^T

which involves inverting only an L × L matrix.
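The matrix inversion lemma step can be verified numerically; the sketch below
(arbitrary sizes, names mine) checks that the L × L route gives the same
inverse as direct D × D inversion.

# Woodbury check for the PPCA covariance (sigma^2 I + W W^T)^{-1}.
import numpy as np

rng = np.random.default_rng(0)
D, L, sigma2 = 50, 3, 0.5
W = rng.normal(size=(D, L))

C = sigma2 * np.eye(D) + W @ W.T
C_inv_direct = np.linalg.inv(C)
C_inv_woodbury = (np.eye(D) / sigma2
                  - (W @ np.linalg.inv(W.T @ W + sigma2 * np.eye(L)) @ W.T) / sigma2)

assert np.allclose(C_inv_direct, C_inv_woodbury)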

12.11 PPCA vs FA
Practice by yourself.

13 Sparse linear models


13.1 Partial derivative of the RSS
Define:

    RSS(w) = Σ_{n=1}^N (y_n − w^T x_n)²

Straightforwardly:

    ∂RSS(w)/∂w_j = Σ_{n=1}^N 2(y_n − w^T x_n)(−x_nj)
                 = −Σ_{n=1}^N 2(x_nj y_n − x_nj Σ_{i=1}^D w_i x_ni)
                 = −Σ_{n=1}^N 2(x_nj y_n − x_nj Σ_{i≠j} w_i x_ni − x_nj² w_j)

The coefficient of w_j is:

    a_j = 2 Σ_{n=1}^N x_nj²

and the remaining terms can be absorbed into:

    c_j = 2 Σ_{n=1}^N x_nj (y_n − w_{−j}^T x_{n,−j})

In the end:

    w_j = c_j / a_j
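The quantities a_j and c_j give the coordinate-wise least-squares update
w_j = c_j/a_j. A minimal sketch follows (names mine; the lasso soft-thresholding
of 13.3.2 would be applied on top of this update).

# Coordinate update based on a_j and c_j as defined above.
import numpy as np

def coordinate_update(X, y, w, j):
    a_j = 2.0 * np.sum(X[:, j] ** 2)
    r_j = y - X @ w + X[:, j] * w[j]      # residual excluding feature j
    c_j = 2.0 * np.sum(X[:, j] * r_j)
    return c_j / a_j

# Example: cycling the update converges to the least-squares solution
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.normal(size=100)
w = np.zeros(5)
for _ in range(100):
    for j in range(5):
        w[j] = coordinate_update(X, y, w, j)
print(w)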

13.2 Derivation of M-step for EB for linear regression


We give the EM for Automatic Relevance Determination (ARD). For the linear
regression setting:

    p(y|x, w, β) = N(y|Xw, β^{-1})

    p(w) = N(w|0, A^{-1})

    A = diag(α)

In the E-step, we estimate the posterior of w. Using the linear-Gaussian
relationship:

    p(w|y, α, β) = N(µ, Σ)

    Σ^{-1} = A + β X^T X

    µ = Σ(β X^T y)

Then:

    E_{α,β}[w] = µ

    E_{α,β}[w w^T] = Σ + µ µ^T

For the auxiliary function:

    Q(α, β, α^old, β^old) = E_{α^old, β^old}[log p(y, w|α, β)]
                          = E[log p(y|w, β) + log p(w|α)]
                          = (1/2) E[N log β − β(y − Xw)^T (y − Xw) + Σ_j log α_j − w^T A w]

The E-step thus only requires E[w] and E[w w^T], which have been computed.
Introduce a prior for each component of α and for β:

    p(α, β) = Π_j Ga(α_j|a + 1, b) · Ga(β|c + 1, d)

Hence the posterior auxiliary function is:

    Q' = Q + log p(α, β) = Q + Σ_j (a log α_j − b α_j) + (c log β − d β)

In the M-step, optimize w.r.t. α_i:

    ∂Q'/∂α_i = 1/(2α_i) − E[w_i²]/2 + a/α_i − b

Setting it to zero gives:

    α_i = (1 + 2a)/(E[w_i²] + 2b)

Optimize w.r.t. β:

    ∂Q'/∂β = N/(2β) − E[||y − Xw||²]/2 + c/β − d

which ends in:

    β = (N + 2c)/(E[||y − Xw||²] + 2d)

Expanding the expectation ends in 13.168.
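The E-step and the two M-step updates above can be iterated as follows. This is
a minimal sketch under my own choice of the hyper-parameters a, b, c, d (and
names), not the book's code.

# EM/ARD updates for Bayesian linear regression.
import numpy as np

def ard_em(X, y, n_iter=50, a=1e-3, b=1e-3, c=1e-3, d=1e-3):
    N, D = X.shape
    alpha, beta = np.ones(D), 1.0
    for _ in range(n_iter):
        # E-step: posterior over w
        Sigma = np.linalg.inv(np.diag(alpha) + beta * X.T @ X)
        mu = beta * Sigma @ X.T @ y
        Eww = Sigma + np.outer(mu, mu)
        # M-step: update alpha_i and beta
        alpha = (1.0 + 2.0 * a) / (np.diag(Eww) + 2.0 * b)
        resid2 = np.sum((y - X @ mu) ** 2) + np.trace(X @ Sigma @ X.T)   # E[||y-Xw||^2]
        beta = (N + 2.0 * c) / (resid2 + 2.0 * d)
    return mu, alpha, beta

# Irrelevant features get large alpha (strong shrinkage towards zero)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)
mu, alpha, beta = ard_em(X, y)
print(mu, alpha)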

13.3 Derivation of fixed point updates for EB for linear regression*

Unsolved.

13.4 Marginal likelihood for linear regression*


Straightforward algebra.

13.5 Reducing elastic net to lasso


Expand both sides of 13.196. The right side:

    J_1(cw) = (y − cXw)^T (y − cXw) + c² λ_2 w^T w + c λ_1 ||w||_1
            = y^T y + c² w^T X^T X w − 2c y^T X w + c² λ_2 w^T w + c λ_1 ||w||_1

The left side:

    J_2(w) = ( y − cXw ; −c√λ_2 w )^T ( y − cXw ; −c√λ_2 w ) + c λ_1 ||w||_1
           = (y − cXw)^T (y − cXw) + c² λ_2 w^T w + c λ_1 ||w||_1
           = y^T y + c² w^T X^T X w − 2c y^T X w + c² λ_2 w^T w + c λ_1 ||w||_1

Hence 13.196 and 13.195 are equal.
This shows that elastic net regularization, whose penalty is a linear
combination of the l1 and l2 norms, reduces to a lasso problem on suitably
modified data.

13.6 Shrinkage in linear regression


For ordinary least squares:

    RSS(w) = (y − Xw)^T (y − Xw)

Using X^T X = I:

    RSS(w) = c + w^T w − 2 y^T X w

Take the derivative:

    ∂RSS(w)/∂w_k = 2 w_k − 2 Σ_{n=1}^N y_n x_nk

We have:

    ŵ_k^{OLS} = Σ_{n=1}^N y_n x_nk

In ridge regression:

    RSS(w) = (y − Xw)^T (y − Xw) + λ w^T w

Taking the derivative and setting it to zero:

    (2 + 2λ) w_k = 2 Σ_{n=1}^N y_n x_nk

Thus:

    ŵ_k^{ridge} = Σ_{n=1}^N y_n x_nk / (1 + λ)

The solution for lasso regression using the subderivative is derived in 13.3.2,
which concludes in 13.63:

    ŵ_k^{lasso} = sign(ŵ_k^{OLS})(|ŵ_k^{OLS}| − λ/2)_+

Observing figure 13.24, it is easy to identify the black line as OLS, the gray
one as ridge and the dotted one as lasso, with λ_1 = λ_2 = 1. It is noticeable
that ridge shrinks the estimate towards the horizontal axis, while lasso
shrinks it sharply to zero below a certain threshold.
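The three estimators can be plotted against ŵ_k^{OLS} to reproduce the
qualitative behaviour of figure 13.24; a small sketch (λ value mine) follows.

# OLS vs ridge vs lasso shrinkage under an orthonormal design.
import numpy as np
import matplotlib.pyplot as plt

lam = 1.0
w_ols = np.linspace(-3, 3, 200)
w_ridge = w_ols / (1.0 + lam)
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam / 2.0, 0.0)

plt.plot(w_ols, w_ols, 'k-', label='OLS')
plt.plot(w_ols, w_ridge, color='gray', label='ridge')
plt.plot(w_ols, w_lasso, 'k:', label='lasso')
plt.xlabel('OLS estimate'); plt.ylabel('shrunken estimate'); plt.legend()
plt.show()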

13.7 Prior for the Bernoulli rate parameter in the spike and slab model

    p(γ|α_1, α_2) = Π_{d=1}^D p(γ_d|α_1, α_2)

Integrate out π_d:

    p(γ_d|α_1, α_2) = (1/B(α_1, α_2)) ∫ p(γ_d|π_d) p(π_d|α_1, α_2) B(α_1, α_2) dπ_d
                    = (1/B(α_1, α_2)) ∫ π_d^{γ_d} (1 − π_d)^{1−γ_d} π_d^{α_1−1} (1 − π_d)^{α_2−1} dπ_d
                    = (1/B(α_1, α_2)) ∫ π_d^{α_1+γ_d−1} (1 − π_d)^{α_2+1−γ_d−1} dπ_d
                    = B(α_1 + γ_d, α_2 + 1 − γ_d)/B(α_1, α_2)
                    = [Γ(α_1 + α_2)/(Γ(α_1)Γ(α_2))] · Γ(α_1 + γ_d)Γ(α_2 + 1 − γ_d)/Γ(α_1 + α_2 + 1)

Using Γ(x + 1) = xΓ(x), this equals α_1/(α_1 + α_2) when γ_d = 1 and
α_2/(α_1 + α_2) when γ_d = 0. Therefore (N_1 marks the number of 1s in γ):

    p(γ|α_1, α_2) = α_1^{N_1} α_2^{N−N_1} / (α_1 + α_2)^N

And:

    log p(γ|α_1, α_2) = N log(α_2/(α_1 + α_2)) + N_1 log(α_1/α_2)

13.8 Deriving E step for GSM prior

    Lap(w_j|0, 1/γ) = ∫ N(w_j|0, τ_j²) Ga(τ_j²|1, γ²/2) dτ_j²

(One could also take a Laplace/moment-generating transform of both sides.) We
are to calculate:

    E[1/τ_j² | w_j] = ∫ (1/τ_j²) p(τ_j²|w_j) dτ_j² = ∫ (1/τ_j²) p(w_j|τ_j²) p(τ_j²)/p(w_j) dτ_j²
                    = (1/p(w_j)) ∫ (1/τ_j²) N(w_j|0, τ_j²) p(τ_j²) dτ_j²

According to 13.200, this reduces to:

    (1/p(w_j)) · (−1/|w_j|) · d/dw_j ∫ N(w_j|0, τ_j²) p(τ_j²) dτ_j²

Because:

    d/dw log p(w) = (1/p(w)) d/dw p(w)

this gives 13.197:

    (1/p(w_j)) · (−1/|w_j|) · d/dw_j p(w_j) = −(1/|w_j|) · d/dw_j log p(w_j)

Note: this solution is questionable; there may be typographical errors in both
Hint 1 and Hint 2 of the exercise.

13.9 EM for sparse probit regression with Laplace prior


Straightforward probit regression involves no latent variable. Introducing a
Laplace prior on the linear coefficients w results in its lasso version. Since
the Laplace distribution is a continuous mixture of Gaussians, a latent
variable τ² with the same dimension as w is introduced. The PGM for probit
regression looks like:

    γ → τ² → w → y ← X

The joint distribution is:

    p(γ, τ², w, y|X) = p(γ) Π_{d=1}^D p(τ_d²|γ) Π_{d=1}^D p(w_d|τ_d²) Π_{n=1}^N Φ(w^T x_n)^{y_n} (1 − Φ(w^T x_n))^{1−y_n}

For conciseness, we treat γ as a constant; according to 13.86:

    p(τ_d²|γ) = Ga(τ_d²|1, γ²/2)

    p(w_d|τ_d²) = N(w_d|0, τ_d²)

Hence:

    p(τ², w, y|X, γ) ∝ exp{−(1/2) Σ_{d=1}^D (γ² τ_d² + w_d²/τ_d²)} · Π_{d=1}^D (1/τ_d)
                       · Π_{n=1}^N Φ(w^T x_n)^{y_n} (1 − Φ(w^T x_n))^{1−y_n}

In Q(θ^new, θ^old), the expectation is taken w.r.t. θ^old. We have assumed that
w is the parameter and τ² is the latent variable, thus:

    Q(w, w^old) = E_{w^old}[log p(y, τ²|w)]

Now extract the terms involving w from log p(τ², w, y):

    log p(y, τ²|w) = c − (1/2) Σ_{d=1}^D w_d²/τ_d² + Σ_{n=1}^N [y_n log Φ(w^T x_n) + (1 − y_n) log(1 − Φ(w^T x_n))]

Thus we only need to calculate one expectation in the E-step:

    E[1/τ_d² | w^old]

which can be done as in 13.4.4.3, because probit and linear regression share
the same PGM up to this stage.
The M-step is the same as for Gaussian-prior probit regression, hence omitted.

13.10 GSM representation of group lasso*


Follow the hints and straightforward algebra.

13.11 Projected gradient descent for l1 regularized least squares

Generally, we take the gradient w.r.t. w and optimize. When there are
constraints on w that could be violated by a gradient step, the increment has
to be modified to stay within the constraints.
We are to calculate:

    min_w { NLL(w) + λ||w||_1 }

Consider the linear regression context:

    NLL(w) = (1/2)||y − Xw||_2²

Since λ||w||_1 cannot be differentiated, we need a non-trivial reformulation.
It is suggested to write:

    w = u − v

    u_i = (w_i)_+ = max{0, w_i}

    v_i = (−w_i)_+ = max{0, −w_i}

With u ≥ 0 and v ≥ 0, we then have:

    ||w||_1 = 1_n^T u + 1_n^T v

The original problem becomes:

    min_{u,v} { (1/2)||y − X(u − v)||_2² + λ 1_n^T u + λ 1_n^T v }
    s.t. u ≥ 0, v ≥ 0

Denote:

    z = (u; v)

Rewrite the original target:

    min_z { f(z) = c^T z + (1/2) z^T A z }
    s.t. z ≥ 0

where:

    c = ( λ1_n − X^T y ; λ1_n + X^T y )

    A = (  X^T X   −X^T X
          −X^T X    X^T X )

The gradient is given by:

    ∇f(z) = c + Az

For ordinary gradient descent:

    z^{k+1} = z^k − α ∇f(z^k)

For the projected case, take g^k with components:

    g_i^k = min{ z_i^k, α (∇f(z^k))_i }

and during the iteration:

    z^{k+1} = z^k − g^k

The original paper suggests a more delicate method to adapt the learning rate;
refer to "Gradient Projection for Sparse Reconstruction: Application to
Compressed Sensing and Other Inverse Problems", Mario A. T. Figueiredo et al.
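Below is a minimal sketch of the projected gradient iteration described above,
with a fixed step size chosen from the largest eigenvalue of A (names mine; the
adaptive step-size rules of the cited paper are not implemented).

# Projected gradient descent for l1-regularized least squares.
import numpy as np

def l1_ls_projected_gradient(X, y, lam, alpha=None, n_iter=500):
    N, D = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    A = np.block([[XtX, -XtX], [-XtX, XtX]])
    c = np.concatenate([lam - Xty, lam + Xty])
    if alpha is None:
        alpha = 1.0 / np.linalg.eigvalsh(A).max()   # conservative step size
    z = np.zeros(2 * D)
    for _ in range(n_iter):
        grad = c + A @ z
        z = np.maximum(z - alpha * grad, 0.0)       # gradient step + projection onto z >= 0
    u, v = z[:D], z[D:]
    return u - v

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10); w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.05 * rng.normal(size=100)
print(l1_ls_projected_gradient(X, y, lam=5.0))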

13.12 Subderivative of the hinge loss function

    ∂f(θ) = {−1}      if θ < 1
    ∂f(θ) = [−1, 0]   if θ = 1
    ∂f(θ) = {0}       if θ > 1

13.13 Lower bounds to convex functions


Refer to ”Rigorous Affine Lower Bound Functions for Multivariate
Polynomials and Their Use in Global Optimisation”.

14 Kernels

15 Gaussian processes
15.1 Reproducing property
We denote κ(x_1, x) by f(x) and κ(x_2, x) by g(x). From the definition:

    f(x) = Σ_{i=1}^∞ f_i φ_i(x)

    κ(x_1, x) = Σ_{i=1}^∞ λ_i φ_i(x_1) φ_i(x)

Since x can be chosen arbitrarily, the following identities hold (the one for g
is obtained similarly):

    f_i = λ_i φ_i(x_1)

    g_i = λ_i φ_i(x_2)

Therefore:

    <κ(x_1, ·), κ(x_2, ·)> = <f, g>
                           = Σ_{i=1}^∞ f_i g_i / λ_i
                           = Σ_{i=1}^∞ λ_i φ_i(x_1) φ_i(x_2)
                           = κ(x_1, x_2)

16 Adaptive basis function models


16.1 Nonlinear regression for inverse dynamics
Practise by yourself.

17 Markov and hidden Markov models


17.1 Derivation of Q function for HMM
First, we estimate the distribution of z_{1:T} w.r.t. θ^old; for the auxiliary
function, we are to calculate the expected log-likelihood w.r.t. θ and z_{1:T}:

    Q(θ, θ^old) = E_{p(z_{1:T}|x_{1:T}, θ^old)}[log p(z_{1:T}, x_{1:T}|θ)]
                = E_p[log Π_{i=1}^N ( p(z_{i,1}|π) Π_{t=2}^{T_i} p(z_{i,t}|z_{i,t−1}, A) Π_{t=1}^{T_i} p(x_{i,t}|z_{i,t}, B) )]
                = E_p[ Σ_{i=1}^N Σ_{k=1}^K I[z_{i,1} = k] log π_k
                     + Σ_{i=1}^N Σ_{t=2}^{T_i} Σ_{j=1}^K Σ_{k=1}^K I[z_{i,t} = k, z_{i,t−1} = j] log A(j, k)
                     + Σ_{i=1}^N Σ_{t=1}^{T_i} Σ_{k=1}^K I[z_{i,t} = k] log p(x_{i,t}|z_{i,t} = k, B) ]

Further, we have 17.98, 17.99 and 17.100; using the definition of expectation
yields 17.97.

17.2 Two filter approach to smoothing in HMMs


For r_t(i) = p(z_t = i|x_{t+1:T}), we have:

    p(z_t = i|x_{t+1:T}) = Σ_j p(z_t = i, z_{t+1} = j|x_{t+1:T})
                         = Σ_j p(z_{t+1} = j|x_{t+1:T}) p(z_t = i|z_{t+1} = j, x_{t+1:T})
                         = Σ_j p(z_{t+1} = j|x_{t+1:T}) p(z_t = i|z_{t+1} = j)
                         = Σ_j p(z_{t+1} = j|x_{t+1:T}) Ψ⁻(j, i)

where Ψ⁻ denotes the transition matrix in the reverse direction. We further
have:

    p(z_{t+1} = j|x_{t+1:T}) = p(z_{t+1} = j|x_{t+1}, x_{t+2:T})
                             ∝ p(z_{t+1} = j, x_{t+1}, x_{t+2:T})
                             = p(x_{t+2:T}) p(z_{t+1} = j|x_{t+2:T}) p(x_{t+1}|z_{t+1} = j, x_{t+2:T})
                             ∝ r_{t+1}(j) φ_{t+1}(j)

Therefore we can calculate r_t(i) recursively:

    r_t(i) ∝ Σ_j r_{t+1}(j) φ_{t+1}(j) Ψ⁻(j, i)

and the recursion is initialized with the prior marginal p(z_T = i).
To rewrite γ_t(i) in terms of the new factors:

    γ_t(i) ∝ p(z_t = i|x_{1:T})
           ∝ p(z_t = i, x_{1:T})
           = p(z_t = i) p(x_{1:T}|z_t = i)
           = p(z_t = i) p(x_{1:t}|z_t = i) p(x_{t+1:T}|z_t = i, x_{1:t})
           = p(z_t = i) p(x_{1:t}|z_t = i) p(x_{t+1:T}|z_t = i)
           = (1/p(z_t = i)) p(x_{1:t}, z_t = i) p(x_{t+1:T}, z_t = i)
           ∝ (1/p(z_t = i)) p(z_t = i|x_{1:t}) p(z_t = i|x_{t+1:T})
           = α_t(i) · r_t(i) / p(z_t = i)

17.3 EM for HMMs with mixture of Gaussian observations

Using a mixture of Gaussians as the emission distribution does not change the
evaluation of γ and ξ, hence the E-step is the same as in exercise 17.1.
As long as A and B are estimated independently, we now focus on estimating
B = (π, µ, Σ) during the M-step; the relevant target function is:

    Σ_{k=1}^K Σ_{i=1}^N Σ_{t=1}^{T_i} γ_{i,t}(k) log p(x_{i,t}|B)

Since the parameters are independent across k, we consider the case where k is
fixed. We also re-index the double iteration over i = 1..N and t = 1..T_i by
n = 1..T with T = Σ_{i=1}^N T_i; the log-likelihood now takes the form:

    Σ_{n=1}^T γ_n(k) log p(x_n|π_k, µ_k, Σ_k)

This can be seen as a weighted log-likelihood of a mixture of Gaussians; assume
the mixture contains C (it should be C_k, but this notation causes no
contradiction once k is fixed) Gaussians. We apply another EM procedure inside
the M-step of this HMM. Denote the latent variable corresponding to x_n by
h_{n,k}. Estimating the distribution p(h_{n,k}|x_n, π_k, µ_k, Σ_k) is
tantamount to the E-step used for an ordinary mixture of Gaussians; denote the
expectations of h_{n,k}'s components by γ'_{c,n}(k).
Now apply the M-step of the mixture of Gaussians; recall that the auxiliary
function takes the form:

    Σ_{n=1}^T γ_n(k) Σ_{c=1}^C γ'_{c,n}(k) {log π_{k,c} + log N(x_n|µ_{k,c}, Σ_{k,c})}

Hence this HMM reweights an ordinary mixture of Gaussians, with the weight
changed from γ'_{c,n}(k) to γ_n(k) · γ'_{c,n}(k). The rest of the estimation is
simply the application of the M-step of a mixture of Gaussians with the new
weights.

17.4 EM for HMMs with tied mixtures


Recall the conclusion from exercise 17.3; the inner M-step now takes the form:

    Σ_{k=1}^K Σ_{n=1}^T Σ_{c=1}^C γ_{c,n}(k) {log π_{k,c} + log N(x_n|µ_c, Σ_c)}

where we update the meaning of γ accordingly, and we remove k from the
subscripts of µ and Σ given the conditions of this exercise (tied mixtures).
It is easy to notice that this target function again takes the form of the
M-step target for an ordinary mixture of Gaussians. Treating each k
independently and updating π_k gives the learning process for the K sets of
mixing weights; summing out k, the C shared Gaussian parameters can be updated.

18 State space models


18.1 Derivation of EM for LG-SSM
We directly work on the auxiliary function:

    Q(θ, θ^old) = E_{p(Z|Y,θ^old)}[log Π_{n=1}^N p(z_{n,1:T_n}, y_{n,1:T_n}|θ)]
                = E[Σ_{n=1}^N (log p(z_{n,1}) + Σ_{i=2}^{T_n} log p(z_{n,i}|z_{n,i−1}) + Σ_{i=1}^{T_n} log p(y_{n,i}|z_{n,i}))]
                = E[Σ_{n=1}^N (log N(z_{n,1}|µ_0, Σ_0) + Σ_{i=2}^{T_n} log N(z_{n,i}|A_i z_{n,i−1} + B_i u_i, Q_i)
                             + Σ_{i=1}^{T_n} log N(y_{n,i}|C_i z_{n,i} + D_i u_i, R_i))]
                = E[ N log(1/|Σ_0|^{1/2}) − (1/2) Σ_{n=1}^N (z_{n,1} − µ_0)^T Σ_0^{-1} (z_{n,1} − µ_0)
                   + Σ_{i=2}^T N_i log(1/|Q_i|^{1/2})
                     − (1/2) Σ_{n=1}^{N_i} (z_{n,i} − A_i z_{n,i−1} − B_i u_i)^T Q_i^{-1} (z_{n,i} − A_i z_{n,i−1} − B_i u_i)
                   + Σ_{i=1}^T N_i log(1/|R_i|^{1/2})
                     − (1/2) Σ_{n=1}^{N_i} (y_{n,i} − C_i z_{n,i} − D_i u_i)^T R_i^{-1} (y_{n,i} − C_i z_{n,i} − D_i u_i) ] + const

When exchanging the order of summation over the data, T = max_n {T_n} and N_i
denotes the number of sequences whose length is at least i.
To estimate µ_0, take the related terms:

    E[−(1/2) Σ_{n=1}^N (z_{n,1} − µ_0)^T Σ_0^{-1} (z_{n,1} − µ_0)]

Take the derivative w.r.t. µ_0 of:

    E[ Σ_{n=1}^N (−(1/2) µ_0^T Σ_0^{-1} µ_0 + z_{n,1}^T Σ_0^{-1} µ_0) ]

Setting it to zero yields:

    µ_0 = (1/N) Σ_{n=1}^N E[z_{n,1}]

It is obvious that this estimation is similar to that for an MVN with x_n
replaced by E[z_{n,1}]. This similarity works for the other parameters as well;
for example, estimating Σ_0 is tantamount to estimating the covariance of an
MVN with the data terms replaced.
The same analysis works for Q_i and R_i as well. To estimate the coefficient
matrices, consider A_i first. The related term is:

    E[ Σ_{n=1}^{N_i} (z_{n,i−1}^T A_i^T Q_i^{-1} A_i z_{n,i−1} − 2 z_{n,i−1}^T A_i^T Q_i^{-1} (z_{n,i} − B_i u_i)) ]

Setting the derivative to zero yields a solution similar to that for µ_0; the
same analysis can be applied to B_i, C_i and D_i as well.

18.2 Seasonal LG-SSM model in standard form


From Fig. 18.6(a), we have:

    A = [ 1        1        0        0^T_{S−1}
          0        1        0        0^T_{S−1}
          0        0        1        0^T_{S−1}
          0_{S−1}  0_{S−1}  I        0_{S−1}  ]

    Q = [ Q_a  0^T_{S+1}
          0    Q_b  0^T_S
          0    0    Q    0^T_{S−1}
          0_{(S−1)×(S+2)}           ]

    C = [ 1  1  1  0^T_{S−1} ]

where 0_n denotes a column vector of zeros of length n, and 0_{m×n} denotes an
m × n matrix of zeros.

19 Undirected graphical models (Markov random fields)
19.1 Derivation of the log partition function
According to the definition:

    Z(θ) = Σ_y Π_{c∈C} ψ_c(y_c|θ_c)

It is straightforward to give:

    ∂ log Z(θ)/∂θ_{c'} = ∂/∂θ_{c'} log Σ_y Π_{c∈C} ψ_c(y_c|θ_c)
                       = (1/Z(θ)) Σ_y ∂/∂θ_{c'} Π_{c∈C} ψ_c(y_c|θ_c)
                       = (1/Z(θ)) Σ_y Π_{c≠c'} ψ_c(y_c|θ_c) · ∂/∂θ_{c'} ψ_{c'}(y_{c'}|θ_{c'})
                       = (1/Z(θ)) Σ_y Π_{c≠c'} ψ_c(y_c|θ_c) · ∂/∂θ_{c'} exp{θ_{c'}^T φ_{c'}(y_{c'})}
                       = (1/Z(θ)) Σ_y Π_{c∈C} ψ_c(y_c|θ_c) φ_{c'}(y_{c'})
                       = Σ_y φ_{c'}(y_{c'}) (1/Z(θ)) Π_{c∈C} ψ_c(y_c|θ)
                       = Σ_y φ_{c'}(y_{c'}) p(y|θ)
                       = E[φ_{c'}(y_{c'})|θ]

19.2 CI properties of Gaussian graphical models


Problem a:
We have:

    Σ = [ 0.75  0.5   0.25
          0.5   1.0   0.5
          0.25  0.5   0.75 ]

and:

    Λ = Σ^{-1} = [  2  −1   0
                   −1   2  −1
                    0  −1   2 ]

Since Λ_{1,3} = 0, we have the conditional independence X1 ⊥ X3 | X2, which
corresponds to the chain MRF:

    X1 — X2 — X3

Problem b: The inverse of Σ contains no zero element, hence there is no
conditional independence. Therefore there have to be edges between every pair
of vertices, i.e. the fully connected MRF over X1, X2 and X3.
This model, however, cannot encode the marginal independence X1 ⊥ X3. It is
possible to model this set of properties with a Bayesian network with the two
directed edges X1 → X2 and X3 → X2.
Problem c: Consider the terms inside the exponential:

    −(1/2)[ x1² + (x2 − x1)² + (x3 − x2)² ]

It is easy to see that the precision and covariance matrices are:

    Λ = [  2  −1   0          Σ = [ 1  1  1
          −1   2  −1                1  2  2
           0  −1   1 ],             1  2  3 ]

Problem d: The only independence is X1 ⊥ X3 | X2, i.e. the chain:

    X1 — X2 — X3

19.3 Independencies in Gaussian graphical models


Problems a and b:
This PGM implies X1 ⊥ X3 | X2, hence we are looking for a precision matrix with
Λ_{1,3} = 0; thus C and D meet the condition. On the other hand,
(A^{-1})_{1,3} = (B^{-1})_{1,3} = 0, so A and B are candidates for the
covariance matrix.
Problems c and d:
This PGM says that X1 ⊥ X3. Hence C and D can be covariance matrices, and A and
B can be precision matrices.
The only possible PGM is the chain:

    X1 — X2 — X3

Problem e:
The answer can be derived directly from the marginalization property of the
Gaussian: A is true while B is not.

19.4 Cost of training MRFs and CRFs


The answers are, respectively:

    O(r(N c + 1))

and

    O(r(N c + N))

19.5 Full conditional in an Ising model


Straightforwardly (we omit θ from the conditioning w.l.o.g.):

    p(x_k = 1|x_{−k}) = p(x_k = 1, x_{−k}) / p(x_{−k})
                      = p(x_k = 1, x_{−k}) / (p(x_k = 0, x_{−k}) + p(x_k = 1, x_{−k}))
                      = 1 / (1 + p(x_k = 0, x_{−k})/p(x_k = 1, x_{−k}))
                      = 1 / (1 + exp(h_k · 0) Π_{<k,i>} exp(J_{k,i} · 0 · x_i) / [exp(h_k · 1) Π_{<k,i>} exp(J_{k,i} · 1 · x_i)])
                      = σ(h_k + Σ_{i≠k} J_{k,i} x_i)

When the states are instead encoded as x ∈ {−1, 1}, the full conditional
becomes:

    p(x_k = 1|x_{−k}) = σ(2 (h_k + Σ_{i≠k} J_{k,i} x_i))
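The full conditional derived above is exactly what a Gibbs sampler for the
Ising model uses. Below is a minimal sketch for a 2D grid with x in {0, 1}
(grid size, coupling J and field h are arbitrary choices of mine).

# Gibbs sampling for a grid Ising model using the full conditional above.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_ising(J=1.0, h=0.0, size=20, n_sweeps=200, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.integers(0, 2, size=(size, size))
    for _ in range(n_sweeps):
        for i in range(size):
            for j in range(size):
                # sum over the 4 grid neighbours (free boundary)
                s = sum(x[a, b] for a, b in [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]
                        if 0 <= a < size and 0 <= b < size)
                x[i, j] = rng.random() < sigmoid(h + J * s)
    return x

print(gibbs_ising().mean())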

20 Exact inference for graphical models


20.1 Variable elimination
The figure required by this exercise is missing here.

20.2 Gaussian times Gaussian is Gaussian


We have:

    N(x|µ_1, λ_1^{-1}) × N(x|µ_2, λ_2^{-1})
    = (√(λ_1 λ_2)/(2π)) exp{ −(λ_1/2)(x − µ_1)² − (λ_2/2)(x − µ_2)² }
    = (√(λ_1 λ_2)/(2π)) exp{ −((λ_1 + λ_2)/2) x² + (λ_1 µ_1 + λ_2 µ_2) x − (λ_1 µ_1² + λ_2 µ_2²)/2 }

By completing the square:

    exp{ −((λ_1 + λ_2)/2) x² + (λ_1 µ_1 + λ_2 µ_2) x − (λ_1 µ_1² + λ_2 µ_2²)/2 }
    = c · exp{ −(λ/2)(x − µ)² }

where:

    λ = λ_1 + λ_2

    µ = λ^{-1}(λ_1 µ_1 + λ_2 µ_2)

The constant factor c can be obtained by collecting the constant terms inside
the exponential.
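The precision-weighted product rule can be checked numerically; the sketch
below (arbitrary parameter values, names mine) verifies that the pointwise
product of the two densities is proportional to N(x|µ, λ^{-1}).

# Product of two 1D Gaussians has precision lambda1 + lambda2.
import numpy as np
from scipy.stats import norm

mu1, lam1, mu2, lam2 = 0.0, 2.0, 3.0, 0.5
lam = lam1 + lam2
mu = (lam1 * mu1 + lam2 * mu2) / lam

x = np.linspace(-5, 8, 400)
prod = norm.pdf(x, mu1, lam1 ** -0.5) * norm.pdf(x, mu2, lam2 ** -0.5)
ratio = prod / norm.pdf(x, mu, lam ** -0.5)
assert np.allclose(ratio, ratio[0])    # constant ratio => same Gaussian shape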

20.3 Message passing on a tree


Problem a:
It is easy to see after variable elimination:

    p(X2 = 50) = Σ_{G1} Σ_{G2} p(G1) p(G2|G1) p(X2 = 50|G2)

    p(G1 = 1, X2 = 50) = p(G1 = 1) Σ_{G2} p(G2|G1 = 1) p(X2 = 50|G2)

Thus:

    p(G1 = 1|X2 = 50) = (0.45 + 0.05 · exp(−5)) / (0.5 + 0.5 · exp(−5)) ≈ 0.9

Problem b (here X denotes X2 or X3):

    p(G1 = 1|X2 = 50, X3 = 50)
    = p(G1 = 1, X2 = 50, X3 = 50) / p(X2 = 50, X3 = 50)
    = p(G1 = 1) p(X2|G1 = 1) p(X3|G1 = 1)
      / [p(G1 = 0) p(X2|G1 = 0) p(X3|G1 = 0) + p(G1 = 1) p(X2|G1 = 1) p(X3|G1 = 1)]
    = p(X = 50|G1 = 1)² / [p(X = 50|G1 = 0)² + p(X = 50|G1 = 1)²]
    ≈ 0.9² / (0.1² + 0.9²) ≈ 0.99

The extra evidence makes the belief in G1 = 1 firmer.
Problem c:
The answer to problem c is symmetric to that of problem b:
p(G1 = 0|X2 = 60, X3 = 60) ≈ 0.99.
Problem d:
Using the same pattern of analysis as in problem b, we have:

    p(G1 = 1|X2 = 50, X3 = 60)
    = p(X = 50|G1 = 1) p(X = 60|G1 = 1)
      / [p(X = 50|G1 = 0) p(X = 60|G1 = 0) + p(X = 50|G1 = 1) p(X = 60|G1 = 1)]

Notice that:

    p(X = 50|G1 = 1) = p(X = 60|G1 = 0)

    p(X = 50|G1 = 0) = p(X = 60|G1 = 1)

Hence:

    p(G1 = 1|X2 = 50, X3 = 60) = 0.5

In this case, X2 and X3 are equally strong pieces of evidence and their effects
balance out, so they do not provide enough information to distort the prior
knowledge.

20.4 Inference in 2D lattice MRFs


Please refer to "Probabilistic Graphical Models: Principles and Techniques",
section 11.4.1.

21 Variational inference
21.1 Laplace approximation to p(µ, log σ|D) for a univariate Gaussian

The Laplace approximation amounts to representing f(µ, l) = log p(µ, l = log σ|D)
by a second-order Taylor expansion. We have:

    log p(µ, l|D) = log p(µ, l, D) − log p(D)
                  = log p(µ, l) + log p(D|µ, l) + c
                  = log p(D|µ, l) + c
                  = Σ_{n=1}^N log[ (1/√(2πσ²)) exp{−(y_n − µ)²/(2σ²)} ] + c
                  = −N log σ − Σ_{n=1}^N (y_n − µ)²/(2σ²) + c
                  = −N · l − (1/(2 exp{2l})) Σ_{n=1}^N (y_n − µ)² + c

Thus we derive:

    ∂ log p(µ, l|D)/∂µ = (1/exp{2l}) Σ_{n=1}^N (y_n − µ) = (N/σ²)(ȳ − µ)

    ∂ log p(µ, l|D)/∂l = −N + (1/σ²) Σ_{n=1}^N (y_n − µ)²

    ∂² log p(µ, l|D)/∂µ² = −N/σ²

    ∂² log p(µ, l|D)/∂l² = −(2/σ²) Σ_{n=1}^N (y_n − µ)²

    ∂² log p(µ, l|D)/∂µ∂l = −(2N/σ²)(ȳ − µ)

The approximation is p(µ, l|D) ≈ N((µ, l)|m, Σ), where m = (µ̂, l̂) is the mode
obtained by setting the first derivatives to zero, and

    Σ = −H^{-1},   H = [ ∂² log p/∂µ²    ∂² log p/∂µ∂l
                         ∂² log p/∂µ∂l   ∂² log p/∂l²  ]

evaluated at the mode.
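Putting the derivatives above together, the Laplace approximation can be
computed in closed form at the mode. A minimal numeric sketch (synthetic data,
names mine) follows.

# Laplace approximation for (mu, l = log sigma) of a univariate Gaussian.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=100)
N, ybar = len(y), y.mean()

mu_hat = ybar                                  # from d/dmu = 0
sigma2_hat = np.mean((y - mu_hat) ** 2)        # from d/dl = 0
l_hat = 0.5 * np.log(sigma2_hat)

S = np.sum((y - mu_hat) ** 2)
H = np.array([[-N / sigma2_hat, -2 * N * (ybar - mu_hat) / sigma2_hat],
              [-2 * N * (ybar - mu_hat) / sigma2_hat, -2 * S / sigma2_hat]])
Sigma = -np.linalg.inv(H)                      # posterior covariance of (mu, l)
print((mu_hat, l_hat), Sigma)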

21.2 Laplace approximation to normal-gamma


This is the same as exercise 21.1 when the prior is uninformative. We formally
substitute:

    Σ_{n=1}^N (y_n − µ)² = Σ_{n=1}^N ((y_n − ȳ) − (µ − ȳ))²
                         = Σ_{n=1}^N (y_n − ȳ)² + N(µ − ȳ)² − 2(µ − ȳ) Σ_{n=1}^N (y_n − ȳ)
                         = N s² + N(µ − ȳ)²

where s² = (1/N) Σ_{n=1}^N (y_n − ȳ)².
The conclusions for problems a, b and c are all contained in the previous
solution.

21.3 Variational lower bound for VB for univariate Gaussian

What is left in section 21.5.1.6 is the derivation of 21.86 to 21.91. We omit
the derivation of the entropy and moments of a Gaussian, which can be found in
any information theory textbook. We now derive E[ln x | x ~ Ga(a, b)], which in
turn yields the entropy of a Gamma distribution.
The Gamma distribution is an exponential family distribution:

    Ga(x|a, b) = (b^a/Γ(a)) x^{a−1} exp{−b x}
               ∝ exp{−b x + (a − 1) ln x}
               = exp{φ(x)^T θ}

The sufficient statistics are φ(x) = (x, ln x)^T and the natural parameters are
θ = (−b, a − 1)^T. Thus the Gamma distribution can be seen as the maximum
entropy distribution under constraints on x and ln x.
The cumulant function is given by:

    A(θ) = log Z(θ) = log(Γ(a)/b^a) = log Γ(a) − a log b

The expectation of a sufficient statistic is given by the derivative of the
cumulant function with respect to the corresponding natural parameter,
therefore:

    E[ln x] = ∂A/∂(a − 1) = Γ'(a)/Γ(a) − log b

According to the definition ψ(a) = Γ'(a)/Γ(a):

    E[ln x] = ψ(a) − log b

The remaining derivations are either already complete or trivial.

21.4 Variational lower bound for VB for GMMs


The lower bound is given by:

    E_q[log(p(θ, D)/q(θ))] = E_q[log p(θ, D)] − E_q[log q(θ)]
                           = E_q[log p(D|θ)] + E_q[log p(θ)] − E_q[log q(θ)]
                           = E[log p(x|z, µ, Λ, π)] + E[log p(z, µ, Λ, π)] − E[log q(z, µ, Λ, π)]
                           = E[log p(x|z, µ, Λ, π)] + E[log p(z|π)] + E[log p(π)] + E[log p(µ, Λ)]
                             − E[log q(z)] − E[log q(π)] − E[log q(µ, Λ)]

We now show 21.209 to 21.215.
For 21.209:

    E[log p(x|z, µ, Λ)] = E_{q(z)q(µ,Λ)}[log p(x|z, µ, Λ)]
                        = Σ_n Σ_k E_{q(z)q(µ,Λ)}[ z_nk (−(D/2) log 2π + (1/2) log|Λ_k| − (1/2)(x_n − µ_k)^T Λ_k (x_n − µ_k)) ]

Using 21.132 and rewriting the sums in terms of the cluster averages x̄_k yields
the stated result.
For 21.210:

    E[log p(z|π)] = E_{q(z)q(π)}[log p(z|π)]
                  = E_{q(z)q(π)}[log Π_{n=1}^N Π_{k=1}^K π_k^{z_nk}]
                  = Σ_{n=1}^N Σ_{k=1}^K E_{q(z)q(π)}[z_nk log π_k]
                  = Σ_{n=1}^N Σ_{k=1}^K E_{q(z)}[z_nk] E_{q(π)}[log π_k]
                  = Σ_{n=1}^N Σ_{k=1}^K r_nk log π̄_k

For 21.211:

    E[log p(π)] = E_{q(π)}[log p(π)]
                = E_{q(π)}[log(C · Π_{k=1}^K π_k^{α_0−1})]
                = ln C + (α_0 − 1) Σ_{k=1}^K log π̄_k

For 21.212:

    E[log p(µ, Λ)] = E_{q(µ,Λ)}[log p(µ, Λ)]
                   = E_{q(µ,Λ)}[log Π_{k=1}^K Wi(Λ_k|L_0, v_0) · N(µ_k|m_0, (β_0 Λ_k)^{-1})]
                   = Σ_{k=1}^K E_{q(µ,Λ)}[ log C + ((v_0 − D − 1)/2) log|Λ_k| − (1/2) tr(Λ_k L_0^{-1})
                       − (D/2) log 2π + (1/2) log|β_0 Λ_k| − (1/2)(µ_k − m_0)^T (β_0 Λ_k)(µ_k − m_0) ]

Using 21.131 to expand the expected value of the quadratic form, and using the
fact that the mean of a Wishart distribution is v_k L_k, we are done.
For 21.213:

    E[log q(z)] = E_{q(z)}[log q(z)]
                = E_{q(z)}[Σ_i Σ_k z_ik log r_ik]
                = Σ_i Σ_k E_{q(z)}[z_ik] log r_ik
                = Σ_i Σ_k r_ik log r_ik

For 21.214:

    E[log q(π)] = E_{q(π)}[log q(π)]
                = E_{q(π)}[log C + Σ_{k=1}^K (α_k − 1) log π_k]
                = log C + Σ_k (α_k − 1) log π̄_k

For 21.215:

    E[log q(µ, Λ)] = E_{q(µ,Λ)}[log q(µ, Λ)]
                   = Σ_k E_{q(µ,Λ)}[ log q(Λ_k) − (D/2) log 2π + (1/2) log|β_k Λ_k|
                       − (1/2)(µ_k − m_k)^T (β_k Λ_k)(µ_k − m_k) ]

Using 21.132 to expand the quadratic form gives
E[(µ_k − m_k)^T (β_k Λ_k)(µ_k − m_k)] = D.

21.5 Derivation of E[log πk] under a Dirichlet distribution

The Dirichlet distribution is an exponential family distribution; we have:

    φ(π) = (log π_1, log π_2, ..., log π_K)

    θ = α

The cumulant function is:

    A(α) = log B(α) = Σ_{i=1}^K log Γ(α_i) − log Γ(Σ_{i=1}^K α_i)

And:

    E[log π_k] = ∂A(α)/∂α_k = Γ'(α_k)/Γ(α_k) − Γ'(Σ_{i=1}^K α_i)/Γ(Σ_{i=1}^K α_i) = ψ(α_k) − ψ(Σ_{i=1}^K α_i)

Taking the exponential of both sides:

    exp(E[log π_k]) = exp(ψ(α_k) − ψ(Σ_{i=1}^K α_i)) = exp(ψ(α_k)) / exp(ψ(Σ_{i=1}^K α_i))
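The identity E[log π_k] = ψ(α_k) − ψ(Σ_i α_i) is easy to confirm by Monte
Carlo; a small sketch (α chosen arbitrarily) follows.

# Monte Carlo check of E[log pi_k] under a Dirichlet distribution.
import numpy as np
from scipy.special import digamma

alpha = np.array([2.0, 0.5, 3.0])
analytic = digamma(alpha) - digamma(alpha.sum())

rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=500_000)
empirical = np.log(samples).mean(axis=0)
assert np.allclose(analytic, empirical, atol=0.01)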

21.6 Alternative derivation of the mean field updates for the Ising model

This is no different from applying the procedure in section 21.3.1 before
deriving the updates, hence omitted.

21.7 Forwards vs reverse KL divergence


We have:

    KL(p(x, y)||q(x, y)) = E_{p(x,y)}[log(p(x, y)/q(x, y))]
                         = Σ_{x,y} p(x, y) log p(x, y) − Σ_{x,y} p(x, y) log q(x) − Σ_{x,y} p(x, y) log q(y)
                         = −H(p(x, y)) − Σ_x p(x) log q(x) − Σ_y p(y) log q(y)
                         = −H(p(x, y)) + H(p(x)) + H(p(y)) + KL(p(x)||q(x)) + KL(p(y)||q(y))
                         = constant + KL(p(x)||q(x)) + KL(p(y)||q(y))

Thus the optimal approximation is q(x) = p(x) and q(y) = p(y).
We skip the practical part.

21.8 Derivation of the structured mean field updates for FHMM

According to the general result for mean-field variational methods, we have:

    E(x_m) = E_{q/m}[E(p̄(x_m))]

Thus:

    −Σ_{t=1}^T Σ_{k=1}^K x_{t,m,k} ε̃_{t,m,k}
        = E[ (1/2) Σ_{t=1}^T (y_t − Σ_{l≠m} W_l x_{t,l} − W_m x_{t,m})^T Σ^{-1} (y_t − Σ_{l≠m} W_l x_{t,l} − W_m x_{t,m}) ] + C

Comparing the coefficient of x_{t,m,k} (i.e. setting x_{t,m,k} to 1) ends in:

    ε̃_{t,m,k} = ( W_m^T Σ^{-1} (y_t − Σ_{l≠m} W_l E[x_{t,l}]) )_k − (1/2)(W_m^T Σ^{-1} W_m)_{k,k}

Writing this in matrix form yields 21.62.

21.9 Variational EM for binary FA with sigmoid link


Refer to ”Probabilistic Visualisation of High-Dimensional Binary Data,
Tipping, 1998”.

21.10 VB for binary FA with probit link


The major difference when using a probit link is the discontinuous likelihood
caused by p(y_i = 1|z_i) = I(z_i > 0). In the setting where X is hidden, we
assume Gaussian priors on X, W and Z. The approximation takes the form:

    q(X, Z, W) = Π_{l=1}^L q(w_l) Π_{i=1}^N q(x_i) q(z_i)

This is a mean-field approximation, hence in an algorithm similar to EM we
update the distributions of X, Z and W stepwise.
For the variable X, we have:

    log q(x_i) = E_{q(z_i)q(w)}[log p(x_i, w, z_i, y_i)]
               = E_{q(z_i)q(w)}[log p(x_i) + log p(w) + log p(z_i|x_i, w) + log p(y_i|z_i)]

Given the form of the likelihood, for i corresponding to y_i = 1, q(z_i) has to
be a truncated Gaussian, i.e. we only consider expectations of the form
E[z|z > µ] and E[z²|z > µ].

    log q(x_i) = −(1/2) x_i^T Λ_1 x_i − (1/2) E[z²] − (1/2) x_i^T E[w w^T] x_i + E[z] E[w]^T x_i + c

where Λ_1 is the precision of x_i's prior distribution, E[w w^T] can be
calculated given the Gaussian form of q(w), and the truncated expectations E[z]
and E[z²] can be obtained from the solution to exercise 11.15. It is obvious
that q(x_i) is a Gaussian.
The update for w is similar to that for x_i, as they play symmetric roles in
the likelihood; the only difference is that we have to sum over i when updating
w.
At last we update z_i:

    log q(z_i) = E_{q(x_i)q(w)}[log p(z_i|x_i, w) + log p(y_i|z_i)]

Inside the expectation we have:

    −(1/2) z_i² + E[w]^T E[x] z_i + c

Therefore q(z_i) again takes a (truncated) Gaussian form.
