Stochastic - Lecture Notes
Byron Schmuland
I returned, and saw under the sun, that the race is not to the swift, nor
the battle to the strong, neither yet bread to the wise, nor yet riches to
men of understanding, nor yet favour to men of skill; but time and chance
happeneth to them all. Ecclesiastes 9:11.
Contents
3 Optimal Stopping
  A Strategies for winning
  B Examples
  C Algorithm to find optimal strategy
  D Two variations
  E The binomial pricing model
4 Martingales
  A Conditional Expectation
  B Martingales
  C Optional sampling theorem
  D Martingale convergence theorem
6 Brownian motion
  A Basic properties
  B The reflection principle
  C The Dirichlet problem
7 Stochastic integration
  A Integration with respect to random walk
  B Integration with respect to Brownian motion
  C Ito's formula
8 Appendix
  A Strong Markov property
  B Matrix magic
  C The algorithm from section 3C
1 Finite Markov Chains
A Basic definitions
Let $(X_n)_{n=0}^{\infty}$ be a stochastic process taking values in a state space S that has N states. To understand the behaviour of this process we will need to calculate probabilities like
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n).$$
P(X0 = R, X1 = R, X2 = B)
Definition A-1 The process $(X_n)_{n=0}^{\infty}$ is called a Markov chain if, for any n and any collection of states $i_0, i_1, \ldots, i_{n+1}$, we have
$$P(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$
For a Markov chain the future depends only on the current state and not on past history.
From now on, we will assume that all our Markov chains are time homogeneous. The probabilities for a Markov chain are computed using the initial probabilities $\phi_0(i) = P(X_0 = i)$ and the transition probabilities $p(i, j)$:
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \phi_0(i_0)\, p(i_0, i_1) \cdots p(i_{n-1}, i_n).$$
Intuitively, this says that a Markov chain run until time n + m is the same
as the chain stopped at time n, then started anew with initial state Xn and
run for the remaining m time periods.
Definition A-3 The transition matrix for the Markov chain $(X_n)_{n=0}^{\infty}$ is the $N \times N$ matrix P whose (i, j)th entry is p(i, j).
For example, for the B/R chain above,
$$P = \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix},$$
with rows and columns indexed by B and R.

Every transition matrix satisfies $0 \le p(i, j) \le 1$ and $\sum_{j \in S} p(i, j) = 1$. Such matrices are also called stochastic.
$$P = \begin{pmatrix} .128 & .872 \\ .663 & .337 \end{pmatrix},$$
with rows and columns indexed by vowel and consonant.
The state space is $S = \{0, 1, \ldots, N\}$, and at each time the walker jumps to the right with probability p, or to the left with probability 1 − p.
[Figure: random walk on $\{0, 1, \ldots, N\}$; the interior arrows are labelled 1 − p and p, and the boundary arrows are labelled 1.]
[Figure: Proportion of type A; simulated sample paths over 100 generations, taking values between 0.0 and 1.0.]
Here are some pictures of the first 500 steps of a gambler playing red-
black on a roulette wheel. The symmetric walk (p = 1/2) is the idealized
situation of a fair game, while the walk with drift (p = 9/19) shows the
real situation where the casino takes advantage of the green spots on the
wheel.
[Figure: two sample paths of 500 plays; the symmetric walk fluctuates around 0 (vertical scale −30 to 30), while the walk with drift p = 9/19 trends downward (vertical scale down to −80).]
The central limit theorem can help us understand the result after 500 plays of the game. In the symmetric case, it is a tossup whether you are ahead or behind after 500 plays.
[Figure: normal approximation to the distribution of winnings after 500 plays, centred at 0 in the symmetric case and near −26.31 when p = 9/19.]
B Calculating probabilities
Consider a two state Markov chain with p = 1/4 and q = 1/6 so that
$$P = \begin{pmatrix} 3/4 & 1/4 \\ 1/6 & 5/6 \end{pmatrix},$$
with rows and columns indexed by 0 and 1.
To find the probability that the process follows a certain path, you multiply
the initial probability with conditional probabilities. For example, what is
the chance that the process begins with 01010?
As a second example, let’s find the chance that the process begins with
00000.
$$P_0(X_1 = 1, X_2 = 0, X_3 = 1, X_4 = 0) = \frac{1}{4}\cdot\frac{1}{6}\cdot\frac{1}{4}\cdot\frac{1}{6} = \frac{1}{576}. \tag{1}$$
$$P_0(X_1 = 0, X_2 = 0, X_3 = 0, X_4 = 0) = \left(\frac{3}{4}\right)^4 = \frac{81}{256}. \tag{2}$$
Theorem B-1 The conditional probability $P_i(X_n = j)$ is the (i, j)th entry in the matrix $P^n$.
so that $P_0(X_4 = 0) = \frac{3245}{6912} = .46947$.
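Theorem B-1 makes such probabilities a one-line computation; a minimal sketch, assuming numpy is available:

import numpy as np

# Two-state chain with p = 1/4, q = 1/6.
P = np.array([[3/4, 1/4],
              [1/6, 5/6]])

# By Theorem B-1, the (0, 0) entry of P^4 is P_0(X_4 = 0).
P4 = np.linalg.matrix_power(P, 4)
print(P4[0, 0])   # 0.46947..., i.e. 3245/6912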
C Invariant Probabilities
Example C-1 Let’s find invariant probability vectors for some Markov
chains.
(a) Suppose that $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. An invariant probability vector $\pi = (\pi_1, \pi_2)$ must satisfy
$$(\pi_1, \pi_2) = (\pi_1, \pi_2)\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$
or, multiplying the right hand side, (π1 , π2 ) = (π2 , π1 ). This equation gives
us π1 = π2 , and since π1 + π2 = 1, we conclude that π1 = 1/2 and π2 = 1/2.
The unique invariant probability vector for P is π = (1/2, 1/2).
(b) If $P = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ is the identity matrix, then any probability vector satisfies $\pi = \pi P$.
Let's investigate the general 2 × 2 matrix $P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}$. It has eigenvalues 1 and $1 - (p+q)$. If $p + q > 0$, then P can be diagonalized as $P = QDQ^{-1}$, where
$$Q = \begin{pmatrix} 1 & -p \\ 1 & q \end{pmatrix}, \qquad D = \begin{pmatrix} 1 & 0 \\ 0 & 1-(p+q) \end{pmatrix}, \qquad Q^{-1} = \begin{pmatrix} \frac{q}{p+q} & \frac{p}{p+q} \\ \frac{-1}{p+q} & \frac{1}{p+q} \end{pmatrix}.$$
Using these matrices, it is easy to find powers of the matrix P. For example $P^2 = (QDQ^{-1})(QDQ^{-1}) = QD^2Q^{-1}$. In the same way, for every $n \ge 1$ we have
$$P^n = QD^nQ^{-1} = Q\begin{pmatrix} 1 & 0 \\ 0 & (1-(p+q))^n \end{pmatrix}Q^{-1} = \begin{pmatrix} \frac{q}{p+q} & \frac{p}{p+q} \\ \frac{q}{p+q} & \frac{p}{p+q} \end{pmatrix} + (1-(p+q))^n \begin{pmatrix} \frac{p}{p+q} & \frac{-p}{p+q} \\ \frac{-q}{p+q} & \frac{q}{p+q} \end{pmatrix}.$$
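As a worked instance of this formula, take the chain from section B, where p = 1/4 and q = 1/6, so that p + q = 5/12 and 1 − (p + q) = 7/12:
$$P^n = \begin{pmatrix} 2/5 & 3/5 \\ 2/5 & 3/5 \end{pmatrix} + \left(\frac{7}{12}\right)^n \begin{pmatrix} 3/5 & -3/5 \\ -2/5 & 2/5 \end{pmatrix} \longrightarrow \begin{pmatrix} 2/5 & 3/5 \\ 2/5 & 3/5 \end{pmatrix} \quad \text{as } n \to \infty.$$
In particular every row of the limit is the invariant probability vector (2/5, 3/5).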
The next result is valid for any Markov chain, ergodic or not.
Proof: We assume the convergence result and prove the second state-
ment. Note that any vector π that is a convex combination of rows of M
can be written π = vM for some probability vector v.
This is the expected value of the random variable representing the average
number of visits the Markov chain makes to state j during the first n time
periods. A law of large numbers type result will be used to show why this
average converges.
Let $T_j$ be the hitting time of the state j. There are two possibilities for the sequence $1_{(X_k = j)}$: if $T_j = \infty$, then it is just a sequence of zeros, and $\frac{1}{n}\sum_{k=1}^{n} 1_{(X_k = j)} = 0$. On the other hand, if $T_j < \infty$, then the history of the process up to $T_j$ is irrelevant and we may just as well start counting visits to j from time $T_j$. This leads to the equation
A more rigorous proof of this important formula can be found in the ap-
pendix.
Putting i = j above, we discover that if $P_j(T_j < \infty) < 1$, then $m_{jj} = 0$. Thus $m_{ij} = 0$ for all $i \in S$ and hence $\pi_j = 0$ for any invariant probability vector π.
Now assume that $P_j(T_j < \infty) = 1$; in this case $E_j(T_j) < \infty$ (see Theorem A-3 in the appendix). The following sample path shows the first n + 1 values of the sequence $1_{(X_k = j)}$, where we assume that the $(\ell+1)$th visit to state j occurs at time n. The random variable $T_j^s$ is defined as the time between the $(s-1)$th and sth visit. These are independent, identically distributed random variables with the same distribution as $T_j$.
[Diagram: the 0-1 sequence $1_{(X_k = j)}$ split into blocks of lengths $T_j^1, T_j^2, T_j^3, \ldots, T_j^{\ell}$, each block ending with a 1; the nth trial falls at the end of the $\ell$th block.]
We conclude that $(1/n)\sum_{k=1}^{n} P^k \to M$, where $m_{ij} = P_i(T_j < \infty)/E_j(T_j)$.
Example C-2 (a) The rat. Suppose that a rat wanders aimlessly
through the maze pictured below. If the rat always chooses one of the
available doors at random, regardless of what’s happened in the past, then
Xn = the rat’s position at time n, defines a Markov chain.
[Figure: a maze with four rooms labelled 1, 2 (top) and 3, 4 (bottom).]
(b) Random walk. Here is an example from the final exam in 2004.
[Figure: random walk on states $0, 1, \ldots, N$; the boundary arrows are labelled α and the interior arrows 1/2.]
$$2\alpha\pi_0 = \pi_1$$
$$\pi_1 - 2\alpha\pi_0 = \pi_2 - \pi_1$$
$$\pi_j - \pi_{j-1} = \pi_{j+1} - \pi_j, \quad \text{for } j = 2, \ldots, N-2$$
$$\pi_{N-1} - 2\alpha\pi_N = \pi_{N-2} - \pi_{N-1}$$
$$2\alpha\pi_N = \pi_{N-1}.$$
Adding the first two equations shows that $\pi_2 = \pi_1$, and then the middle set of equations implies that $\pi_1 = \pi_2 = \pi_3 = \cdots = \pi_{N-1}$. If $\alpha > 0$, then both $\pi_0$ and $\pi_N$ equal $\pi_1/2\alpha$. From $\sum_{j=0}^{N} \pi_j = 1$, we get the unique solution
$$\pi_0 = \pi_N = \frac{1}{2((N-1)\alpha + 1)}, \qquad \pi_j = \frac{\alpha}{(N-1)\alpha + 1}, \quad j = 1, \ldots, N-1.$$
(c) Google search engine. One of the primary reasons why Google is
such an effective search engine is the PageRank algorithm developed by
Google’s founders, Larry Page and Sergey Brin, when they were graduate
students at Stanford. An explanation of PageRank is available at
http://en.wikipedia.org/wiki/PageRank
Imagine surfing the Web, randomly choosing an outgoing link from one page
to get to the next. This can lead to dead ends at pages with no outgoing
links, or cycles around cliques of interconnected pages. So, with small
probability p, the next step is to simply choose a random page anywhere
on the whole Web. This theoretical random walk of the Web is a Markov
chain. The limiting probability that a random surfer visits any particular
page is its PageRank. A page has high rank if it has links to and from
other pages with high rank.
To illustrate, imagine a miniature Web consisting of six pages labelled A
to F , connected as below.
[Figure: a directed graph of links among the six pages A–F, as recorded in the matrix G below.]
$$G = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix},$$
with rows and columns indexed by A, B, C, D, E, F.
The transition probabilities are $p_{ij} = (1-p)g_{ij}/\sum_k g_{ik} + p/n$. With n = 6 and p = .15, we get
$$P = \frac{1}{40}\begin{pmatrix} 1 & 18 & 18 & 1 & 1 & 1 \\ 1 & 1 & 18 & 18 & 1 & 1 \\ 1 & 1 & 1 & 37/3 & 37/3 & 37/3 \\ 35 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 35 \\ 35 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}.$$
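To see the PageRank vector for this miniature Web, one can iterate πP until it stabilizes. A minimal Python sketch (numpy assumed; the iteration count is illustrative):

import numpy as np

# Link matrix G from above, rows/columns in the order A, B, C, D, E, F.
G = np.array([[0,1,1,0,0,0],
              [0,0,1,1,0,0],
              [0,0,0,1,1,1],
              [1,0,0,0,0,0],
              [0,0,0,0,0,1],
              [1,0,0,0,0,0]], dtype=float)

n, p = 6, 0.15
# p_ij = (1 - p) g_ij / sum_k g_ik + p/n
P = (1 - p) * G / G.sum(axis=1, keepdims=True) + p / n

# Power iteration: pi P^k converges to the invariant vector (the PageRanks).
pi = np.full(n, 1/n)
for _ in range(1000):
    pi = pi @ P
print(dict(zip("ABCDEF", pi.round(4))))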
D Classification of states
Note. Let $R_j^s$ be the time of the sth return to state j, so $R_j^1 = T_j^1 = T_j$. We have $P_j(R_j^s < \infty) = P_j(T_j < \infty)P_j(R_j^{s-1} < \infty)$, and by induction we prove that $P_j(R_j^s < \infty) = [P_j(T_j < \infty)]^s$. Letting $s \to \infty$, we obtain
$$P_j(\text{visit } j \text{ infinitely often}) = P_j\left(\cap_s \{R_j^s < \infty\}\right) = \begin{cases} 0 & \text{if } j \text{ is transient} \\ 1 & \text{if } j \text{ is recurrent} \end{cases}.$$
The probability of infinitely many visits to state j is either zero or one,
according as the state j is transient or recurrent.
Theorem D-1 If i and j communicate, they are either both null or both
positive.
Proof: If i is null, then $m_{ii} = 0$ and the equation $m_{ji} = P_j(T_i < \infty)\,m_{ii}$ shows that $m_{ji} = 0$ for all $j \in S$. The jth row of the matrix M is invariant for P, and hence for any power of P, so that
$$0 = m_{ji} = \sum_{k \in S} m_{jk}\, p^n_{ki} \ge m_{jj}\, p^n_{ji}.$$
All states within each communicating class are of the same type, so we will
speak of null or positive classes.
Lemma D-1 If j is recurrent and Pj (Ti < ∞) > 0, then Pi (Tj < ∞) = 1.
In particular, j communicates with i, which means you cannot escape a
recurrent class.
$$P_j(T_i < \infty) = P_j(T_i < \infty \text{ and the chain visits } j \text{ after time } T_i) = P_j(T_i < \infty)P_i(T_j < \infty).$$
The first equation is true because, starting at j, the process hits j at arbitrarily large times. The second equation comes from applying the strong Markov property at time $T_i$. If $P_j(T_i < \infty) > 0$, then we can divide it out to obtain $P_i(T_j < \infty) = 1$. □
Proof: If j is transient, the equation $m_{jj} = P_j(T_j < \infty)\,m_{jj}$ shows that $m_{jj} = 0$.
Each recurrent class $R_\ell$ forms a little Markov chain with transition matrix $P_\ell$, the restriction of P to $R_\ell$.
If i and j are in the same recurrent class $R_\ell$, then Lemma D-1 shows that $P_i(T_j < \infty) = 1$ and so $m_{ij} = m_{jj}$. That is, the rows of $M_\ell$ are identical and give the unique invariant probability vector for $P_\ell$.
Example D-1
(a) If $P = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, then there is only the one recurrent class $R_1 = \{0, 1\}$. The invariant probability must be unique and have strictly positive entries.
(b) If $P = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, then there are two recurrent classes $R_1 = \{0\}$ and $R_2 = \{1\}$. The invariant measures are $\pi = a(1, 0) + (1-a)(0, 1)$ for $0 \le a \le 1$. That is, all probability vectors!
The classes are $R_1 = \{0, 1\}$, $R_2 = \{2, 3\}$, and $T_1 = \{4\}$. The invariant measures are $\pi = a(1/4, 3/4, 0, 0, 0) + (1-a)(0, 0, 2/5, 3/5, 0)$ for $0 \le a \le 1$. None of these puts mass on the transient state.
$$\pi = a(1, 0, 0, \ldots, 0, 0) + (1-a)(0, 0, \ldots, 0, 1) = (a, 0, 0, \ldots, 0, 1-a) \quad \text{for } 0 \le a \le 1.$$
E Hitting times
Partition the state space S into two pieces D and E. We suppose that for
every starting point in D it is possible to reach the set E. We are interested
in the first transition from D to E.
[Figure: the state space S split into D and E; a sample path starts at $i \in D$ and is followed until the first visit to E.]
Let Q be the matrix of transition probabilities from the set D into itself,
and S the matrix of transition probabilities of D into E.
The row sums of (I − Q)−1 give the expected amount of time spent until
the chain hits E.
The matrix (I − Q)−1 S gives the probability distribution of the first state
hit in E.
Example E-1 (a) The rat. Recall the rat in the maze.
$$P = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 0 & 1/2 \\ 1/2 & 0 & 0 & 1/2 \\ 1/3 & 1/3 & 1/3 & 0 \end{pmatrix},$$
with rows and columns indexed by the rooms 1, 2, 3, 4.
(b) $100 or bust. Consider a random walk on the graph pictured below.
You keep moving until you hit either $100 or ruin. What is the probability
that you end up ruined?
[Figure: a five-vertex graph; you start at the bottom left corner, the top right corner is $100, and the bottom right corner is ruin.]
E consists of the states $100 and ruin, so the Q and S matrices look like:
$$Q = \begin{pmatrix} 0 & 1/3 & 1/3 \\ 1/4 & 0 & 1/4 \\ 1/3 & 1/3 & 0 \end{pmatrix}, \qquad S = \begin{pmatrix} 1/3 & 0 \\ 1/4 & 1/4 \\ 0 & 1/3 \end{pmatrix}.$$
A bit of linear algebra gives
$$(I-Q)^{-1}S = \begin{pmatrix} 11/8 & 2/3 & 5/8 \\ 1/2 & 4/3 & 1/2 \\ 5/8 & 2/3 & 11/8 \end{pmatrix}\begin{pmatrix} 1/3 & 0 \\ 1/4 & 1/4 \\ 0 & 1/3 \end{pmatrix} = \begin{pmatrix} 5/8 & 3/8 \\ 1/2 & 1/2 \\ 3/8 & 5/8 \end{pmatrix}.$$
Starting from the bottom left hand corner there is a 5/8 chance of being
ruined before hitting the money. Hey! Did you notice that if we start in
the center, then getting ruined is a 50-50 proposition? Why doesn’t this
surprise me?
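The linear algebra is easy to reproduce; a quick check with numpy (an assumption; any matrix package does the same):

import numpy as np

# Q: transitions within D; S: transitions from D into E = {$100, ruin}.
Q = np.array([[0, 1/3, 1/3],
              [1/4, 0, 1/4],
              [1/3, 1/3, 0]])
S = np.array([[1/3, 0],
              [1/4, 1/4],
              [0, 1/3]])

N = np.linalg.inv(np.eye(3) - Q)
print(N.sum(axis=1))   # expected number of steps before hitting E
print(N @ S)           # rows give [P(hit $100), P(ruin)]; last row [3/8, 5/8]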
(c) The spider and the fly. A spider performs a random walk on the
corners of a cube and eventually will catch and eat the stationary (and
rather stupid!) fly. How long on average does the hunt last?
[Figure: a cube; the spider starts at one corner and the stationary fly sits at the opposite corner.]
To begin with, it helps to squash the cube flat and label the corners to see
what is going on.
[Figure: the flattened cube; the spider's corner S is joined to corners 1, 2, 3, each of which is joined to two of the corners 4, 5, 6, which in turn are joined to the fly's corner F.]
(d) Random walk. Take the random walk on $S = \{0, 1, 2, 3, 4\}$ with absorbing boundaries.
[Figure: the walk moves right with probability p and left with probability 1 − p; states 0 and 4 are absorbing.]
$$Q = \begin{pmatrix} 0 & p & 0 \\ 1-p & 0 & p \\ 0 & 1-p & 0 \end{pmatrix} \quad\text{and}\quad S = \begin{pmatrix} 1-p & 0 \\ 0 & 0 \\ 0 & p \end{pmatrix},$$
where Q has rows and columns indexed 1, 2, 3 and S has columns indexed 0, 4. Then
$$(I-Q)^{-1} = \frac{1}{(1-p)^2 + p^2}\begin{pmatrix} (1-p)+p^2 & p & p^2 \\ 1-p & 1 & p \\ (1-p)^2 & 1-p & p^2+(1-p) \end{pmatrix},$$
$$(I-Q)^{-1}S = \frac{1}{(1-p)^2 + p^2}\begin{pmatrix} (1-p+p^2)(1-p) & p^3 \\ (1-p)^2 & p^2 \\ (1-p)^3 & (1-p+p^2)\,p \end{pmatrix}.$$
Starting from the middle state, the row sums give
$$E(\text{length of game}) = \frac{2}{(1-p)^2 + p^2}.$$
[Figure: graph of this expected length for 0 ≤ p ≤ 1, rising from 2 at the endpoints to 4 at p = 1/2.]
$$P(\text{ruin}) = \frac{(1-p)^2}{(1-p)^2 + p^2}.$$
[Figure: graph of the ruin probability for 0 ≤ p ≤ 1, falling from 1 at p = 0 through 1/2 at p = 1/2 to 0 at p = 1.]
(e) Waiting for patterns. Suppose you start tossing a fair coin and that
you will stop when the pattern HHH appears. How long on average does
this take?
We define a Markov chain where $X_n$ means the number of steps needed to complete the pattern. The state space is $S = \{0, 1, 2, 3\}$; we start in state 3 and the target is state 0. Define $D = \{1, 2, 3\}$ and $E = \{0\}$.
The state of the chain is determined by the number of Hs at the end of the
sequence. Here · · · T represents any initial sequence of tosses, including the
empty sequence, provided it doesn’t have 3 heads in a row.
· · · T, ∅: State 3
· · · T H: State 2
· · · T HH: State 1
· · · T HHH: State 0
$$Q = \begin{pmatrix} 0 & 0 & 1/2 \\ 1/2 & 0 & 1/2 \\ 0 & 1/2 & 1/2 \end{pmatrix}, \qquad (I-Q)^{-1} = \begin{pmatrix} 2 & 2 & 4 \\ 2 & 4 & 6 \\ 2 & 4 & 8 \end{pmatrix},$$
with rows and columns indexed 1, 2, 3.
We can apply the same idea to other patterns; let's take THT. Now the states are given as follows:
· · · HH, ∅: State 3
· · · HHT: State 2
· · · T T: State 2
· · · T H: State 1
· · · T HT: State 0
$$Q = \begin{pmatrix} 0 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \end{pmatrix}, \qquad (I-Q)^{-1} = \begin{pmatrix} 2 & 2 & 2 \\ 2 & 4 & 2 \\ 2 & 4 & 4 \end{pmatrix}.$$
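As a check, row sums of $(I-Q)^{-1}$ recover the expected waiting times for both patterns. A minimal numpy sketch (an assumption; the notes themselves use Maple):

import numpy as np

def expected_steps(Q):
    # Row sums of (I - Q)^{-1}: expected time to absorption from each state.
    N = np.linalg.inv(np.eye(len(Q)) - np.array(Q))
    return N.sum(axis=1)

Q_HHH = [[0, 0, .5], [.5, 0, .5], [0, .5, .5]]
Q_THT = [[0, 0, .5], [.5, .5, 0], [0, .5, .5]]
print(expected_steps(Q_HHH))  # [ 8. 12. 14.]: starting fresh, 14 tosses
print(expected_steps(Q_THT))  # [ 6.  8. 10.]: starting fresh, 10 tosses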
$$P = \begin{pmatrix} 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & 0 & 0 & \cdots & 0 & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 & 0 & \cdots & 0 & 0 \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \frac{1}{N-1} & \frac{1}{N-1} & \frac{1}{N-1} & \frac{1}{N-1} & \cdots & \frac{1}{N-1} & 0 \end{pmatrix}.$$
We are trying to hit $E = \{1\}$ and so
$$Q = \begin{pmatrix} 0 & 0 & 0 & \cdots & 0 & 0 \\ \frac{1}{2} & 0 & 0 & \cdots & 0 & 0 \\ \frac{1}{3} & \frac{1}{3} & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \frac{1}{N-1} & \frac{1}{N-1} & \frac{1}{N-1} & \cdots & \frac{1}{N-1} & 0 \end{pmatrix}.$$
A bit of experimentation with Maple will convince you that
$$(I-Q)^{-1} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 & 0 \\ \frac{1}{2} & 1 & 0 & \cdots & 0 & 0 \\ \frac{1}{2} & \frac{1}{3} & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \cdots & \frac{1}{N-1} & 1 \end{pmatrix}.$$
Taking row totals shows that Ej (T1 ) = 1 + (1/2) + (1/3) + · · · + (1/(j − 1)).
Even if we begin with the worst element, we have EN (T1 ) = 1 + (1/2) +
(1/3) + · · · + (1/(N − 1)) ≈ log(N ). It takes an average of log(N ) steps to
get the best element. The average case is much faster than the worst case
analysis might lead you to believe.
F Column vectors
and
$$p_{m+n}(x, y) = \sum_{z \in S} p_m(x, z)\, p_n(z, y).$$
Fix a state x and assume $X_0 = x$. Define the random variable $R = \sum_{n=0}^{\infty} 1_{(X_n = x)}$ which records the total number of visits to state x. There is, of course, a visit to x at time 0. From the Markov property, if we hit x after time 0, the process starts over. Thus
$$E(R)\,P(R = 1) = 1. \tag{3}$$
Take x = 0 and assume that $X_0 = 0$. Let's find $p_{2n}(0, 0)$. In order that $X_{2n} = 0$, there must have been n steps to the left and n steps to the right. The number of such paths is $\binom{2n}{n}$ and the probability of each path is $p^n(1-p)^n$ so
$$p_{2n}(0, 0) = \binom{2n}{n} p^n (1-p)^n = \frac{(2n)!}{n!\,n!}\, p^n (1-p)^n.$$
Stirling's formula says $n! \sim \sqrt{2\pi n}\,(n/e)^n$, so replacing the factorials gives us an approximate formula
$$p_{2n}(0, 0) \sim \frac{\sqrt{2\pi(2n)}\,(2n/e)^{2n}}{2\pi n\,(n/e)^{2n}}\, p^n (1-p)^n = \frac{[4p(1-p)]^n}{\sqrt{\pi n}}.$$
If p = 1/2, then $\sum_n p_{2n}(0, 0) \sim \sum_n 1/\sqrt{\pi n} = \infty$ and the walk is recurrent. But if $p \ne 1/2$, then $p_{2n}(0, 0) \to 0$ exponentially fast and $\sum_n p_{2n}(0, 0) < \infty$ so the walk is transient.
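A small numerical illustration of this dichotomy (a sketch; the divergent sum for p = 1/2 grows like $2\sqrt{n/\pi}$, so it climbs slowly):

# Partial sums of p_{2n}(0,0), using the recursion
# p_{2n} = p_{2(n-1)} * (2n-1)(2n)/n^2 * p(1-p):
for p in (0.5, 0.45):
    term, total = 1.0, 0.0
    for n in range(1, 200_001):
        term *= (2*n - 1) * (2*n) / n**2 * p * (1 - p)
        total += term
    print(p, round(total, 2))
# For p = 1/2 the partial sum keeps growing without bound; for p = 0.45
# it has essentially converged, reflecting transience.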
or as
$$p_y[a(y) - a(y+1)] = q_y[a(y-1) - a(y)].$$
Provided $p_y > 0$, we can divide to get
$$a(y) - a(y+1) = \frac{q_y}{p_y}[a(y-1) - a(y)],$$
and plugging this back into (1) and solving for a(w) gives
$$a(w) = \frac{\sum_{y=w}^{z-1} (r_{x+1} \cdots r_y)}{\sum_{y=x}^{z-1} (r_{x+1} \cdots r_y)}. \tag{2}$$
Consequences
This function is also harmonic, but satisfies the opposite boundary conditions b(x) = 0 and b(z) = 1. Equation (1) is valid for any harmonic function, so let's plug in b and multiply by −1 to get
$$b(w) = b(w) - b(x) = \sum_{y=x}^{w-1} (r_{x+1} \cdots r_y)\,[b(x+1) - b(x)]. \tag{3}$$
In particular we see that a(w) + b(w) = 1 for all $x \le w \le z$. That is, the chain must eventually hit one of the boundary points $\{x, z\}$, provided all the $p_y$'s are non-zero.
2. For $w \ge x$, define

On the other hand, if $\sum_{y=x}^{\infty} (r_{x+1} \cdots r_y) < \infty$, then
$$A(w) = \frac{\sum_{y=w}^{\infty} (r_{x+1} \cdots r_y)}{\sum_{y=x}^{\infty} (r_{x+1} \cdots r_y)}. \tag{5}$$
$$= (q \wedge p) + (1 - (p+q)) + (p \wedge q) = 1 - |p - q|.$$
Therefore $\sum_k p^k_{jj} < \infty$ implies $\sum_k p^k_{ii} < \infty$. Reversing the roles of i and j gives the result. □
D Branching processes
This is a random model for the evolution of a population over several gener-
ations. It has its origins in an 1874 paper by Francis Galton and Reverend
Henry William Watson called “On the probability of the extinction of fam-
ilies”. (see http://galton.org)
[Figure: a family tree of a branching process, starting from a single ancestor.]
y 0 1 2 3 ...
p(y) p0 p1 p2 p3 ...
Let's find the average size of generation n. Let $\mu = E(Y) = \sum_{j=0}^{\infty} j p_j$. Then
$$E(X_{n+1} \mid X_n = k) = E(Y_1 + \cdots + Y_k) = k\mu,$$
so that
$$E(X_{n+1}) = \sum_{k=0}^{\infty} E(X_{n+1} \mid X_n = k)\,P(X_n = k) = \sum_{k=0}^{\infty} k\mu\, P(X_n = k) = \mu\, E(X_n).$$
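A short simulation sketch (the offspring distribution used here is hypothetical, chosen with µ = 1.25 > 1):

import random

def generation_sizes(offspring, n_gens):
    # offspring[k] = p_k; start from a single ancestor (X_0 = 1).
    sizes = [1]
    for _ in range(n_gens):
        kids = sum(random.choices(range(len(offspring)), offspring)[0]
                   for _ in range(sizes[-1]))
        sizes.append(kids)
        if kids == 0:          # extinct: no individuals left
            break
    return sizes

# Hypothetical offspring law with mean mu = 1/4 + 2*(1/2) = 1.25:
print(generation_sizes([1/4, 1/4, 1/2], 10))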
[Figure: two diagrams of the iterates $a_0 \le a_1 \le a_2 \le a_3 \le \cdots$ in [0, 1]; in the first they approach 1, in the second they approach a limit a < 1.]
Example D-1
3 Optimal Stopping
Definition A-2 A stopping time (strategy) is a rule that tells you when to stop playing. Mathematically, it is a random variable with values in $\{0, 1, 2, \ldots\} \cup \{\infty\}$ so that $\{T = n\} \in \sigma(X_0, \ldots, X_n)$ for all $n \ge 0$. Equivalently, a stopping time satisfies $\{T \le n\} \in \sigma(X_0, \ldots, X_n)$ for all $n \ge 0$.
Example A-2
Facts about v
$$v(x) \ge E(f(X_{T^*}) \mid X_0 = x) = \sum_{y \in S} E(f(X_{T^*}) \mid X_1 = y)\, p(x, y) = \sum_{y \in S} v(y)\, p(x, y) = (Pv)(x).$$
B Examples
Example B-3 Zarin case The following excerpt is taken from What is
the Worth of Free Casino Credit? by Michael Orkin and Richard Kakigi,
published in the January 1995 issue of the American Mathematical Monthly.
Since the state zero is absorbing, we have v(0) = 0. On the other hand, v(x) > 0 = f(x) for x = 1, ..., k, so that $1, \ldots, k \notin E$. Starting at k, the optimal strategy is to keep playing until you hit 0 or N for some N > k which is to be determined. In fact, N is the smallest element in E greater than k.
We have to eliminate the possibility that N = ∞, that is, $E = \{0\}$. But the strategy $V_{\{0\}}$ gives a value function that is identically zero. As this is impossible, we know N < ∞.
The optimal strategy is $V_{\{0,N\}}$ for some N > k. Using the previous example we can calculate directly that
$$E_k\big(f(X_{V_{\{0,N\}}})\big) = (N - k)\,\frac{1 - (q/p)^k}{1 - (q/p)^N}.$$
For any choice of p and q, we choose N to maximize the right hand side. In the Zarin case, we may assume he played the "pass line" bet which gives the best odds of p = 244/495 and q = 251/495, so that q/p = 251/244. We also assume that he bets boldly, making the maximum bet of $15,000 each time. Then three million dollars equals k = 200 free units, and trial and error gives N = 235 and v(200) = 12.977 units = $194,655.
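The trial and error is easy to automate; a sketch of the search over N (the cap of 400 is an arbitrary choice):

# Maximize (N - k) * (1 - (q/p)^k) / (1 - (q/p)^N) over N > k, with k = 200.
p, q, k = 244/495, 251/495, 200
r = q / p

def value(N):
    return (N - k) * (1 - r**k) / (1 - r**N)

best = max(range(k + 1, 400), key=value)
print(best, value(best))   # 235, about 12.977 units = $194,655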
Solution: You can almost eyeball this one. The state 0 belongs to E because it's absorbing, and the state 4 belongs to E because it has the maximum f value. For a symmetric random walk, a harmonic function is linear, so the straight line from (0, 0) to (4, 10) gives the average payoff using the strategy $V_{\{0,4\}}$. Since this line is higher than the f values at 1, 2, and 3, we conclude that these states do not belong to E.
[Figure: the payoff f(x) (dots) and the value function v(x) (straight line segments) on the states 0 through 10; v lies above f and touches it at the points of E.]
On the right hand side of the picture, the state 10 belongs to E because
it’s absorbing. If we were to connect (10, 0) with any of (8, 3), (7, 4), (6, 6),
(5, 0), or (4, 10) using a straight line, we’d see that (9, 3) is above the line.
This shows that, starting at 9, not gambling is the best bet. The state 9
belongs to E.
Finally, connecting (4, 10) and (9, 3) with a straight line beats the f values
at 5,6,7, and 8, which shows that none of these belong to E. To conclude:
the optimal strategy is to play until you visit E = f0, 4, 9, 10g.
Here is the corresponding value function:
x 0 1 2 3 4 5 6 7 8 9 10
v(x) 0 2.5 5.0 7.5 10 8.6 7.2 5.8 4.4 3.0 0
Example B-5 One die game. You roll an ordinary die with outcomes {1, 2, 3, 4, 5, 6}. You can keep the value or roll again. If you roll, you can keep the new value or roll a third time. After the third roll you must stop. You win the amount showing on the die. What is the value of this game?
The variable n tells you how many rolls you have left, and this decreases
by one every time you roll. Note that the states with n = 0 are absorbing.
You can think of the state space as a tree, the chain moves forward along
the tree until it reaches the end.
The payoff function is zero at the start, and otherwise equals the number of
spots on d. At n = 0, we have v(0, d) = d, and we calculate v elsewhere by
working backwards, averaging over the next roll and taking the maximum
of that and the current value. The function v is given below in green, while
f is in red.
[Figure: the game tree, with values v in green and payoffs f in red. In the column n = 0 the value equals the face d; in the column n = 1 the value is max(d, 3.5), giving 3.5 for d = 1, 2, 3; in the column n = 2 the value is max(d, 4.25), giving 4.25 for d = 1, 2, 3, 4; and the value at the start is 4.66.]
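The backward induction in the tree reduces to a few lines of code; a minimal sketch:

# Value of the one-die game by backward induction.
v = 3.5                          # one roll left: expected face value
for _ in range(2):               # two more opportunities to roll
    v = sum(max(d, v) for d in range(1, 7)) / 6
print(v)                         # 4.6666..., matching the tree above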
Example B-6 Two dice game. The following game is played: you keep
rolling two dice until you get 7 or decide to stop. If you roll a 7, the game
is over and you win nothing. Otherwise, you stop and receive an amount
equal to the sum of the two dice. What are your expected winnings: a) if
you always stop after the first roll; b) if you play to optimize your expected
winnings?
a) Stopping at the first roll gives an expected payoff of $\sum_{y \ne 7} y\, p(y) = 35/6 = 5.8333$.
b) For this kind of Markov chain, the P operation always gives a function that is constant on all states except possibly 7. That is,
$$(Pv)(x) = \begin{cases} \sum_{y \ne 7} v(y)\,p(y) & x \ne 7 \\ v(7) & x = 7 \end{cases}$$
That is, if $x, z \ne 7$, then $(Pv)(x) = (Pv)(z)$. Now suppose that $x \notin E$ so that $(Pv)(x) = v(x) > f(x)$. Then $x \ne 7$, and if $x > z$ where $z \ne 7$, then
$$f(z) = z < x = f(x) < v(x) = (Pv)(x) = (Pv)(z) \le v(z),$$
so that z also is not in E. This implies that, except for the state 7 (which belongs to E), the set $S \setminus E$ is completely to the left of E.
Now the question becomes: where is the boundary point? From the answer to part a) we see that $v(x) \ge 35/6 > f(x)$ for x = 2, 3, 4, 5, so that they don't belong to E. The remaining possibilities are $E = \{7\} \cup \{z, z+1, \ldots, 12\}$ for z equal to either 6, 7, 8, 9, 10, 11, or 12. Let's define $T_z$ to be the stopping time when the Markov chain first hits the set $\{7\} \cup \{z, z+1, \ldots, 12\}$.
Let's analyze the strategy $T_6$ with stopping set $\{6, 7, \ldots, 12\}$. Define $A = \{6\}$ and $B = \{7, \ldots, 12\}$. Then p = 5/36 and q = 21/36 so that, using the aside, the chance of getting a six before any value greater than or equal to seven is 5/26. That is, for $x \ne 7$,
$$P_x(X_{T_6} = 6) = 5/26.$$
In the same way we also calculate
$$P_x(X_{T_6} = 7) = 6/26, \quad P_x(X_{T_6} = 8) = 5/26, \quad P_x(X_{T_6} = 9) = 4/26$$
From this we conclude that the optimal strategy is stop rolling as soon as
we get a sum of seven or greater, and the value of this strategy is 6.66666.
Example B-7
A simplified version of the “Wheel of Fortune” has four spaces: $1000,
$3000, $5000, and bankrupt. You can spin as many times as you like and
add to your fortune, but if you hit “bankrupt” you lose all your money and
the game is over. The state space is $S = \{0, 1, 2, \ldots\}$ where we count in thousands of dollars.
We will find the best strategy for playing this game. We know that $0 \in E$, since 0 is absorbing. Next for $x \ne 0$ we calculate $(Pf)(x) = [f(x+1) + f(x+3) + f(x+5)]/4 = (3x+9)/4$ to see that $(Pf)(x) > f(x)$ for $1 \le x \le 8$. Hence $1, 2, \ldots, 8 \notin E$.
On the other hand, $(Pf)(x) \le f(x)$ for $x \ge 9$. Normally this doesn't prove anything, but this is a special case since, except for jumping to 0, the Markov chain is increasing. In other words, for any starting position $x \ge 9$, we may consider the state space equal to $\{0, 9, 10, \ldots\}$. The payoff function f is superharmonic everywhere on this space, so v = f and you should never gamble.
In terms of the original chain, we conclude that $E = \{0, 9, 10, \ldots\}$, and you should keep spinning the wheel until you go bankrupt or reach $9,000 or more.
Example C-1 How much would you pay for the following financial op-
portunity? You follow a random walk on the graph and there is no payoff
except $100 at state 4, and state 5 is absorbing.
[Figure: a graph on states 1–5 with edges 1–2, 1–3, 1–4, 2–3, 2–5, 3–4, 3–5, 4–5; state 4 holds the $100 bill and state 5 is absorbing.]
In vector form, the payoff function is f = (0, 0, 0, 100, 0). (For ease of
typesetting, we will render these column vectors as row vectors, OK?) The
“P” operation takes a vector u and gives
$$Pu = \left( \frac{u(2)+u(3)+u(4)}{3},\; \frac{u(1)+u(3)+u(5)}{3},\; \frac{u(1)+u(2)+u(4)+u(5)}{4},\; \frac{u(1)+u(3)+u(5)}{3},\; u(5) \right).$$
The initial vector is $u_1 = (100, 100, 100, 100, 0)$ and $Pu_1 = (100, 200/3, 75, 200/3, 0)$. Taking the maximum of this with f puts the fourth coordinate back up to 100, giving $u_2 = (100, 200/3, 75, 100, 0)$.
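Iterating $u_n(x) = \max((Pu_{n-1})(x), f(x))$ to convergence can be done in a few lines; a sketch in Python (numpy assumed; the notes themselves use Maple):

import numpy as np

f = np.array([0, 0, 0, 100, 0], dtype=float)

def Pu(u):
    # The "P" operation for the graph of Example C-1 (states 1..5).
    return np.array([(u[1]+u[2]+u[3])/3,
                     (u[0]+u[2]+u[4])/3,
                     (u[0]+u[1]+u[3]+u[4])/4,
                     (u[0]+u[2]+u[4])/3,
                     u[4]])

u = np.full(5, 100.0); u[4] = 0.0     # u1 = (100, 100, 100, 100, 0)
for _ in range(200):
    u = np.maximum(Pu(u), f)
print(u.round(2))                     # converges to (62.5, 37.5, 50, 100, 0)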
Example C-2 What is a fair price for this financial opportunity? (Xn ) is
a random walk on the graph, only the five indicated states have nonzero
payoff, and the center is absorbing.
[Figure: a graph of thirteen states; five outer states carry the payoffs $10, $50, $100, $80, $80, the walk begins at the state marked start, and the center state is absorbing.]
Programming this problem in Maple, and running the algorithm to time
100, we get the vector
(32.42, 25.36, 50.00, 100.00, 80.00, 80.00, 19.26, 25.12, 50.00, 60.00, 53.33, 37.47, 0.00),
and find that the answer is $32.42.
D Two variations
Discounting. Suppose that money loses value over time, so that a dollar is worth only α after one time unit. Start with $u_1$ as above and put $u_n(x) = \max(\alpha (Pu_{n-1})(x), f(x))$. The value function v for the optimal strategy satisfies $v(x) = \lim_n u_n(x)$.
If you have both costs and discounting you use the formula
$$u_n(x) = \max(\alpha (Pu_{n-1})(x) - g(x),\; f(x)),$$
writing g(x) for the cost of playing one more round from x.
[Figure: the one-period binomial model. At time 0 the holdings are C, B, S (option, bond, stock); at time 1 the stock rises to uS or falls to dS, the bond grows to rB, and the option pays Cu or Cd.]
Time 0 → Time 1
Call Option
A call option gives the holder the right (but not the obligation) to buy
stock at a later time for K dollars. The value K is called the strike price.
The value of the option at time 1 is given by
$$C_u = (uS - K)^+ \qquad C_d = (dS - K)^+$$
For American options, the price is the maximum of the current payoff and the formula calculated earlier. For a call option,
$$C = \max\left( (S-K)^+,\; \frac{1}{r}\{pC_u + (1-p)C_d\} \right) = \frac{1}{r}\{pC_u + (1-p)C_d\}.$$
A call option is never exercised early, so there is no difference between
American and European call options.
A put option gives the buyer the right (but not the obligation) to sell stock for K dollars. That is,
$$P_u = (K - uS)^+ \qquad P_d = (K - dS)^+$$
and
$$P_{\text{Euro}} = \frac{1}{r}\{pP_u + (1-p)P_d\}$$
$$P_{\text{Amer}} = \max\left( (K-S)^+,\; \frac{1}{r}\{pP_u + (1-p)P_d\} \right).$$
Call Option
[Binomial tree for a call option with terminal time n = 3. Reading the numbers off the tree, the parameters are S = 100, u = 2, d = 1/2, r = 1.05, K = 150, so p = (r − d)/(u − d) = 11/30. At each node the red number is the payoff from exercising immediately and the green number is the value of the option:
time 3 (stock 800, 200, 50, 12.5): payoff = value = 650, 50, 0, 0;
time 2 (stock 400, 100, 25): payoffs 250, 0, 0; values 257.14, 17.46, 0;
time 1 (stock 200, 50): payoffs 50, 0; values 100.33, 6.10;
time 0 (stock 100): payoff 0; value C = 38.71.]
This tree explains the price of a call option with terminal time n = 3.
The red number is the payoff if the option is exercised immediately while
the green number is the value of the call option. These are calculated by
starting at the right hand side and working left, using our formula. The
end result is relatively simple, since a call option is never exercised early.
$$C = \frac{1}{r^n} \sum_{j=0}^{n} \binom{n}{j} p^j (1-p)^{n-j} \left( u^j d^{n-j} S - K \right)^+$$
[Binomial tree for an American put with terminal time n = 3 and the same parameters (S = 100, u = 2, d = 1/2, r = 1.05, K = 150). Red numbers are the immediate exercise payoffs, green numbers the option values:
time 3 (stock 800, 200, 50, 12.5): payoff = value = 0, 0, 100, 137.50;
time 2 (stock 400, 100, 25): payoffs 0, 50, 125; values 0, 60.32, 125 (exercised, boxed);
time 1 (stock 200, 50): payoffs 0, 100; values 36.38, 100 (exercised, boxed);
time 0 (stock 100): payoff 50; value 73.02.]
This tree explains the pricing of an American put option with terminal
time n = 3. The red number is the payoff if the option is exercised imme-
diately while the green number is the current value of the option. These
are calculated by starting at the right hand side and working left, using
our formula, but always taking the maximum with the result of immediate
exercise. There are boxes around the nodes where the option would be
exercised; note that two of them are early exercise.
Define a Markov chain on the tree. At each time, the chain moves forward.
It goes upward (north-east) with probability p and downwards (south-east)
with probability 1 − p. The ends are absorbing.
Let f(x) be the value of the option if exercised immediately at state x. Set
the discount rate to be 1/r.
[Diagram: a node x at time n − 1 branching up to y and down to z at time n.]
$$(Pg)(x) = p\,g(y) + (1-p)\,g(z)$$
Continuing this way, we can prove that uj+1 (x) = v(x) for states x at time
n − j.
The algorithm of starting at the far right hand side and working backwards
gives us the value function v, which gives the correct price of the option at
each state. We exercise the option at any state where v(x) = f(x).
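A sketch of this backward-induction algorithm for the American put, using the parameters read off the trees above (S = 100, u = 2, d = 1/2, r = 1.05, K = 150):

def american_put(S, K, u, d, r, n):
    p = (r - d) / (u - d)
    # values at the terminal nodes; j counts the up-moves out of n
    v = [max(K - S * u**j * d**(n - j), 0) for j in range(n + 1)]
    for m in range(n, 0, -1):
        v = [max(K - S * u**j * d**(m - 1 - j), 0,          # exercise now
                 (p * v[j + 1] + (1 - p) * v[j]) / r)       # continue
             for j in range(m)]
    return v[0]

print(american_put(100, 150, 2, 0.5, 1.05, 3))   # 73.02, as in the tree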
4 Martingales
A Conditional Expectation
Example A-1 Your friend throws a die and you have to estimate its value
X. According to the analysis above, your best bet is to guess E(X) = 3.5.
In an even more extreme case, your friend may tell you the exact result X.
In that case your estimate will be X itself.
4 MARTINGALES 58
Example A-2 Suppose you roll two fair dice and let X be the number on the first die, and Y be the total on both dice. Calculate (a) $E(Y \mid X)$ and (b) $E(X \mid Y)$.
(a)
$$E(Y \mid X)(x) = \sum_y y\, P(Y = y \mid X = x) = \sum_{w=1}^{6} (x + w)\,\frac{1}{6} = x + 3.5,$$
so that $E(Y \mid X) = X + 3.5$. The variable w in the sum above stands for the value on the second die.
(b)
$$E(X \mid Y)(y) = \sum_x x\, P(X = x \mid Y = y) = \sum_x x\, \frac{P(X = x, Y = y)}{P(Y = y)} = \sum_x x\, \frac{P(X = x, Y - X = y - x)}{P(Y = y)} = \sum_x x\, \frac{P(X = x)\,P(Y - X = y - x)}{P(Y = y)}$$
Now
$$P(X = x) = \frac{1}{6}, \quad 1 \le x \le 6,$$
$$P(Y = y) = \begin{cases} \frac{y-1}{36} & 2 \le y \le 7 \\ \frac{13-y}{36} & 8 \le y \le 12 \end{cases}$$
and
$$P(Y - X = y - x) = \frac{1}{6}, \quad y - 6 \le x \le y - 1.$$
For $2 \le y \le 7$ we get
$$E(X \mid Y)(y) = \sum_{x=1}^{y-1} x\, \frac{1/36}{(y-1)/36} = \frac{1}{y-1}\sum_{x=1}^{y-1} x = \frac{(y-1)y}{2(y-1)} = \frac{y}{2}.$$
For $7 \le y \le 12$ we get
$$E(X \mid Y)(y) = \sum_{x=y-6}^{6} x\, \frac{1/36}{(13-y)/36} = \frac{1}{13-y}\sum_{x=y-6}^{6} x = \frac{y}{2}.$$
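The answer $E(X \mid Y)(y) = y/2$ can also be checked by brute-force enumeration of the 36 equally likely outcomes; a minimal sketch:

from fractions import Fraction
from collections import defaultdict

num, den = defaultdict(int), defaultdict(int)
for x in range(1, 7):          # first die
    for w in range(1, 7):      # second die
        y = x + w
        num[y] += x            # accumulate x over outcomes with total y
        den[y] += 1            # count outcomes with total y
print({y: Fraction(num[y], den[y]) for y in sorted(num)})   # always y/2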
Properties:
1. $E(E(Y \mid \mathcal{F}_n)) = E(Y)$
Example A-5
$$E(f(X_{m+n}) \mid \mathcal{F}_m) = (P^n f)(X_m).$$
B Martingales
Proof:
$$E(M_{n+1} - M_n \mid \mathcal{F}_n) = E(X_{n+1} - \mu \mid \mathcal{F}_n) = E(X_{n+1} \mid \mathcal{F}_n) - \mu = \mu - \mu = 0.$$
$$\cdots = B_{n+1}\, E(X_{n+1} \mid \mathcal{F}_n) = B_{n+1}\, E(X_{n+1}) = 0,$$
Begin with an urn that holds two balls: one red and the other green. Draw
a ball at random, then return it with another of the same colour.
Define $X_n$ to be the number of red balls in the urn after n draws. Then $X_n$ is a time inhomogeneous Markov chain with
$$P(X_{n+1} = k+1 \mid X_n = k) = \frac{k}{n+2}, \qquad P(X_{n+1} = k \mid X_n = k) = 1 - \frac{k}{n+2}.$$
This gives
$$E(X_{n+1} \mid X_n = k) = (k+1)\frac{k}{n+2} + k\left(1 - \frac{k}{n+2}\right) = k\,\frac{n+3}{n+2},$$
so that $E(X_{n+1} \mid X_n) = X_n(n+3)/(n+2)$. From the Markov property we get
$$E(X_{n+1} \mid \mathcal{F}_n) = X_n\,\frac{n+3}{n+2},$$
and dividing we obtain
$$E\left( \frac{X_{n+1}}{(n+1)+2} \,\Big|\, \mathcal{F}_n \right) = \frac{X_n}{n+2}.$$
If we define $M_n = X_n/(n+2)$, then $(M_n)$ is a martingale. Here $M_n$ stands for the proportion of red balls in the urn after the nth draw.
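A simulation sketch of the urn; each run of $M_n$ settles down near its own random limiting proportion:

import random

def polya_proportion(n_draws):
    # One red and one green ball to start; each draw returns the ball
    # together with a new ball of the same colour.
    red, total = 1, 2
    for _ in range(n_draws):
        if random.random() < red / total:
            red += 1
        total += 1
    return red / total

print([polya_proportion(10_000) for _ in range(5)])  # a different limit each run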
$$\{\sigma \le k-1, \tau = k\} = \{\sigma \le k-1\} \cap \{\tau \le k-1\}^c \in \mathcal{F}_{k-1}.$$
Therefore
$$E\big((X_k - X_{k-1})1_{\{\sigma \le k-1,\, \tau = k\}}\big) = E\big( E(X_k - X_{k-1} \mid \mathcal{F}_{k-1})\,1_{\{\sigma \le k-1,\, \tau = k\}} \big) \le 0,$$
$$P(S_T = N) = \frac{1 - (q/p)^j}{1 - (q/p)^N} \qquad\text{and}\qquad E(T) = (p-q)^{-1}\left( N\,\frac{1 - (q/p)^j}{1 - (q/p)^N} - j \right)$$
while if p = q,
$$E(T) = \frac{j(N-j)}{p+q}.$$
[Figure: $P(S_T = N)$ as a function of the starting point j, 0 ≤ j ≤ 20, increasing from 0 to 1.]
[Figure: E(T) as a function of the starting point j, 0 ≤ j ≤ 20, rising to a maximum of 100 at j = 10.]
Example C-1 Waiting for patterns: In tossing a fair coin, how long on
average until you see the pattern HT H?
Imagine a gambler who wants to see HT H and follows the “play until you
lose” strategy: at time 1 he bets one dollar, if it is T he loses and quits,
otherwise he wins one dollar. Now he has two dollars to bet on T , if it is
H he loses and quits, otherwise he wins two more dollars. In that case, he
bets his four dollars on H, if it is T he loses and quits, otherwise he wins
four dollars and stops.
His winnings $W_n^1$ form a martingale with $W_0^1 = 0$.
Now imagine that at each time $j \ge 1$ another gambler begins and bets on the same coin tosses using the same strategy. These guys' winnings are labelled $W_n^2, W_n^3, \ldots$ Note that $W_n^j = 0$ for n < j.
Define $W_n = \sum_{j=1}^{\infty} W_n^j$, the total winnings at time n, and let T be the first time the pattern is completed. By optional sampling $E(W_T) = E(W_0) = 0$.
From the casino’s point of view this means that the average income equals
the average payout.
Income: $1 $1 $1 $1 ··· $1 $1 $1 $1
Coin tosses: H T T H ··· H H T H
Payout: $0 $0 $0 $0 ··· $0 $8 $0 $2
Examining this diagram, we see that the total income is T dollars, while
the total payout is 8 + 2 = 10 dollars, and conclude that E(T ) = 10.
Fortunately, you don’t need to go through the whole analysis every time
you solve one of these problems, just figure out how much the casino has to
pay out. For instance, if the desired pattern is HHH, then the casino pays
out the final three bettors a total of 8 + 4 + 2 = 14 dollars, thus E(T ) = 14.
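A Monte Carlo check of these answers (a sketch; 20,000 trials keeps the run short):

import random

def mean_wait(pattern, trials=20_000):
    total = 0
    for _ in range(trials):
        s = ""
        while not s.endswith(pattern):
            s += random.choice("HT")
        total += len(s)
    return total / trials

print(mean_wait("HTH"), mean_wait("HHH"))   # approximately 10 and 14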
Example C-3 Guessing Red: A friend turns over the cards of a well
shuffled deck one at a time. You can stop anytime you choose and bet that
the next card is red. What is the best strategy?
Solution: Let Rn be the number of red cards left after n cards have been
turned over. Then
$$R_{n+1} = \begin{cases} R_n & \text{with probability } 1-p \\ R_n - 1 & \text{with probability } p, \end{cases}$$
Theorem C-2 Let P be an $n \times n$ matrix with $p_{ij} > 0$ and $\sum_j p_{ij} = 1$ for all i. Then the eigenvalue 1 is simple.

Proof: Let $(X_n)$ be the Markov chain with transition matrix P on state space $S = \{1, 2, \ldots, n\}$. A function $u : S \to \mathbf{R}$ is harmonic if and only if the vector $u = (u(1), u(2), \ldots, u(n))^T$ satisfies $Pu = u$, i.e., u is a right eigenvector for the eigenvalue 1. Clearly the constant functions are harmonic; we want to show that they are the only ones.

Suppose u is a right eigenvector, so that $u : S \to \mathbf{R}$ is harmonic and $u(X_n)$ is a (bounded!) martingale. Let $x, y \in S$ and $T_y := \inf(n \ge 1 : X_n = y)$ be the first time the chain hits state y. Since the chain is irreducible, we have $P_x(T_y < \infty) = 1$ and so

□
Theorem D-1 If $(M_n)$ is a martingale with $\sup_n E(|M_n|) < \infty$, then there is a random variable $M_\infty$ so that $M_n \to M_\infty$.
Proof: It suffices to show that for any −∞ < a < b < ∞, the probability
that (Mn ) fluctuates infinitely often between a and b is zero. To see this, we
define a new martingale (Wn ) which is the total “winnings” for a particular
betting strategy.
The strategy is to wait until the process goes below a, then keep betting until the process goes above b, and repeat. The winnings on the jth bet is $M_j - M_{j-1}$ so that
$$W_0 = 0, \qquad W_n = \sum_{j=1}^{n} B_j(M_j - M_{j-1}), \quad n \ge 1.$$
The following diagram explains the relationship between the two martingales.
[Figure: a path of the M process repeatedly crossing below a and back above b, with the corresponding W process below it; W is flat while no bet is on, gains at least b − a over each upcrossing, and at time n can be down by at most $|M_n - a|$.]
Example D-1
(a) Polya's urn. Let $M_n$ be the proportion of red balls in Polya's urn at time n. Then $(M_n)$ is a martingale and $0 \le M_n \le 1$, so $\sup_n E(|M_n|) \le 1$. Therefore $M_n \to M_\infty$ for some random variable $M_\infty$. It turns out that $M_\infty$ has a uniform distribution on (0, 1).
[Figure: several simulated paths of the proportion M_n of red balls in Polya’s
urn for 0 ≤ n ≤ 100; each path settles down to its own limiting value.]
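A simulation sketch of the urn (assuming the standard starting composition of one red and one blue ball, which is not stated in this excerpt) shows the limit behaving like a uniform random variable.

    # Polya's urn: draw a ball, return it together with another of the same
    # colour.  The proportion M_n of red balls converges; across independent
    # runs the limit looks uniform on (0, 1).
    import random

    def polya_limit(n_draws=2000, seed=None):
        rng = random.Random(seed)
        red, total = 1, 2                  # assumed start: one red, one blue
        for _ in range(n_draws):
            if rng.random() < red / total: # a red ball is drawn
                red += 1
            total += 1
        return red / total                 # M_n, close to its limit

    limits = sorted(polya_limit(seed=s) for s in range(1000))
    # Empirical deciles -- roughly 0.1, 0.2, ..., 0.9, as expected for a
    # uniform limit distribution.
    print([round(limits[k], 2) for k in range(100, 1000, 100)])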
[Figure: density of Σ_{j=1}^∞ ε_j / j.]
5 Continuous time Markov chains

A Poisson process
A typical sample path of this process looks like the broken line graph below.
The dots on the t-axis are the times when events occur.
[Figure: a sample path of X_t(ω), a step function climbing from 0 to 4; the
dots on the t-axis mark the times when events occur.]
The Poisson process is our standard model for random movement that
jumps. We get a Poisson process (X_t) with rate λ by assuming

(1) Independent increments: for s_1 < t_1 < s_2 < t_2 < · · · < s_n < t_n the
random variables X_{t_1} − X_{s_1}, . . . , X_{t_n} − X_{s_n} are independent.

(2) Stationarity: the distribution of X_t − X_s depends only on t − s.

(3) Jump probabilities: for small h, P(X_{t+h} − X_t = 1) ≈ λh and
P(X_{t+h} − X_t ≥ 2) ≈ 0.
These probability formulas are approximations, but it turns out that they
imply more precise formulas. In fact, Xt has a Poisson distribution with
parameter λt, that is,
    P(X_t = k) = e^{−λt} (λt)^k / k!,    k = 0, 1, 2, . . . .
To recap, the number of events that occur during [0, t] is a Poisson random
variable with mean λt. The same is true of the number of events that occur
during [s, s + t] for any s ≥ 0. This follows since the process (Xt+s − Xs )t≥0
satisfies conditions (1), (2), and (3) and hence is a Poisson process itself.
(a) Calculate the probability that, over five days, the company receives
exactly five claims.
(b) Calculate the probability that, over five days, the company receives
exactly one claim each day.
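The rate of the claims process is not visible in this excerpt, so the sketch below takes λ = 2 claims per day as a stand-in; part (a) uses the Poisson formula for the five-day count, and part (b) uses independence and stationarity of the daily increments.

    # Poisson claim counts; lambda = 2 per day is an assumed placeholder.
    from math import exp, factorial

    def poisson_pmf(k, mean):
        return exp(-mean) * mean**k / factorial(k)

    lam = 2.0
    # (a) exactly five claims in five days: Poisson(5*lam) evaluated at k = 5
    print(poisson_pmf(5, 5 * lam))
    # (b) exactly one claim each day: five independent Poisson(lam) counts,
    # each equal to 1
    print(poisson_pmf(1, lam) ** 5)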
Let’s look at the time between events. Define T1 to be the time until the
first event, and for j > 1, let Tj be the time elapsed between the (j − 1)st
event and the jth event. Notice that P(T1 > t) = P(Xt < 1) = e−λt , so
that T_1 has an exponential distribution with mean 1/λ. It’s a bit harder
to prove rigorously, but the sequence (T_j)_{j≥1} consists of independent,
identically distributed exponential variables.
Example A-2 Suppose that at time s we are still waiting for the first event.
What is the probability that we need to wait for more than t additional
time units? By independent increments,

    P(T_1 > s + t | T_1 > s) = e^{−λ(s+t)} / e^{−λs} = e^{−λt},

so the answer does not depend on s: the exponential distribution is
memoryless.
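A short simulation, with arbitrary values of λ, s, and t, illustrates this memoryless property.

    # Memorylessness of the exponential waiting time: among waits that
    # exceed s, the fraction exceeding s + t should match e^{-lambda t}.
    import random
    from math import exp

    rng = random.Random(0)
    lam, s, t = 1.5, 1.0, 0.8                     # arbitrary choices
    waits = [rng.expovariate(lam) for _ in range(200_000)]

    still_waiting = [w for w in waits if w > s]
    frac = sum(w > s + t for w in still_waiting) / len(still_waiting)
    print(frac, exp(-lam * t))                    # the two numbers agree closely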
Let (Yn ) be a Markov chain with finite state space S and transition matrix
P . Let (Nt ) be an independent Poisson process with rate λ, and define the
continuous time process Xt = YNt .
This process has the Markov property

    E(f(X_{t+s}) | X_r, 0 ≤ r ≤ s) = P_t f(X_s),

where P_t = e^{tλ(P−I)} is the matrix exponential of A := λ(P − I). We have
the equation (d/dt) P_t = A P_t with P_0 = I, and we call the matrix A the
infinitesimal generator of the process (X_t).
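The semigroup P_t is easy to compute numerically; here is a minimal sketch, with an arbitrary two-state jump chain and rate as placeholders.

    # P_t = exp(t * lambda * (P - I)) for a chain driven by a Poisson clock.
    import numpy as np
    from scipy.linalg import expm

    lam = 2.0
    P = np.array([[0.0, 1.0],
                  [0.5, 0.5]])               # placeholder jump chain
    A = lam * (P - np.eye(2))                # infinitesimal generator

    def Pt(t):
        return expm(t * A)                   # solves dP_t/dt = A P_t, P_0 = I

    print(Pt(0.0))                           # the identity matrix
    print(Pt(3.0))                           # rows approach the steady state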
The matrix A has every row sum zero, non-negative entries off the diag-
onal and non-positive entries on the diagonal. Every such matrix is the
infinitesimal generator of some continuous time Markov chain.
For x ≠ y ∈ S we define α(x, y) to be the (x, y) entry of A. This is the
rate at which the process changes from x to y. We also define α(x) =
Σ_{y≠x} α(x, y), the total rate at which the process moves from x.
Theorem B-1 If the chain (Xt ) is irreducible, then limt→∞ Pt (x, y) = π(y)
where π is the unique probability vector that satisfies πA = 0. These are
the steady state probabilities that give the long range fraction of time spent
in each state.
Example B-1 Let (Xt ) be a continuous Markov process with three states,
and transition rates as pictured below.
[Figure: transition rate diagram on the states α, β, γ: α→β at rate 5, α→γ at
rate 3, β→α at rate 5, β→γ at rate 4, γ→α at rate 4, γ→β at rate 6.]
The infinitesimal generator is

                α    β     γ
          α    −8    5     3
      A = β     5   −9     4
          γ     4    6   −10
We easily see that the mean holding times are µ_α = 1/8, µ_β = 1/9, and
µ_γ = 1/10. The conditional jump probabilities α(x, y)/α(x) form the matrix

                α      β      γ
          α     0     5/8    3/8
     P̃ =  β    5/9     0     4/9
          γ    4/10   6/10    0
Solving πA = 0 gives the steady state probabilities
π = (66/181, 68/181, 47/181) = (.3646, .3757, .2597).
To answer these questions, it is not necessary to calculate the matrix ex-
ponential Pt = etA . Maple can do it, but generally the results are not so
nice. For instance, here is one entry in Pt :
    P_t(α, α) = 66/181 + (115/362) exp((−27 + √5)t/2) + (209√5/1810) exp((−27 + √5)t/2)
              − (209√5/1810) exp(−(27 + √5)t/2) + (115/362) exp(−(27 + √5)t/2).
[Figure: P_t(α, α) plotted for 0 ≤ t ≤ 1, approaching the limit 66/181.]
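For a numerical check of this example (the notes use Maple; the following is a Python sketch), one can solve πA = 0 and evaluate the matrix exponential directly.

    # Numerical check of Example B-1.
    import numpy as np
    from scipy.linalg import expm

    A = np.array([[-8.0,  5.0,   3.0],
                  [ 5.0, -9.0,   4.0],
                  [ 4.0,  6.0, -10.0]])

    # Steady state: solve pi A = 0 with the entries summing to 1.
    M = np.vstack([A.T, np.ones(3)])
    b = np.array([0.0, 0.0, 0.0, 1.0])
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    print(pi)                          # approx (0.3646, 0.3757, 0.2597)

    # P_t(alpha, alpha) decays quickly toward 66/181 = 0.3646...
    for t in (0.1, 0.5, 2.0):
        print(t, expm(t * A)[0, 0])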
6 Brownian motion
A Basic properties
(1) Independent increments: for s1 < t1 < s2 < t2 < · · · < sn < tn the
random variables Xt1 − Xs1 , . . . , Xtn − Xsn are independent.
(2) Stationarity: the distribution of Xt − Xs depends only on t − s.
(3) Continuous paths: the sample path t 7→ Xt (ω) is continuous with
probability 1.
Here the random variables Xt take values in the state space Rd for d ≥ 1,
the starting point X0 = x ∈ Rd can be anywhere, and E(Xt ) = µt for some
fixed µ ∈ Rd .
When µ 6= 0 we have Brownian motion with drift, while if d > 1 we call
(Xt ) multi-dimensional Brownian motion. Conditions (1)–(3) imply that
Xt has a (multivariate) normal distribution for t > 0.
From now on, we only consider the standard version of Brownian motion.
Brownian motion is a Markov process with transition kernel
    P(X_t = y | X_0 = x) = p_t(x, y) = (2πt)^{−d/2} e^{−‖y−x‖²/2t},    y ∈ R^d.
This kernel satisfies the Chapman-Kolmogorov equation
    p_{s+t}(x, y) = ∫ p_s(x, z) p_t(z, y) dz.
[Figure: a simulated path of two-dimensional Brownian motion wandering
through the square [−10, 10]².]
so that, on average, d-dimensional Brownian motion is about √(dt) units
from its starting position at time t.

In fact, the average speed of Brownian motion over [0, t] is
E(‖X_t − X_0‖)/t ∼ √(d/t). For large t, this is near zero, while for small t,
it is near ∞.
In fact, with probability one the Brownian path is not differentiable at
t = 0.

Proof:

    {ω : X′_t(ω) exists at t = 0} ⊆ {ω : sup_{0<t≤1} ‖X_t(ω) − X_0(ω)‖/t < ∞}
        ⊆ {ω : sup_n 2^n ‖X_{2^{−n}}(ω) − X_0(ω)‖ < ∞}
        ⊆ {ω : sup_n 2^{n/2} ‖X_{2^{−n}}(ω) − X_{2^{−(n−1)}}(ω)‖ < ∞}
        = ∪_{k=1}^∞ {ω : sup_n 2^{n/2} ‖X_{2^{−n}}(ω) − X_{2^{−(n−1)}}(ω)‖ < k}.

Define A_k = {ω : sup_n 2^{n/2} ‖X_{2^{−n}}(ω) − X_{2^{−(n−1)}}(ω)‖ < k}. The random
variables Z_n := 2^{n/2}(X_{2^{−n}} − X_{2^{−(n−1)}}) are i.i.d. multivariate normal, so

    P(ω : X′_t(ω) exists at t = 0) ≤ Σ_k P(A_k) = Σ_k Π_n P(‖Z_n‖ < k) = 0.  □
By the three ingredients (1)–(3) that define Brownian motion, we see that
for any fixed s ≥ 0, the process (Xt+s − Xs ) is a Brownian motion indepen-
dent of Fs that starts at the origin. In other words, (Xt+s ) is a Brownian
motion, independent of Fs , with random starting point Xs .
An important generalization says that if T is a finite stopping time then
(XT +t ) is independent of FT , with random starting point XT .
Suppose Xt is a standard 1-dimensional Brownian motion starting at x,
and let x < b. We will prove that
P (Xs ≥ b for some 0 ≤ s ≤ t) = 2P(Xt ≥ b).
This follows from stopping the process at T_b, the first time (X_t) hits the
point {b}, then using symmetry. The picture below will help you to under-
stand the calculation:
    P(X_t ≥ b) = P(X_t ≥ b | T_b ≤ t) P(T_b ≤ t) = (1/2) P(T_b ≤ t),
which gives the result.
[Figure: a Brownian path together with its reflection across the level b after
the hitting time T_b, illustrating the symmetry.]
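A Monte Carlo sketch of this identity, using a discretized path started at 0 (the discrete maximum slightly undercounts the true one):

    # Reflection principle check: P(max_{s<=t} X_s >= b) vs 2 P(X_t >= b).
    import numpy as np

    rng = np.random.default_rng(0)
    t, b, n_steps, n_paths = 1.0, 1.0, 500, 10_000
    dt = t / n_steps

    steps = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    paths = np.cumsum(steps, axis=1)

    lhs = np.mean(paths.max(axis=1) >= b)
    rhs = 2 * np.mean(paths[:, -1] >= b)   # exact value 2(1 - Phi(1)) = 0.3173
    print(lhs, rhs)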
In particular, X_t must return to its starting point. You can extend this
argument to prove that one-dimensional Brownian motion visits every point,
and does so at arbitrarily large times.
Now let T be the hitting time of the set {a, b}. Since (X_t) is a martingale,
we have

    x = E_x(X_0) = E_x(X_T) = a P_x(X_T = a) + b P_x(X_T = b).

Using the fact that P_x(X_T = a) + P_x(X_T = b) = 1, we can conclude that
P_x(X_T = b) = (x − a)/(b − a).
Just like for the symmetric random walk, (X_t² − t) is a martingale, so

    E_x(X_0² − 0) = E_x(X_T² − T),
    x² = a² P_x(X_T = a) + b² P_x(X_T = b) − E_x(T).

The previous result plus a little algebra shows that

    E_x(T) = (b − x)(x − a).
If we let a → −∞, we find that, although P_x(T_b < ∞) = 1, we have
E_x(T_b) = ∞.
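Both exit formulas are easy to test by simulation; the sketch below uses an Euler discretization with arbitrary values of a, b, and x.

    # Exit of Brownian motion from (a, b): check P_x(X_T = b) = (x-a)/(b-a)
    # and E_x(T) = (b-x)(x-a).
    import numpy as np

    rng = np.random.default_rng(1)
    a, b, x, dt = -1.0, 2.0, 0.5, 1e-3
    n_paths = 2000

    hits_b, exit_times = 0, []
    for _ in range(n_paths):
        pos, t = x, 0.0
        while a < pos < b:
            pos += rng.normal(0.0, np.sqrt(dt))
            t += dt
        hits_b += pos >= b
        exit_times.append(t)

    print(hits_b / n_paths, (x - a) / (b - a))        # hitting probability
    print(np.mean(exit_times), (b - x) * (x - a))     # mean exit time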
Let us explore the connection with the heat equation on a bounded region
D of Rd . We fix a temperature distribution g on ∂D (the boundary) for all
time, and begin with an initial temperature distribution f in D at time 0.
The latter distribution will flow and eventually dissipate completely.
The solution to the heat equation for x ∈ D can be expressed as

    u(t, x) = E_x( f(X_t) 1_{(t<T)} + g(X_T) 1_{(t≥T)} ),
where T is the time when the process first hits the boundary. Letting
t → ∞ we arrive at
v(x) = Ex (g(XT )), (1)
the solution to the Dirichlet problem. That is, v is harmonic (∆v(x) =
0) inside D, and v = g on ∂D. For bounded regions D with smooth
boundaries, formula (1) gives the unique solution to the Dirichlet problem.
Example C-1 Let’s apply this result to a problem we already solved using
martingales. Let (Xt ) be 1-dimensional Brownian motion, D = (a, b), and
put g(a) = 0 and g(b) = 1. For d = 1, “harmonic” means “linear” so we
solve the problem with a straight line segment with the right boundary
conditions. This gives us
    v(x) = E_x(g(X_T)) = P_x(X_T = b) = (x − a)/(b − a).
[Figure: the linear solution v on the interval (a, b), with boundary values
g(a) = 0 and g(b) = 1, and the starting point x marked between a and b.]
[Figure: an annulus with inner radius R_1 and outer radius R_2, and a starting
point x between the two circles.]
So we can write v(x) = φ(r), where r = ( Σ_{i=1}^d x_i² )^{1/2}.
Two dimensional Brownian motion will hit any ball, no matter how small,
from any starting point. If we pursue this argument, we can divide R2
using a fine grid, and find that 2-d Brownian motion will visit every section
infinitely often.
On the other hand, if we leave R2 alone and let R1 → 0, we get
Two dimensional Brownian motion will never hit any particular point. The
process is neighborhood recurrent but not point recurrent.
It turns out that whether or not d-dimensional Brownian motion will hit
a set depends on its fractional dimension. The process can hit sets of
dimension greater than d − 2, but cannot hit sets of dimension less than
d − 2. In the d − 2 case, it depends on the particular set.
7 Stochastic integration
Consider a simple process (Y_t), constant on intervals:

    Y_t = Y_0   for 0 ≤ t < t_1,
          Y_1   for t_1 ≤ t < t_2,
           ..    ..
          Y_n   for t_n ≤ t < ∞.
We assume E(Y_i²) < ∞ and Y_i ∈ F_{t_i} for all i. Then it makes sense to
define, for t_j < t ≤ t_{j+1},

    Z_t = ∫_0^t Y_s dW_s = Σ_{i=1}^j Y_{i−1} [W_{t_i} − W_{t_{i−1}}] + Y_j [W_t − W_{t_j}].
[Figure: a simple process, piecewise constant on the intervals marked by
0 < t_1 < t_2 < t_3 < · · · < t_j < t < t_{j+1}.]
Here are some facts about the integral we’ve defined
2. Martingale property: (Z_t) is a martingale. For s < t, with t_j the first
partition point after s and t_k the last one before t,

    E(Z_t − Z_s | F_s) = E(Z_{t_j} − Z_s | F_s) + Σ_{i=j}^{k−1} E(Z_{t_{i+1}} − Z_{t_i} | F_s) + E(Z_t − Z_{t_k} | F_s)
                       = E(Z_{t_j} − Z_s | F_s) + Σ_{i=j}^{k−1} E(E(Z_{t_{i+1}} − Z_{t_i} | F_{t_i}) | F_s)
                         + E(E(Z_t − Z_{t_k} | F_{t_k}) | F_s)
                       = 0.
3. Variance formula: E(Z_t²) = ∫_0^t E(Y_s²) ds. This follows exactly as for
integration with respect to random walk.
The linearity, martingale property, and variance formula carry over to (Z_t).
If the integrand is a deterministic function f, then Z_t = ∫_0^t f(s) dW_s is a
normal random variable with mean zero and variance ∫_0^t f²(s) ds.
We can show that Z_t has independent increments as well, so that Z is just
a time-changed Brownian motion:

    Z_t = B( ∫_0^t f²(s) ds ).
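As a sketch of the variance formula and the normal limit, take the deterministic integrand f(s) = s (an arbitrary choice), for which the variance should be ∫_0^t s² ds = t³/3.

    # Variance formula check for Z_t = int_0^t f(s) dW_s with f(s) = s.
    import numpy as np

    rng = np.random.default_rng(2)
    t, n_steps, n_paths = 1.0, 500, 20_000
    ds = t / n_steps
    s = np.arange(n_steps) * ds                 # left endpoints

    dW = rng.normal(0.0, np.sqrt(ds), size=(n_paths, n_steps))
    Z = (s * dW).sum(axis=1)                    # Riemann-Ito sums

    print(Z.mean(), Z.var(), t**3 / 3)          # mean near 0, variance near 1/3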
C Ito’s formula
The intuition behind Ito’s formula is that you can replace
(W_{(j+1)t/n} − W_{jt/n})² by t/n with only a small amount of error. Therefore

    f(W_t) = f(W_0) + Σ_{j=0}^{n−1} f′(W_{jt/n}) (W_{(j+1)t/n} − W_{jt/n})
             + (1/2) Σ_{j=0}^{n−1} f″(W_{jt/n}) (t/n) + Σ_{j=0}^{n−1} o(t/n) + error.
Letting n → ∞ gives Ito’s formula:

    f(W_t) = f(W_0) + ∫_0^t f′(W_s) dW_s + (1/2) ∫_0^t f″(W_s) ds.
Example C-1 Suppose we want to calculate ∫_0^t W_s dW_s. The definition
gets us nowhere, so we try to apply the usual rules of calculus:

    ∫_0^t W_s dW_s = W_t² − W_0² − ∫_0^t W_s dW_s,
which implies ∫_0^t W_s dW_s = [W_t² − W_0²]/2. The only problem is that this
formula is false! Since W_0 = 0, we can see that it is fishy by taking expec-
tations on both sides: the left hand side gives zero but the right hand side
is strictly positive.
The moral of this example is that the usual rules of calculus do not apply
to stochastic integrals. So how do we calculate ∫_0^t W_s dW_s correctly? Let
f(x) = x², so f′(x) = 2x and f″(x) = 2. From Ito’s formula we find

    W_t² = W_0² + ∫_0^t 2W_s dW_s + (1/2) ∫_0^t 2 ds = 2 ∫_0^t W_s dW_s + t,
and therefore

    ∫_0^t W_s dW_s = (W_t² − t)/2.
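The identity is easy to see numerically: left-endpoint Riemann sums of ∫ W dW approach (W_t² − t)/2 rather than the ordinary-calculus answer W_t²/2. A sketch:

    # Pathwise check of the Ito integral int_0^t W dW.
    import numpy as np

    rng = np.random.default_rng(3)
    t, n = 1.0, 100_000
    dW = rng.normal(0.0, np.sqrt(t / n), size=n)
    W = np.concatenate([[0.0], np.cumsum(dW)])

    ito_sum = np.sum(W[:-1] * dW)               # left endpoints: Ito integral
    print(ito_sum, (W[-1]**2 - t) / 2, W[-1]**2 / 2)
    # The sum matches (W_t^2 - t)/2, not W_t^2/2.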
A more advanced version of Ito’s formula can handle functions that depend
on t as well as x:
    f(t, W_t) = f(0, W_0) + ∫_0^t ∂_s f(s, W_s) ds + ∫_0^t ∂_x f(s, W_s) dW_s
                + (1/2) ∫_0^t ∂_{xx} f(s, W_s) ds.
How do we solve this equation? Guess! Let Xt = exp(at + bWt ). From Ito’s
formula with f(t, x) = exp(at + bx) we get
    X_t = X_0 + ∫_0^t aX_s ds + ∫_0^t bX_s dW_s + (1/2) ∫_0^t b²X_s ds
        = X_0 + (a + b²/2) ∫_0^t X_s ds + b ∫_0^t X_s dW_s.
8 Appendix
We begin with the formula from page 2 for joint probabilities as products
of transition probabilities, both in its unconditional form

    P(X_0 = i_0, X_1 = i_1, . . . , X_n = i_n) = φ_0(i_0) p(i_0, i_1) · · · p(i_{n−1}, i_n),  (1)

and its conditional form

    P_{i_0}(X_1 = i_1, . . . , X_n = i_n) = p(i_0, i_1) p(i_1, i_2) · · · p(i_{n−1}, i_n).  (2)
On the right hand side, bring the conditional expectation inside the un-
conditional expectation, and then replace i_n by X_n:

    E(1_{[X_0=i_0,...,X_n=i_n]} g(X_n, . . . , X_{n+m})) = E(1_{[X_0=i_0,...,X_n=i_n]} E_{i_n}(g(X_0, . . . , X_m))).
Hitting times and visit times. Recall that for a subset E ⊆ S, we define
    T_E = inf{n ≥ 1 : X_n ∈ E},
    V_E = inf{n ≥ 0 : X_n ∈ E}.
Theorem A-2 If u(x) = P_x(V_E < ∞), then (P u)(x) = P_x(T_E < ∞).
The exact formula is not so important, but rather the fact that T_E =
1 + f(X_1, X_2, . . .). The strong Markov property, with T ≡ 1 and g = 1_{(f<∞)},
gives us the result. □
Proof: We begin with the special case n = N. Since the chain is ir-
reducible, it is possible to go from any state i to any state j. Consider
the shortest non-trivial path between them that has non-zero probability,
say i → s_2 → s_3 → · · · → s_K → j. Because it is the shortest path, there
aren’t any repeats in the set {i, s_2, . . . , s_K}, as we could remove unneces-
sary loops to get a shorter path with larger probability. With no repeats
we have K ≤ N, so there are at most N transitions in the path, and hence

    P_i(T > N) = P_i(X_m ≠ j, m = 1, . . . , N) = 1 − δ_i < 1.

Now define δ = inf{δ_i | i ∈ S}, so δ > 0 and

    P_i(T > N) ≤ 1 − δ.
Using the initial distribution of X0 we also get
P(T > N ) ≤ 1 − δ.
Now let’s try the case n = 2N. Define f(x_1, . . . , x_N) = 1_{[x_1≠j,...,x_N≠j]} and
use the Markov property (4) at time N to get

    P(T > 2N) = E(f(X_1, . . . , X_N) f(X_{N+1}, . . . , X_{2N}))
              = E[f(X_1, . . . , X_N) E_{X_N}(f(X_1, . . . , X_N))]
              ≤ (1 − δ) E[f(X_1, . . . , X_N)]
              ≤ (1 − δ)².
In a similar manner, we can show that for every integer q ≥ 0 we have

    P(T > qN) ≤ (1 − δ)^q.

For general n we write n = qN + r where 0 ≤ r < N, and note that the
probability on the left of (5) is decreasing in n. Therefore

    P(T > n) ≤ (1 − δ)^q ≤ (1 − δ)^{−1} (1 − δ)^{n/N},

which gives us the required result with C = (1 − δ)^{−1} and ρ = (1 − δ)^{1/N}.
It follows that

    E(T) = Σ_{n=0}^∞ P(T > n) ≤ Σ_{n=0}^∞ Cρ^n = C/(1 − ρ) < ∞.
Proof: Define

    g(x_0, x_1, . . .) = limsup_n (1/n) Σ_{k=1}^n 1_{(x_k=j)},
    h(x_0, x_1, . . .) = limsup_n (1/n) Σ_{k=1}^n 1_{(x_k=j)} − liminf_n (1/n) Σ_{k=1}^n 1_{(x_k=j)},
where (6) holds for functions f that don’t depend on the first few coor-
dinates, and such that f(x0 , x1 , . . .) = 0 when xk 6= j for all k ≥ 0. The
functions g and h defined above satisfy both conditions.
Ej (g(X0 , X1 , . . .)) = 0.
Proof:

    E_x(R) = E_x( Σ_{n=0}^∞ 1_{(X_n=x)} )
           = 1 + E_x( Σ_{n=1}^∞ 1_{(X_n=x)} )
           = 1 + E_x( 1_{(T_x<∞)} Σ_{n=1}^∞ 1_{(X_n=x)} )
           = 1 + E_x( 1_{(T_x<∞)} Σ_{n=T_x}^∞ 1_{(X_n=x)} )
           = 1 + E_x( 1_{(T_x<∞)} E_{X_{T_x}}( Σ_{n=0}^∞ 1_{(X_n=x)} ) )
           = 1 + P_x(T_x < ∞) E_x( Σ_{n=0}^∞ 1_{(X_n=x)} )
           = 1 + P_x(R > 1) E_x(R).  □
B Matrix magic
    P^n_{ij} = Σ_{k_1,...,k_{n−1}∈S} p_{ik_1} p_{k_1 k_2} · · · p_{k_{n−1} j} = P_i(X_n = j).
The left hand side is the (i, j)th entry of the matrix P^n, while the middle
adds the probabilities of all paths of length n from state i to state j. The
right hand side expresses the same quantity in terms of probability: the
chance that the chain moves from i to j in n steps.
Divide the state space S into disjoint pieces D and E, and define Q as
the submatrix of P of transition probabilities from D into itself.
Formula 1: For i, j ∈ D, we have

    Q^n_{ij} = Σ_{k_1,...,k_{n−1}∈D} p_{ik_1} p_{k_1 k_2} · · · p_{k_{n−1} j}.
The sum is over all paths of length n from state i to state j that stay
inside D. So in probability terms, Q^n_{ij} = P_i(X_n = j, T_E > n). Summing
over j ∈ D and n ≥ 0 gives

    Σ_{n=0}^∞ Σ_{j∈D} Q^n_{ij} = Σ_{n=0}^∞ P_i(T_E > n) = E_i(T_E).
Now let S be the submatrix of P of transition probabilities from D into E.
A similar computation gives

    (Q^n S)_{ij} = Σ_{k∈D} P_i(X_n = k, T_E > n) P(X_{n+1} = j | X_n = k, T_E > n, X_0 = i)
                 = Σ_{k∈D} P_i(X_{n+1} = j, X_n = k, T_E > n)
                 = P_i(X_{n+1} = j, T_E > n)
                 = P_i(X_{T_E} = j, T_E = n + 1).
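In matrix form, Σ_{n≥0} Q^n = (I − Q)^{−1}, so (I − Q)^{−1} applied to the all-ones vector collects the expected exit times, and (I − Q)^{−1}S the exit distribution. A small numerical sketch with a placeholder chain:

    # "Matrix magic" in practice: D = {0, 1} transient, E = {2} absorbing.
    import numpy as np

    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.0, 0.0, 1.0]])
    Q = P[:2, :2]                      # transitions within D
    S = P[:2, 2:]                      # transitions from D into E

    N = np.linalg.inv(np.eye(2) - Q)   # sum of Q^n over n >= 0
    print(N @ S)                       # P_i(X_{T_E} = j): here both entries are 1
    print(N @ np.ones(2))              # E_i(T_E), the expected exit times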
Define the function u1 (x) = supy : x→y f(y), where the supremum is taken
over all states y that can be reached from x.
We first show that u1 is superharmonic. Suppose x, z ∈ S and that you
can reach x from z. Then any state y that can be reached from x, can also
be reached from z and therefore u1 (x) ≤ u1 (z). Thus
    (P u_1)(z) = Σ_{x∈S} u_1(x) p(z, x) = Σ_{x∈S, z→x} u_1(x) p(z, x) ≤ Σ_{x∈S, z→x} u_1(z) p(z, x) ≤ u_1(z).
In fact, in any recurrent class in S, the state z with the maximum f value
satisfies f(z) = u1 (z) and hence f(z) = u∞ (z). Since the Markov chain
must eventually visit all states of some recurrent class, we see that Px (T <
∞) = 1.
Index
payoff function, 39
Poisson process, 74
Polya’s urn, 72
positive, 16
positive class, 17
positive recurrent, 35
stochastic integral, 90
stopping time, 3, 11, 39, 65
strategy, 39