Lecture 3 - MDPs and Dynamic Programming
Diana Borsa
Observations:
I p defines the dynamics of the problem
I Sometimes it is useful to marginalise out the reward or the next state, to obtain the state-transition probabilities or the expected reward:
$$p(s' \mid s, a) = \sum_r p(s', r \mid s, a), \qquad \mathbb{E}[R \mid s, a] = \sum_r \sum_{s'} r \, p(r, s' \mid s, a) .$$
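As a purely illustrative sketch (not from the lecture), suppose the joint dynamics are stored as a numpy array p_joint[s, a, s', k] = p(s', reward_values[k] | s, a); the two marginalisations above then become simple array sums:

```python
import numpy as np

def marginalise(p_joint, reward_values):
    """p_joint[s, a, s2, k] = p(s2, reward_values[k] | s, a); shape [S, A, S, R]."""
    # p(s' | s, a): sum out the reward index
    p_next = p_joint.sum(axis=3)                                  # shape [S, A, S]
    # E[R | s, a]: sum out s', then weight each reward index by its value
    expected_r = np.einsum('sak,k->sa', p_joint.sum(axis=2), reward_values)
    return p_next, expected_r
```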
Markov Decision Process: Alternative Definition
I r(s, a) = E [R | s, a] is the expected reward
I γ ∈ [0, 1] is a discount factor that trades off later rewards to earlier ones
Note: These are equivalent formulations; no additional assumptions are made with respect to the previous definition.
Markov Property: The future is independent of the past given the present
In a Markov Decision Process all states are assumed to have the Markov property.
I The state captures all relevant information from the history.
I Once the state is known, the history may be thrown away.
I The state is a sufficient statistic of the past.
Markov Property in an MDP: Test your understanding
In a Markov Decision Process all states are assumed to have the Markov property.
Q: In an MDP this property implies: (Which of the following statements are true?)
$$G_t = R_{t+1} + R_{t+2} + \ldots + R_T = \sum_{k=0}^{T-t-1} R_{t+k+1}$$
$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$$
$$G_t = \frac{1}{T-t-1}\left( R_{t+1} + R_{t+2} + \ldots + R_T \right) = \frac{1}{T-t-1} \sum_{k=0}^{T-t-1} R_{t+k+1}$$
Note: These are random variables that depend on the MDP and on the policy
Discounted Return
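As a small illustrative sketch (the rewards and γ below are made-up values, not from the lecture), the discounted return can be computed either directly from the sum or backwards via the recursion Gt = Rt+1 + γGt+1:

```python
gamma = 0.9
rewards = [0.0, 0.0, -1.0, 2.0]           # assumed values for R_{t+1}, ..., R_T

# Direct evaluation of the sum above
direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Backwards recursion G_t = R_{t+1} + gamma * G_{t+1}, starting from G_T = 0
recursive = 0.0
for r in reversed(rewards):
    recursive = r + gamma * recursive

assert abs(direct - recursive) < 1e-12    # both equal 0.648
```

This recursion is exactly what the Bellman equations later in the lecture build on.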
Goal of an RL agent
To find a behaviour policy that maximises the (expected) return Gt
I A policy is a mapping π : S × A → [0, 1] that, for every state s, assigns to each action a ∈ A the probability π(a|s) of taking that action in state s.
I For deterministic policies, we sometimes use the notation at = π(st ) to denote the
action taken by the policy.
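A minimal sketch, assuming a tabular representation (all sizes and names below are illustrative): a stochastic policy can be stored as an array pi[s, a] = π(a|s) and sampled from directly:

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions = 3, 2
pi = np.full((num_states, num_actions), 1.0 / num_actions)   # uniformly random policy

s = 1
a = rng.choice(num_actions, p=pi[s])                          # A_t ~ pi(. | S_t = s)
```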
Value Functions
vπ (s) = E [Gt | St = s, π]
qπ (s, a) = E [Gt | St = s, At = a, π]
The optimal action-value function q ∗ (s, a) is the maximum action-value function over
all policies
I The optimal value function specifies the best possible performance in the MDP
I An MDP is “solved” when we know the optimal value function
Optimal Policy
$$\pi^*(s, a) = \begin{cases} 1 & \text{if } a = \operatorname*{argmax}_{a \in \mathcal{A}} q^*(s, a) \\ 0 & \text{otherwise} \end{cases}$$
Observations:
I There is always a deterministic optimal policy for any MDP
I If we know q ∗ (s, a), we immediately have the optimal policy
I There can be multiple optimal policies
I If multiple actions maximize q∗ (s, ·), we can also just pick any of these
(including stochastically)
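As a minimal sketch under a tabular representation (array names are assumptions, not from the lecture), the deterministic greedy policy can be read off an action-value array q of shape [S, A]:

```python
import numpy as np

def greedy_policy(q):
    """Return pi[s, a] = 1 for the greedy action in each state, 0 otherwise."""
    num_states, num_actions = q.shape
    pi = np.zeros((num_states, num_actions))
    best = q.argmax(axis=1)                  # ties broken by taking the first maximiser
    pi[np.arange(num_states), best] = 1.0
    return pi
```

Here ties are broken by taking the first maximiser; as noted above, any tie-breaking rule over the maximising actions, including a stochastic one, is equally optimal.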
Bellman Equations
Value Function
vπ (s) = E [Gt | St = s, π]
qπ (s, a) = E [Gt | St = s, At = a, π]
I This implies
$$v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, \pi \right], \qquad q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a \right]$$
I Note that
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) = \mathbb{E}\left[ q_\pi(S_t, A_t) \mid S_t = s, \pi \right], \quad \forall s$$
Bellman Equations
There can be no policy with a higher value than v∗ (s) = maxπ vπ (s), ∀s
Some intuition
(Reminder) Greedy on v ∗ = Optimal Policy
I An optimal policy can be found by maximising over q ∗ (s, a),
$$\pi^*(s, a) = \begin{cases} 1 & \text{if } a = \operatorname*{argmax}_{a \in \mathcal{A}} q^*(s, a) \\ 0 & \text{otherwise} \end{cases}$$
Exercise: (pause to work this out)
I Consider the following MDP:
I The actions have a 0.9 probability of success; with probability 0.1 we remain in the same state
I Rt = 0 for all transitions that end up in S0 , and Rt = −1 for all other transitions
I The Bellman value equation, for a given policy π, can be expressed using matrices,
$$v = r^\pi + \gamma P^\pi v$$
where
$$v_i = v(s_i), \qquad r^\pi_i = \mathbb{E}\left[ R_{t+1} \mid S_t = s_i, A_t \sim \pi(S_t) \right], \qquad P^\pi_{ij} = p(s_j \mid s_i) = \sum_a \pi(a \mid s_i)\, p(s_j \mid s_i, a)$$
Bellman Equation in Matrix Form
I This is a linear system that can be solved directly:
$$v = r^\pi + \gamma P^\pi v$$
$$(I - \gamma P^\pi)\, v = r^\pi$$
$$v = (I - \gamma P^\pi)^{-1} r^\pi$$
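A minimal numerical sketch of this closed form, assuming tabular inputs r_pi[i] = r^π_i and P_pi[i, j] = P^π_ij (names are illustrative, not from the lecture):

```python
import numpy as np

def evaluate_policy_exact(r_pi, P_pi, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for v."""
    n = r_pi.shape[0]
    # Prefer solve() over forming the inverse explicitly; same result, better conditioned
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```

Solving the linear system directly costs on the order of |S|³ operations, which motivates the iterative scheme below for larger state spaces.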
Algorithm
I First, initialise v0 , e.g., to zero
I Then, iterate
$$\forall s: \quad v_{k+1}(s) \leftarrow \mathbb{E}\left[ R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, \pi \right]$$
I Stopping: whenever vk+1 (s) = vk (s), for all s, we must have found vπ
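A sketch of this iteration, assuming tabular arrays p[s, a, s'] = p(s' | s, a), r[s, a] = E[R | s, a] and pi[s, a] = π(a | s) (names are illustrative); in practice the loop stops when the change falls below a small tolerance rather than waiting for exact equality:

```python
import numpy as np

def policy_evaluation(p, r, pi, gamma, tol=1e-8):
    v = np.zeros(p.shape[0])                     # v_0 = 0
    while True:
        q = r + gamma * p @ v                    # E[R_{t+1} + gamma v_k(S_{t+1}) | s, a]
        v_new = (pi * q).sum(axis=1)             # expectation over a ~ pi(. | s)
        if np.max(np.abs(v_new - v)) < tol:      # approximate stopping criterion
            return v_new
        v = v_new
```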
Algorithm
Iterate, using greedy policy improvement:
$$\forall s: \quad \pi_{\text{new}}(s) = \operatorname*{argmax}_{a} q_\pi(s, a)$$
I Claim: One can show that vπnew (s) ≥ vπ (s), for all s
Policy Improvement: qπnew (s, a) ≥ qπ (s, a)
Policy Iteration
I Extreme: Why not update policy every iteration — i.e. stop after k = 1?
I This is equivalent to value iteration
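Putting full evaluation and greedy improvement together gives policy iteration; a sketch under the same tabular assumptions, reusing the policy_evaluation sketch above:

```python
import numpy as np

def policy_iteration(p, r, gamma):
    num_states, num_actions = r.shape
    pi = np.full((num_states, num_actions), 1.0 / num_actions)   # start with a uniform policy
    while True:
        v = policy_evaluation(p, r, pi, gamma)                    # full evaluation of pi
        q = r + gamma * p @ v                                      # one-step lookahead
        greedy = q.argmax(axis=1)                                  # greedy action per state
        pi_new = np.zeros_like(pi)
        pi_new[np.arange(num_states), greedy] = 1.0                # deterministic improvement
        if np.array_equal(pi_new, pi):                             # policy stable => optimal
            return pi, v
        pi = pi_new
```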
Value Iteration
I We could take the Bellman optimality equation, and turn that into an update
I Initialise v0
I Update: vk+1 (s) ← maxa E [Rt+1 + γvk (St+1 ) | St = s, At = a]
I Stopping: whenever vk+1 (s) = vk (s), for all s, we must have found v ∗
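A sketch of this update under the same tabular assumptions as before (tolerance-based stopping in place of exact equality):

```python
import numpy as np

def value_iteration(p, r, gamma, tol=1e-8):
    v = np.zeros(p.shape[0])
    while True:
        q = r + gamma * p @ v          # E[R_{t+1} + gamma v_k(S_{t+1}) | S_t = s, A_t = a]
        v_new = q.max(axis=1)          # greedy (max) backup over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```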
Example: Shortest Path
Problem: a 4×4 gridworld with goal state g in the top-left corner (. denotes an empty cell). Successive value-iteration estimates V1, ..., V7:

Problem            V1               V2               V3
 g  .  .  .         0  0  0  0       0 -1 -1 -1       0 -1 -2 -2
 .  .  .  .         0  0  0  0      -1 -1 -1 -1      -1 -2 -2 -2
 .  .  .  .         0  0  0  0      -1 -1 -1 -1      -2 -2 -2 -2
 .  .  .  .         0  0  0  0      -1 -1 -1 -1      -2 -2 -2 -2

V4                 V5               V6               V7
 0 -1 -2 -3         0 -1 -2 -3       0 -1 -2 -3       0 -1 -2 -3
-1 -2 -3 -3        -1 -2 -3 -4      -1 -2 -3 -4      -1 -2 -3 -4
-2 -3 -3 -3        -2 -3 -4 -4      -2 -3 -4 -5      -2 -3 -4 -5
-3 -3 -3 -3        -3 -4 -4 -4      -3 -4 -5 -5      -3 -4 -5 -6
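The grids above can be reproduced with the same backup, assuming deterministic moves, a reward of -1 per step, no discounting (γ = 1), and a terminal goal in the top-left corner (assumptions inferred from the values shown, not stated on the slide):

```python
import numpy as np

N = 4
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]             # up, down, left, right
goal = (0, 0)

v = np.zeros((N, N))                                    # V1: all zeros
for k in range(2, 8):                                   # compute V2 ... V7
    v_new = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if (i, j) == goal:
                continue                                # terminal goal keeps value 0
            best = -np.inf
            for di, dj in moves:
                ni = min(max(i + di, 0), N - 1)         # bumping into a wall leaves us in place
                nj = min(max(j + dj, 0), N - 1)
                best = max(best, -1.0 + v[ni, nj])      # reward -1 per step, gamma = 1
            v_new[i, j] = best
    v = v_new
    print(f"V{k}:\n{v.astype(int)}")                    # V7 matches the last grid above
```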
Synchronous Dynamic Programming Algorithms
Observations:
I Algorithms are based on the state-value function vπ (s) or v ∗ (s) ⇒ complexity O(|A||S|²) per iteration, for |A| actions and |S| states
I Could also apply to the action-value function qπ (s, a) or q ∗ (s, a) ⇒ complexity O(|A|²|S|²) per iteration
Extensions to Dynamic Programming
Asynchronous Dynamic Programming
I Asynchronous DP
I backs up states individually, in any order
I can significantly reduce computation
I guaranteed to converge if all states continue to be selected
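A minimal sketch of such an in-place backup under the same tabular assumptions as earlier; the fixed sweep order below is just one possible way of selecting states:

```python
import numpy as np

def async_value_iteration(p, r, gamma, num_sweeps=1000):
    v = np.zeros(p.shape[0])
    for _ in range(num_sweeps):
        for s in range(p.shape[0]):                   # any order works, as long as
            v[s] = np.max(r[s] + gamma * p[s] @ v)    # every state keeps being selected
    return v
```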
The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton
For questions that may arise during this lecture please use Moodle and/or the next
Q&A session.