Lecture 3 - MDPs and Dynamic Programming


Lecture 3:

Markov Decision Processes and Dynamic Programming

Diana Borsa

January 15, 2021


Background

Sutton & Barto 2018, Chapter 3 + 4


Recap

I Reinforcement learning is the science of learning to make decisions


I Agents can learn a policy, value function and/or a model
I The general problem involves taking into account time and consequences
I Decisions affect the reward, the agent state, and environment state
This Lecture

I Last lecture: multiple actions, but only one state—no model


I This lecture:
I Formalise the problem with full sequential structure
I Discuss a first class of solution methods, which assume the true model is given
I These methods are called dynamic programming
I Next lectures: use similar ideas, but use sampling instead of true model
Formalising the RL interaction
Formalising the RL interface

I We will discuss a mathematical formulation of the agent-environment interaction


I This is called a Markov Decision Process (MDP)
I Enables us to talk clearly about the objective and how to achieve it
MDPs: A simplifying assumption

I For now, assume the environment is fully observable:


⇒ the current observation contains all relevant information

I Note: Almost all RL problems can be formalised as MDPs, e.g.,


I Optimal control primarily deals with continuous MDPs
I Partially observable problems can be converted into MDPs
I Bandits are MDPs with one state
Markov Decision Process
Definition (Markov Decision Process - Sutton & Barto 2018 )
A Markov Decision Process is a tuple (S, A, p, γ), where
I S is the set of all possible states
I A is the set of all possible actions (e.g., motor controls)
I p(r, s' | s, a) is the joint probability of a reward r and next state s', given a state s
and action a
I γ ∈ [0, 1] is a discount factor that trades off later rewards to earlier ones

Observations:
I p defines the dynamics of the problem
I Sometimes it is useful to marginalise out the state transitions or expected reward:
p(s' | s, a) = Σ_r p(s', r | s, a)          E[R | s, a] = Σ_r Σ_{s'} r p(r, s' | s, a)
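As a concrete illustration, here is a minimal Python sketch (not from the lecture; the dictionary layout, state names and numbers are made up) of how the joint dynamics p(r, s' | s, a) could be stored as a table, and how the two marginalisations above fall out of it.

```python
# Hypothetical tabular representation of p(r, s' | s, a):
# each (state, action) maps to a list of (probability, reward, next_state).
dynamics = {
    ("s0", "a0"): [(0.9, 1.0, "s1"), (0.1, 0.0, "s0")],
    ("s0", "a1"): [(1.0, 0.0, "s0")],
}

def transition_probability(dynamics, s, a, s_next):
    """p(s' | s, a) = sum_r p(s', r | s, a): marginalise out the reward."""
    return sum(p for p, r, s2 in dynamics[(s, a)] if s2 == s_next)

def expected_reward(dynamics, s, a):
    """E[R | s, a] = sum over (r, s') of r * p(r, s' | s, a)."""
    return sum(p * r for p, r, s2 in dynamics[(s, a)])

print(transition_probability(dynamics, "s0", "a0", "s1"))  # 0.9
print(expected_reward(dynamics, "s0", "a0"))               # 0.9
```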
Markov Decision Process: Alternative Definition

Definition (Markov Decision Process)


A Markov Decision Process is a tuple (S, A, p, r ,γ), where
I S is the set of all possible states
I A is the set of all possible actions (e.g., motor controls)
I p(s' | s, a) is the probability of transitioning to s', given a state s and action a
I r : S × A → R is the expected reward, achieved on a transition starting in (s, a):

r(s, a) = E[R | s, a]

I γ ∈ [0, 1] is a discount factor that trades off later rewards to earlier ones

Note: These are equivalent formulations; no additional assumptions are made w.r.t. the previous definition.
Markov Property: The future is independent of the past given the present

Definition (Markov Property)


Consider a sequence of random variables, {St}t∈N, indexed by time. A state s has the
Markov property when, for all states s' ∈ S,

p(St+1 = s' | St = s) = p(St+1 = s' | ht−1, St = s)

for all possible histories ht−1 = {S1, ..., St−1, A1, ..., At−1, R1, ..., Rt−1}

In a Markov Decision Process all states are assumed to have the Markov property.
I The state captures all relevant information from the history.
I Once the state is known, the history may be thrown away.
I The state is a sufficient statistic of the past.
Markov Property in an MDP: Test your understanding

In a Markov Decision Process all states are assumed to have the Markov property.

Q: In an MDP this property implies: (Which of the following statements are true?)

p(St+1 = s' | St = s, At = a) = p(St+1 = s' | S1, ..., St−1, A1, ..., At, St = s)     (1)

p(St+1 = s' | St = s, At = a) = p(St+1 = s' | S1, ..., St−1, St = s, At = a)     (2)

p(St+1 = s' | St = s, At = a) = p(St+1 = s' | S1, ..., St−1, St = s)     (3)

p(Rt+1 = r, St+1 = s' | St = s) = p(Rt+1 = r, St+1 = s' | S1, ..., St−1, St = s)     (4)

Example: cleaning robot

I Consider a robot that cleans soda cans


I Two states: high battery charge or low battery charge
I Actions: {wait, search} in high, {wait, search, recharge} in low
I Dynamics may be stochastic
I p(St+1 = high | St = high, At = search) = α
I p(St+1 = low | St = high, At = search) = 1 − α
I Reward could be expected number of collected cans (deterministic), or actual
number of collected cans (stochastic)

Reference: Sutton and Barto, Chapter 3, pg 52-53.
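The stated transitions can be written down directly; below is a small sketch (α and its numerical value are placeholders, and the low-battery dynamics and rewards are omitted because the slide leaves them unspecified).

```python
alpha = 0.8  # hypothetical value of the success probability alpha

# p(next_state | state, action) for the transitions given above.
p_next_state = {
    ("high", "search"): {"high": alpha, "low": 1.0 - alpha},
}

# Sanity check: probabilities for a given (state, action) sum to one.
assert abs(sum(p_next_state[("high", "search")].values()) - 1.0) < 1e-9
```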


Example: robot MDP
Example: robot MDP
Formalising the objective
Returns
I Acting in an MDP results in immediate rewards Rt , which lead to returns Gt :
I Undiscounted return (episodic / finite horizon problem)

Gt = Rt+1 + Rt+2 + ... + RT = Σ_{k=0}^{T−t−1} Rt+k+1

I Discounted return (finite or infinite horizon problem)

Gt = Rt+1 + γRt+2 + ... + γ^{T−t−1} RT = Σ_{k=0}^{T−t−1} γ^k Rt+k+1

I Average return (continuing, infinite horizon problem)

Gt = (Rt+1 + Rt+2 + ... + RT) / (T − t − 1) = (1 / (T − t − 1)) Σ_{k=0}^{T−t−1} Rt+k+1


Note: These are random variables that depend on the MDP and the policy
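To make the definitions concrete, here is a small sketch (the helper names and the reward sequence are illustrative, not from the slides) computing the undiscounted and discounted returns for a finite reward sequence R_{t+1}, ..., R_T.

```python
def undiscounted_return(rewards):
    """G_t = R_{t+1} + R_{t+2} + ... + R_T."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0]               # R_{t+1}, R_{t+2}, R_{t+3}
print(undiscounted_return(rewards))     # 3.0
print(discounted_return(rewards, 0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```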
Discounted Return

I Discounted returns Gt for infinite horizon T → ∞:



Gt = Rt+1 + γRt+2 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1

I The discount γ ∈ [0, 1] is the present value of future rewards


I The marginal value of receiving reward R after k + 1 time-steps is γ^k R
I For γ < 1, immediate rewards are more important than delayed rewards
I γ close to 0 leads to ”myopic” evaluation
I γ close to 1 leads to ”far-sighted” evaluation
Why discount?

Most Markov decision processes are discounted. Why?


I Problem specification:
I Immediate rewards may actually be more valuable (e.g., consider earning interest)
I Animal/human behaviour shows preference for immediate reward
I Solution side:
I Mathematically convenient to discount rewards
I Avoids infinite returns in cyclic Markov processes
I The way to think about it: reward and discount together determine the goal
Policies

Goal of an RL agent
To find a behaviour policy that maximises the (expected) return Gt

I A policy is a mapping π : S × A → [0, 1] that, for every state s, assigns to each
action a ∈ A the probability π(a|s) of taking that action in state s.
I For deterministic policies, we sometimes use the notation at = π(st ) to denote the
action taken by the policy.
Value Functions

I The value function v (s) gives the long-term value of state s

vπ (s) = E [Gt | St = s, π]

I We can define (state-)action values:

qπ (s, a) = E [Gt | St = s, At = a, π]

I (Connection between them) Note that:


vπ(s) = Σ_a π(a | s) qπ(s, a) = E[qπ(St, At) | St = s, π] ,   ∀s
Optimal Value Function

Definition (Optimal value functions)


The optimal state-value function v ∗ (s) is the maximum value function over all policies

v∗(s) = max_π vπ(s)

The optimal action-value function q ∗ (s, a) is the maximum action-value function over
all policies

q∗(s, a) = max_π qπ(s, a)

I The optimal value function specifies the best possible performance in the MDP
I An MDP is “solved” when we know the optimal value function
Optimal Policy

Define a partial ordering over policies

π ≥ π' ⇐⇒ vπ(s) ≥ vπ'(s) , ∀s

Theorem (Optimal Policies)


For any Markov decision process
I There exists an optimal policy π ∗ that is better than or equal to all other policies,
π ∗ ≥ π, ∀π
(There can be more than one such optimal policy.)

I All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)

I All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)
Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a):

π∗(s, a) = 1   if a = argmax_{a∈A} q∗(s, a)
π∗(s, a) = 0   otherwise

Observations:
I There is always a deterministic optimal policy for any MDP
I If we know q ∗ (s, a), we immediately have the optimal policy
I There can be multiple optimal policies
I If multiple actions maximize q∗ (s, ·), we can also just pick any of these
(including stochastically)
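A minimal sketch of extracting such a greedy policy from a tabular q∗, assuming it is stored as a NumPy array q[s, a] (the array layout is an assumption for illustration, not part of the lecture).

```python
import numpy as np

def greedy_policy(q):
    """Deterministic policy pi[s] = argmax_a q[s, a].
    Ties are broken by taking the first maximising action."""
    return np.argmax(q, axis=1)

q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1]])
print(greedy_policy(q_star))  # [1 0]
```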
Bellman Equations
Value Function

I The value function v (s) gives the long-term value of state s

vπ (s) = E [Gt | St = s, π]

I It can be defined recursively:

vπ(s) = E[Rt+1 + γGt+1 | St = s, π]
      = E[Rt+1 + γvπ(St+1) | St = s, At ∼ π(St)]
      = Σ_a π(a | s) Σ_r Σ_{s'} p(r, s' | s, a) (r + γvπ(s'))

I The final step writes out the expectation explicitly


Action values
I We can define state-action values

qπ (s, a) = E [Gt | St = s, At = a, π]

I This implies

qπ(s, a) = E[Rt+1 + γvπ(St+1) | St = s, At = a]
         = E[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]
         = Σ_r Σ_{s'} p(r, s' | s, a) ( r + γ Σ_{a'} π(a' | s') qπ(s', a') )

I Note that
vπ(s) = Σ_a π(a | s) qπ(s, a) = E[qπ(St, At) | St = s, π] ,   ∀s
Bellman Equations

Theorem (Bellman Expectation Equations)


Given an MDP, M = ⟨S, A, p, r, γ⟩, for any policy π, the value functions obey the
following expectation equations:

vπ(s) = Σ_a π(a | s) [ r(s, a) + γ Σ_{s'} p(s' | a, s) vπ(s') ]     (5)

qπ(s, a) = r(s, a) + γ Σ_{s'} p(s' | a, s) Σ_{a'∈A} π(a' | s') qπ(s', a')     (6)
The Bellman Optimality Equations

Theorem (Bellman Optimality Equations)


Given an MDP, M = ⟨S, A, p, r, γ⟩, the optimal value functions obey the following
equations:

v∗(s) = max_a [ r(s, a) + γ Σ_{s'} p(s' | a, s) v∗(s') ]     (7)

q∗(s, a) = r(s, a) + γ Σ_{s'} p(s' | a, s) max_{a'∈A} q∗(s', a')     (8)

There can be no policy with a higher value than v∗(s) = max_π vπ(s), ∀s
Some intuition
(Reminder) Greedy on v ∗ = Optimal Policy
I An optimal policy can be found by maximising over q∗(s, a):

π∗(s, a) = 1   if a = argmax_{a∈A} q∗(s, a)
π∗(s, a) = 0   otherwise

I Apply the Bellman Expectation Eq. (6):

qπ∗(s, a) = r(s, a) + γ Σ_{s'} p(s' | a, s) Σ_{a'∈A} π∗(a' | s') qπ∗(s', a')
          = r(s, a) + γ Σ_{s'} p(s' | a, s) max_{a'∈A} q∗(s', a')

(Since π∗ puts all its probability on a maximising action, the inner sum over a' equals max_{a'} q∗(s', a').)
Solving RL problems using the Bellman Equations
Problems in RL

I Problem 1: Estimating vπ or qπ is called policy evaluation or, simply, prediction


I Given a policy, what is my expected return under that behaviour?
I Given this treatment protocol/trading strategy, what is my expected return?

I Problem 2: Estimating v∗ or q∗ is sometimes called control, because these can be used
for policy optimisation
I What is the optimal way of behaving? What is the optimal value function?
I What is the optimal treatment? What is the optimal control policy to minimise
time, fuel consumption, etc?
Exercise:

I Consider the following MDP:

I The actions have a 0.9 probability of success, and with probability 0.1 we remain in the
same state
I Rt = 0 for all transitions that end up in S0 , and Rt = −1 for all other transitions
Exercise: (pause to work this out)
I Consider the following MDP:

I The actions have a 0.9 probability of success, and with probability 0.1 we remain in the
same state
I Rt = 0 for all transitions that end up in S0 , and Rt = −1 for all other transitions

I Q: Evaluation problems (Consider a discount γ = 0.9)


I What is vπ for π(s) = a1 (→), ∀s?
I What is vπ for the uniformly random policy?
I Same policy evaluation problems for γ = 0.0? (What do you notice?)
A solution
Bellman Equation in Matrix Form

I The Bellman value equation, for a given π, can be expressed using matrices,

v = rπ + γPπ v

where

v_i = vπ(s_i)
r^π_i = E[Rt+1 | St = s_i, At ∼ π(St)]
P^π_ij = p(s_j | s_i) = Σ_a π(a | s_i) p(s_j | s_i, a)
Bellman Equation in Matrix Form
I The Bellman equation, for a given policy π, can be expressed using matrices,

v = rπ + γPπ v

I This is a linear equation that can be solved directly:

v = rπ + γPπ v
(I − γPπ) v = rπ
v = (I − γPπ)^{−1} rπ

I Computational complexity is O(|S|^3) — only possible for small problems


I There are iterative methods for larger problems
I Dynamic programming
I Monte-Carlo evaluation
I Temporal-Difference learning
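A sketch of the direct solution for a toy two-state problem, using NumPy; the values of Pπ and rπ below are made-up numbers just to show the mechanics.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])   # P_pi[i, j] = p(s_j | s_i) under pi
r_pi = np.array([1.0, 0.0])     # r_pi[i] = E[R_{t+1} | S_t = s_i, A_t ~ pi]

# Solve (I - gamma * P_pi) v = r_pi; solving the linear system is
# cheaper and more stable than forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
print(v)  # value of each state under pi
```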
Solving the Bellman Optimality Equation

I The Bellman optimality equation is non-linear


I Cannot use the same direct matrix solution as for policy evaluation (in general)

I Many iterative solution methods:


I Using models / dynamic programming
I Value iteration
I Policy iteration
I Using samples
I Monte Carlo
I Q-learning
I Sarsa
Dynamic Programming
Dynamic Programming
The 1950s were not good years for mathematical research. I felt I had to shield
the Air Force from the fact that I was really doing mathematics. What title,
what name, could I choose? I was interested in planning, in decision making,
in thinking. But planning is not a good word for various reasons. I decided
to use the word ‘programming.’ I wanted to get across the idea that this was
dynamic, this was time-varying—I thought, let’s kill two birds with one stone.
Let’s take a word that has a precise meaning, namely dynamic, in the classical
physical sense. It also is impossible to use the word, dynamic, in a pejorative
sense. Try thinking of some combination that will possibly give it a pejorative
meaning. It’s impossible. Thus, I thought dynamic programming was a good
name. It was something not even a Congressman could object to. So I used it
as an umbrella for my activities.
– Richard Bellman
(slightly paraphrased for conciseness)
Dynamic programming

Dynamic programming refers to a collection of algorithms that can be used


to compute optimal policies given a perfect model of the environment as a
Markov decision process (MDP).

Sutton & Barto 2018


I We will discuss several dynamic programming methods to solve MDPs
I All such methods consist of two important parts:
policy evaluation and policy improvement
Policy evaluation

I We start by discussing how to estimate

vπ (s) = E [Rt+1 + γvπ (St+1 ) | s, π]

I Idea: turn this equality into an update

Algorithm
I First, initialise v0 , e.g., to zero
I Then, iterate
∀s : vk+1 (s) ← E [Rt+1 + γvk (St+1 ) | s, π]
I Stopping: whenever vk+1 (s) = vk (s), for all s, we must have found vπ

I Q: Does this algorithm always converge?


Answer : Yes, under appropriate conditions (e.g., γ < 1). More next lecture!
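A sketch of this iteration for a tabular problem. The array layout (P[a, s, s'] for transition probabilities, R[a, s] for expected rewards, pi[s, a] for the policy) is an assumption for illustration, not the lecture's notation, and the loop stops once the values change by less than a tolerance rather than checking exact equality.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative policy evaluation: v_{k+1}(s) = E[R + gamma * v_k(S') | s, pi]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        r_pi = np.einsum("sa,as->s", pi, R)    # expected one-step reward under pi
        P_pi = np.einsum("sa,ast->st", pi, P)  # state-to-state transitions under pi
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```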
Example: Policy evaluation
Policy evaluation
Policy evaluation
Policy evaluation + Greedy Improvement
Policy evaluation + Greedy Improvement
Policy Improvement
I The example already shows we can use evaluation to then improve our policy
I In fact, just being greedy with respect to the values of the random policy sufficed!
(That is not true in general)

Algorithm
Iterate, using

∀s : πnew(s) = argmax_a qπ(s, a)
             = argmax_a E[Rt+1 + γvπ(St+1) | St = s, At = a]

Then, evaluate πnew and repeat

I Claim: One can show that vπnew (s) ≥ vπ (s), for all s
Policy Improvement: qπnew (s, a) ≥ qπ (s, a)
Policy Iteration

Policy evaluation: estimate vπ

Policy improvement: generate π' ≥ π

(A sketch of the full loop follows below.)
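The sketch below reuses the policy_evaluation sketch from earlier and the same assumed tabular arrays P[a, s, s'] and R[a, s]; it alternates evaluation and greedy improvement until the policy stops changing.

```python
import numpy as np

def greedy_improvement(P, R, v, gamma):
    """Greedy policy w.r.t. the one-step lookahead q[a, s]."""
    q = R + gamma * np.einsum("ast,t->as", P, v)
    return np.argmax(q, axis=0)                 # deterministic policy pi[s]

def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        pi_probs = np.eye(n_actions)[pi]        # one-hot pi[s, a] for evaluation
        v = policy_evaluation(P, R, pi_probs, gamma)
        pi_new = greedy_improvement(P, R, v, gamma)
        if np.array_equal(pi_new, pi):          # policy stable => optimal
            return pi, v
        pi = pi_new
```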
Example: Jack’s Car Rental

I States: Two locations, maximum of 20 cars at each


I Actions: Move up to 5 cars overnight (-$2 each)
I Reward: $10 for each available car rented, γ = 0.9
I Transitions: Cars returned and requested randomly
I Poisson distribution: n returns/requests with probability (λ^n / n!) e^{−λ}
I 1st location: average requests = 3, average returns = 3
I 2nd location: average requests = 4, average returns = 2
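The Poisson probabilities above are straightforward to compute; a small sketch (using the average of 3 requests stated for the 1st location):

```python
from math import exp, factorial

def poisson(n, lam):
    """P(N = n) = (lambda^n / n!) * exp(-lambda)."""
    return (lam**n / factorial(n)) * exp(-lam)

print(poisson(3, 3.0))  # probability of exactly 3 requests at the 1st location
```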
Example: Jack’s Car Rental – Policy Iteration
Policy Iteration

I Does policy evaluation need to converge to v π ?

I Or should we stop when we are ‘close’ ?


(E.g., with a threshold on the change to the values)
I Or simply stop after k iterations of iterative policy evaluation?
I In the small gridworld k = 3 was sufficient to achieve optimal policy

I Extreme: Why not update policy every iteration — i.e. stop after k = 1?
I This is equivalent to value iteration
Value Iteration

I We could take the Bellman optimality equation, and turn that into an update

∀s : vk+1(s) ← max_a E[Rt+1 + γvk(St+1) | St = s, At = a]

I This is equivalent to policy iteration, with k = 1 step of policy evaluation between
each two (greedy) policy improvement steps

Algorithm: Value Iteration

I Initialise v0
I Update: vk+1(s) ← max_a E[Rt+1 + γvk(St+1) | St = s, At = a]
I Stopping: whenever vk+1 (s) = vk (s), for all s, we must have found v ∗
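A sketch of value iteration under the same assumed tabular arrays P[a, s, s'] and R[a, s] used in the earlier sketches, again with a tolerance-based stopping rule.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """v_{k+1}(s) = max_a E[R + gamma * v_k(S') | s, a]."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * np.einsum("ast,t->as", P, v)  # q[a, s]
        v_new = q.max(axis=0)                         # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)            # approximate v*, greedy policy
        v = v_new
```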
Example: Shortest Path
[Figure: a 4×4 shortest-path gridworld with the goal g in the top-left corner. The panels show the problem and the value-iteration estimates V1 through V7; the values converge to minus the number of steps to the goal (0 at the goal, down to −6 in the far corner).]
Synchronous Dynamic Programming Algorithms

Problem    | Bellman Equation                                            | Algorithm
Prediction | Bellman Expectation Equation                                | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + (Greedy) Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                                 | Value Iteration

Observations:
I Algorithms are based on the state-value function vπ(s) or v∗(s) ⇒ complexity
O(|A||S|^2) per iteration, for |A| actions and |S| states
I Could also apply to the action-value function qπ(s, a) or q∗(s, a) ⇒ complexity
O(|A|^2 |S|^2) per iteration
Extensions to Dynamic Programming
Asynchronous Dynamic Programming

I DP methods described so far used synchronous updates (all states in parallel)

I Asynchronous DP
I backs up states individually, in any order
I can significantly reduce computation
I guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:


I In-place dynamic programming
I Prioritised sweeping
I Real-time dynamic programming
In-Place Dynamic Programming

I Synchronous value iteration stores two copies of the value function:

for all s in S :  vnew(s) ← max_a E[Rt+1 + γvold(St+1) | St = s]
vold ← vnew

I In-place value iteration only stores one copy of the value function:

for all s in S :  v(s) ← max_a E[Rt+1 + γv(St+1) | St = s]

(Both sweeps are sketched in code below.)
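A sketch contrasting the two sweeps; backup(s, v) stands for the one-step lookahead max_a E[Rt+1 + γv(St+1) | St = s] and is assumed to be supplied by the model.

```python
import numpy as np

def synchronous_sweep(v, backup):
    """Every state is updated from the same old copy of the values."""
    v_new = np.empty_like(v)
    for s in range(len(v)):
        v_new[s] = backup(s, v)
    return v_new

def in_place_sweep(v, backup):
    """States updated later in the sweep may already see fresh values."""
    for s in range(len(v)):
        v[s] = backup(s, v)
    return v
```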
Prioritised Sweeping

I Use the magnitude of the Bellman error to guide state selection, e.g.

| max_a E[Rt+1 + γv(St+1) | St = s] − v(s) |

I Backup the state with the largest remaining Bellman error


I Update Bellman error of affected states after each backup
I Requires knowledge of reverse dynamics (predecessor states)
I Can be implemented efficiently by maintaining a priority queue
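A sketch of that priority-queue scheme; bellman_error(s, v), backup(s, v) and the predecessors table are assumed to be available from the (known) model, and stale queue entries are tolerated for simplicity.

```python
import heapq

def prioritised_sweeping(states, v, bellman_error, backup, predecessors,
                         n_backups, theta=1e-5):
    # Max-heap via negated priorities: always pop the state with the
    # largest remaining Bellman error.
    queue = [(-bellman_error(s, v), s) for s in states]
    heapq.heapify(queue)
    for _ in range(n_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        v[s] = backup(s, v)
        # Re-prioritise states whose backups depend on s (reverse dynamics).
        for p in predecessors[s]:
            err = bellman_error(p, v)
            if err > theta:
                heapq.heappush(queue, (-err, p))
    return v
```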
Real-Time Dynamic Programming

I Idea: only update states that are relevant to agent


I E.g., if the agent is in state St , update that state value, or states that it expects
to be in soon
Full-Width Backups

I Standard DP uses full-width backups
I For each backup (sync or async)
I Every successor state and action is considered
I Using the true model of transitions and the reward function
I DP is effective for medium-sized problems (millions of states)
I For large problems DP suffers from the curse of dimensionality
I The number of states n = |S| grows exponentially with the number of state variables
I Even one full backup can be too expensive

[Figure: full-width backup diagram, from the root state s with value vk+1(s), through every action a and reward r, to every successor state s' with value vk(s').]
Sample Backups

I In subsequent lectures we will consider sample backups
I Using sample rewards and sample transitions ⟨s, a, r, s'⟩
(instead of the reward function r and the transition dynamics p)
I Advantages:
I Model-free: no advance knowledge of the MDP required
I Breaks the curse of dimensionality through sampling
I Cost of a backup is constant, independent of n = |S|

[Figure: sample backup diagram, a single sampled path s → a → r → s' rather than a full-width backup.]


Summary
What have we covered today?

I Markov Decision Processes


I Objectives in an MDP: different notion of return
I Value functions - expected returns, condition on state (and action)
I Optimality principles in MDPs: optimal value functions and optimal policies
I Bellman Equations
I Two classes of problems in RL: evaluation (prediction) and control
I How to compute vπ (aka solve an evaluation/prediction problem)
I How to compute the optimal value function via dynamic programming:
I Policy Iteration
I Value Iteration
Questions?

The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton

For questions that may arise during this lecture please use Moodle and/or the next
Q&A session.
