Title | Decision Analytics Markov Decision Processes Notes
---|---
Course | Decision Analytics
Institution | The University of Edinburgh
These notes cover Weeks 4, 5 and 6, which study the Markov decision process within Decision Analytics.
Markov Decision Processes

About MDPs

A Markov chain is a dynamic system whose future probabilistic behaviour depends on the present state only. A Markov decision process (MDP) is a dynamic system whose future probabilistic behaviour depends on the present state and the decision taken.
• A Markov chain predicts how a process is going to evolve, but we have no control over that evolution; an MDP adds control through the choice of action.

The basic elements of a Markov decision process are:
1. the state description (cf. Markov chains);
2. the action description;
3. the immediate rewards;
4. the state transitions (unlike Markov chains, these depend on the action taken).
Formulating a Markov decision process model

A finite state, finite action, stationary Markov decision process is characterised by:
1. state description — S = {1, 2, …, M}, the set of possible states;
2. action description — for each i ∈ S, K_i = the set of possible actions (or decisions) in state i;
3. immediate rewards — for each i ∈ S and k ∈ K_i, r_i^k = the reward received when the process is in state i and action k is chosen;
4. state transitions — for each i ∈ S and k ∈ K_i, p_ij^k = the probability that the process makes a transition to state j when it is in state i and action k is chosen;
5. optimality equation — its form depends on the decision criterion and planning horizon chosen:
   • expected total reward over a finite horizon;
   • expected discounted reward over a finite horizon;
   • expected discounted reward over an infinite horizon;
   • long-run average reward.
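As a concrete illustration, the four model ingredients above can be written down as plain data structures. The two-state machine-maintenance MDP below is a hypothetical example (its states, actions, rewards and transition probabilities are illustrative assumptions, not taken from the notes):

```python
# Hypothetical two-state machine-maintenance MDP.
# State 0 = machine running well, state 1 = machine degraded.

actions = {0: ["operate"], 1: ["operate", "repair"]}  # K_i for each state i

# r_i^k: immediate reward for choosing action k in state i
reward = {
    (0, "operate"): 6.0,
    (1, "operate"): 3.0,
    (1, "repair"): -2.0,
}

# p_ij^k: probability of moving to state j from state i under action k
trans = {
    (0, "operate"): {0: 0.8, 1: 0.2},
    (1, "operate"): {1: 1.0},
    (1, "repair"): {0: 1.0},
}

# Sanity check: each (state, action) row must be a probability distribution.
for key, row in trans.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, key
```

The same dictionaries can then be fed to any of the solution methods discussed later in the notes.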
Properties of policies

A policy for a Markov decision process is a rule that prescribes the action to take in each state at each step. The aim is to find a policy that yields:
• the maximum expected total reward;
• the maximum expected discounted reward;
• or the maximum long-run average reward per step.

A stationary policy d is a policy that prescribes a single action d_i whenever the process is in state i. One of the main advantages of an infinite planning horizon is that there is often a stationary optimal policy. With the expected discounted reward criterion, there is always a stationary optimal policy.

Under a stationary policy d the process behaves as a Markov chain with state space S = {1, 2, …, M} and transition matrix whose (i, j) entry is p_ij^{d_i}. Markov chain methods of analysis can then be applied.
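The reduction of an MDP to a Markov chain under a stationary policy can be made concrete: row i of the chain's transition matrix is simply the row p_ij^{d_i} selected by the policy. A minimal sketch on a hypothetical two-state maintenance MDP (all numbers are illustrative assumptions):

```python
# Hypothetical two-state maintenance MDP: p_ij^k for each (state, action).
trans = {
    (0, "operate"): {0: 0.8, 1: 0.2},
    (1, "operate"): {1: 1.0},
    (1, "repair"): {0: 1.0},
}
states = [0, 1]

# A stationary policy d: one fixed action d_i per state i.
policy = {0: "operate", 1: "repair"}

# Transition matrix of the induced Markov chain: row i is p_ij^{d_i}.
P = [[trans[(i, policy[i])].get(j, 0.0) for j in states] for i in states]
# P == [[0.8, 0.2], [1.0, 0.0]]
```

Standard Markov chain analysis (stationary distributions, hitting times, and so on) can now be applied to P.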
Infinite planning horizons

In general, expected total reward is not suitable as a decision criterion over an infinite horizon:
• for example, suppose r_i^k > 0 for all i and k;
• regardless of the actions chosen, the expected total reward → ∞ as the number of steps → ∞;
• all policies (good or bad) then appear the same.

There are two ways to overcome this:
1. long-run average reward per step;
2. expected discounted reward.
Expected total reward over a finite horizon

The problem is to maximise the expected total reward over N steps. Let v_i^n = the maximum expected total reward over n steps when the process is in state i initially.

Optimality equation:

v_i^n = max_{k ∈ K_i} [ r_i^k + Σ_j p_ij^k v_j^{n−1} ]

Choose suitable values for v^0 and use the optimality equation to find v^1, v^2, …, v^N. Optimal actions can be different in the same state at different steps. Note that v^0 need not always be 0.
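The backward recursion is straightforward to implement. The sketch below runs it for N = 3 on a hypothetical two-state machine-maintenance MDP (all numbers are illustrative assumptions, not from the notes); with these numbers, state 1's optimal action changes between steps, illustrating the remark above:

```python
# Backward induction for the N-step expected total reward.
# Hypothetical two-state maintenance MDP (illustrative numbers).
actions = {0: ["operate"], 1: ["operate", "repair"]}
reward = {(0, "operate"): 6.0, (1, "operate"): 3.0, (1, "repair"): -2.0}
trans = {
    (0, "operate"): {0: 0.8, 1: 0.2},
    (1, "operate"): {1: 1.0},
    (1, "repair"): {0: 1.0},
}
states = [0, 1]

v = {i: 0.0 for i in states}  # v_i^0 = 0 (terminal values)
N = 3
plan = {}                     # plan[n][i] = optimal action at step n
for n in range(1, N + 1):
    new_v, best = {}, {}
    for i in states:
        # Optimality equation: v_i^n = max_k [ r_i^k + sum_j p_ij^k v_j^{n-1} ]
        vals = {k: reward[(i, k)] + sum(p * v[j] for j, p in trans[(i, k)].items())
                for k in actions[i]}
        best[i] = max(vals, key=vals.get)
        new_v[i] = vals[best[i]]
    v, plan[n] = new_v, best
# With these numbers state 1's optimal action switches from "operate"
# (with 2 steps to go) to "repair" (with 3 steps to go).
```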
Expected discounted reward over a finite horizon

Rewards are discounted at a rate β per step, 0 < β < 1. β is called the discount factor: the present value of £1 received at the next step is £β, at the step after that £β², and so on. Let v_i^n = the maximum expected discounted reward over n steps when the process is in state i initially. The aim is to maximise the expected discounted reward over N steps.

Optimality equation:

v_i^n = max_{k ∈ K_i} [ r_i^k + β Σ_j p_ij^k v_j^{n−1} ]

Choose suitable values for v^0 and use the optimality equation to find v^1, v^2, …, v^N.
Long-run average reward per step

Let v_i^n = the maximum expected total reward over n steps when the process is in state i initially. The maximum long-run average reward per step when the process is in state i initially is

lim_{n → ∞} v_i^n / n.
In many cases the maximum long-run average reward per step is independent of the starting state. With this criterion all emphasis is on the distant future — early rewards do not influence policy.
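The limit v_i^n / n can be approximated numerically by running the total-reward recursion for a large n. A sketch on a hypothetical two-state maintenance MDP (all numbers are illustrative assumptions); both starting states give nearly the same average, in line with the remark above:

```python
# Approximate the long-run average reward per step as v_i^n / n for large n.
# Hypothetical two-state maintenance MDP (illustrative numbers).
actions = {0: ["operate"], 1: ["operate", "repair"]}
reward = {(0, "operate"): 6.0, (1, "operate"): 3.0, (1, "repair"): -2.0}
trans = {(0, "operate"): {0: 0.8, 1: 0.2},
         (1, "operate"): {1: 1.0},
         (1, "repair"): {0: 1.0}}
states = [0, 1]

v = {i: 0.0 for i in states}
n_steps = 10000
for _ in range(n_steps):
    # v_i^n = max_k [ r_i^k + sum_j p_ij^k v_j^{n-1} ]
    v = {i: max(reward[(i, k)] + sum(p * v[j] for j, p in trans[(i, k)].items())
                for k in actions[i])
         for i in states}

gain = {i: v[i] / n_steps for i in states}
# gain[0] and gain[1] agree to several decimal places: the long-run average
# reward per step is (almost) independent of the starting state here.
```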
Expected discounted reward over an infinite horizon

Let v_i = the maximum expected discounted reward over an infinite horizon when the process is in state i initially. If the immediate rewards are bounded, so that −C ≤ r_i^k ≤ C for all i and k, then:
• the maximum infinite-horizon expected discounted reward is bounded: −C/(1 − β) ≤ v_i ≤ C/(1 − β);
• the n-step values converge to it: v_i^n → v_i as n → ∞.

The optimality equation becomes:

v_i = max_{k ∈ K_i} [ r_i^k + β Σ_j p_ij^k v_j ]

With this criterion, early rewards are the most important ones.
Approximations to an optimal policy

Let v_i(d) be the expected discounted reward over an infinite horizon using the stationary policy d and starting in state i. Let v_i* be the maximum expected discounted reward over an infinite horizon when the process is in state i initially. d is called an ε-optimal policy if |v_i* − v_i(d)| < ε for all i ∈ S.
Solving infinite-horizon discounted MDPs

An infinite-horizon discounted MDP requires the solution of the following optimality equation:

v_i* = max_{k ∈ K_i} [ r_i^k + β Σ_j p_ij^k v_j* ]

Two solution methods are available:
1. value iteration — solve the n-step problem for larger and larger n;
2. policy iteration — solve for a particular policy, try to improve the policy, and repeat.

Both methods find:
• the unique optimal values v_i* for all i ∈ S (or good approximations to them);
• and a stationary optimal policy d (or a good approximation to one).
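Method 2 (policy iteration) can be sketched as follows, on a hypothetical two-state maintenance MDP with discount factor β = 0.9 (all numbers are illustrative assumptions). Policy evaluation is done here by fixed-point iteration rather than by solving the linear system directly:

```python
# Policy iteration sketch. Hypothetical two-state maintenance MDP.
actions = {0: ["operate"], 1: ["operate", "repair"]}
reward = {(0, "operate"): 6.0, (1, "operate"): 3.0, (1, "repair"): -2.0}
trans = {(0, "operate"): {0: 0.8, 1: 0.2},
         (1, "operate"): {1: 1.0},
         (1, "repair"): {0: 1.0}}
states, beta = [0, 1], 0.9

def evaluate(policy, tol=1e-10):
    """Fixed-point evaluation of v_i(d) = r_i^{d_i} + beta * sum_j p_ij^{d_i} v_j(d)."""
    v = {i: 0.0 for i in states}
    while True:
        new = {i: reward[(i, policy[i])]
                  + beta * sum(p * v[j] for j, p in trans[(i, policy[i])].items())
               for i in states}
        if max(abs(new[i] - v[i]) for i in states) < tol:
            return new
        v = new

policy = {0: "operate", 1: "operate"}       # arbitrary starting policy
while True:
    v = evaluate(policy)                    # 1. policy evaluation
    improved = {i: max(actions[i],          # 2. policy improvement (greedy)
                       key=lambda k: reward[(i, k)]
                       + beta * sum(p * v[j] for j, p in trans[(i, k)].items()))
                for i in states}
    if improved == policy:                  # 3. stop when the policy is stable
        break
    policy = improved
```

On termination, `policy` is a stationary optimal policy and `v` holds its (near-exact) expected discounted rewards.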
Value iteration

The value iteration algorithm is as follows.
1. Initialisation: let v_i^0 = 0 for all i ∈ S and let n = 1.
2. Iteration: calculate

   v_i^n = max_{k ∈ K_i} [ r_i^k + β Σ_j p_ij^k v_j^{n−1} ]   for all i ∈ S,

   and let d_i^n = the action that maximises the right-hand side for v_i^n, for all i ∈ S.
3. Stopping criterion: if the stopping criterion is not satisfied, let n = n + 1 and repeat from step 2.

As n → ∞, v_i^n → v_i* and d^n → a stationary optimal policy.
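The three steps above can be sketched directly, on a hypothetical two-state maintenance MDP with β = 0.9; the stopping criterion used here (successive value functions within a small tolerance) is one common choice, and all numbers are illustrative assumptions:

```python
# Value iteration sketch for the infinite-horizon discounted problem.
# Hypothetical two-state maintenance MDP (illustrative numbers).
actions = {0: ["operate"], 1: ["operate", "repair"]}
reward = {(0, "operate"): 6.0, (1, "operate"): 3.0, (1, "repair"): -2.0}
trans = {(0, "operate"): {0: 0.8, 1: 0.2},
         (1, "operate"): {1: 1.0},
         (1, "repair"): {0: 1.0}}
states, beta, tol = [0, 1], 0.9, 1e-8

v = {i: 0.0 for i in states}  # step 1: v_i^0 = 0
while True:
    # step 2: v_i^n = max_k [ r_i^k + beta * sum_j p_ij^k v_j^{n-1} ]
    vals = {i: {k: reward[(i, k)]
                   + beta * sum(p * v[j] for j, p in trans[(i, k)].items())
                for k in actions[i]}
            for i in states}
    policy = {i: max(vals[i], key=vals[i].get) for i in states}  # d_i^n
    new_v = {i: vals[i][policy[i]] for i in states}
    # step 3: stop once successive value functions are numerically close
    if max(abs(new_v[i] - v[i]) for i in states) < tol:
        v = new_v
        break
    v = new_v
```

The final `v` approximates v_i* and `policy` is the corresponding (stationary) greedy policy; for this example it agrees with what policy iteration finds.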