Markov Decision Processes#

MDP Definition#

Tuple $(S, A, P, R, \gamma)$:

  • $S$: state space
  • $A$: action space
  • $P(s' \mid s, a)$: transition dynamics
  • $R(s, a, s')$: reward function
  • $\gamma \in [0,1)$: discount factor

Markov property: the future depends only on the current state, not on the history of states and actions.
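
A small tabular MDP can be sketched directly as NumPy arrays; the specific numbers below are illustrative, not from any particular environment:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all values are illustrative.
n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] = transition probability; each (s, a) row sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # from state 0
    [[0.5, 0.5], [0.0, 1.0]],   # from state 1
])

# R[s, a, s'] = reward for the transition (s, a) -> s'.
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0  # reward 1 for landing in state 1

assert np.allclose(P.sum(axis=2), 1.0)  # valid transition kernel
```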

Value Functions#

State value: expected discounted return from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s\right]$$

Action value (Q-function): expected return from state $s$, taking action $a$, then following $\pi$:

$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s,\, a_0 = a\right]$$

Relationship: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$
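
This relationship is a one-line weighted sum in code (the Q-values and policy below are made up for illustration):

```python
import numpy as np

# Hypothetical Q-values and policy for a 2-state, 2-action MDP.
Q = np.array([[1.0, 2.0],
              [0.5, 3.0]])
# pi[s, a] = probability of taking action a in state s.
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a)
V = (pi * Q).sum(axis=1)
# V = [1.5, 2.5]
```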

Bellman Equations#

Bellman expectation:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + \gamma V^\pi(s')]$$
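
Iterating this backup to a fixed point is iterative policy evaluation; a minimal sketch, assuming tabular arrays `P[s, a, s']`, `R[s, a, s']`, and `pi[s, a]` as above:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate the Bellman expectation backup until V converges.

    P[s, a, s']: transition probabilities; R[s, a, s']: rewards;
    pi[s, a]: policy probabilities.
    """
    V = np.zeros(P.shape[0])
    while True:
        # V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
        V_new = np.einsum("sa,sat,sat->s", pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The backup is a $\gamma$-contraction, so the loop converges to $V^\pi$ from any initialization.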

Bellman optimality:

$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + \gamma V^*(s')]$$

$$Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right]$$
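
Iterating the optimality backup gives value iteration; a sketch under the same tabular `P`/`R` conventions as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until V converges.

    Returns (V*, greedy policy) for tabular P[s, a, s'], R[s, a, s'].
    """
    V = np.zeros(P.shape[0])
    while True:
        # Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
        Q = np.einsum("sat,sat->sa", P, R + gamma * V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```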

Policy#

Deterministic: $\pi(s) = a$

Stochastic: $\pi(a \mid s) = P(\text{action}=a \mid \text{state}=s)$

Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$

Temporal Difference Learning#

Update toward bootstrapped target:

$$V(s_t) \leftarrow V(s_t) + \alpha\,[r_t + \gamma V(s_{t+1}) - V(s_t)]$$

TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
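
One episode of TD(0) prediction can be sketched as follows; `env_step(s, a) -> (r, s_next, done)` and `policy(s) -> a` are hypothetical interfaces, not from any particular library:

```python
def td0_episode(env_step, policy, V, alpha=0.1, gamma=0.9, s0=0):
    """Run one episode of TD(0), updating the value table V in place.

    env_step(s, a) -> (r, s_next, done) and policy(s) -> a are
    assumed callables supplied by the caller.
    """
    s, done = s0, False
    while not done:
        a = policy(s)
        r, s_next, done = env_step(s, a)
        target = r + (0.0 if done else gamma * V[s_next])
        delta = target - V[s]        # TD error: delta_t
        V[s] += alpha * delta        # update toward bootstrapped target
        s = s_next
    return V
```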

Q-Learning (Off-Policy)#

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\!\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

In the tabular setting, converges to $Q^*$ given sufficient exploration (every state–action pair visited infinitely often) and appropriately decaying step sizes.
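
A tabular sketch with $\varepsilon$-greedy behavior; as with TD(0) above, `env_step(s, a) -> (r, s_next, done)` is an assumed interface:

```python
import random

def q_learning(env_step, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, eps=0.1, s0=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy.

    Off-policy: the target uses max_a' Q(s', a') regardless of the
    action the behavior policy actually takes next.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = s0, False
        while not done:
            if random.random() < eps:
                a = random.randrange(n_actions)          # explore
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
            r, s_next, done = env_step(s, a)
            best_next = 0.0 if done else max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q
```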

Exploration#

  • $\varepsilon$-greedy: with prob $\varepsilon$ take random action, else greedy
  • UCB: upper confidence bound — optimism in face of uncertainty
  • Boltzmann: $\pi(a \mid s) \propto \exp(Q(s,a)/\tau)$
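
The $\varepsilon$-greedy and Boltzmann rules are each a few lines; a minimal sketch:

```python
import math
import random

def eps_greedy(q_values, eps=0.1):
    """With probability eps take a random action, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q/tau).

    Subtracting the max before exponentiating avoids overflow;
    it cancels in the normalization.
    """
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]
```

As $\tau \to 0$ Boltzmann approaches greedy action selection; as $\tau \to \infty$ it approaches uniform random.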