# Markov Decision Processes
## MDP Definition
Tuple $(S, A, P, R, \gamma)$:
- $S$: state space
- $A$: action space
- $P(s' \mid s, a)$: transition dynamics
- $R(s, a, s')$: reward function
- $\gamma \in [0,1)$: discount factor
Markov property: the future depends only on the current state, not on the full history.
## Value Functions
State value: expected discounted return from state $s$ under policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_t \gamma^t r_t \;\middle|\; s_0 = s\right]$$
Action value (Q-function): expected return from state $s$, taking action $a$, then following $\pi$:
$$Q^\pi(s,a) = \mathbb{E}_\pi\!\left[\sum_t \gamma^t r_t \;\middle|\; s_0 = s,\, a_0 = a\right]$$
Relationship: $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$
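A quick numerical check of this identity, using made-up values for a single state with three actions (the numbers are hypothetical, not from the notes):

```python
import numpy as np

# Hypothetical stochastic policy pi(a|s) and action values Q^pi(s, a) for one state.
pi = np.array([0.2, 0.5, 0.3])  # probabilities over 3 actions, sums to 1
q = np.array([1.0, 2.0, 0.5])   # Q^pi(s, a) for each action

# V^pi(s) = sum_a pi(a|s) Q^pi(s, a) -- a probability-weighted average of Q-values.
v = float(pi @ q)  # 0.2*1.0 + 0.5*2.0 + 0.3*0.5 = 1.35
```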
## Bellman Equations
Bellman expectation:
$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + \gamma V^\pi(s')]$$
Bellman optimality:
$$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\,[R(s,a,s') + \gamma V^*(s')]$$
$$Q^*(s,a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right]$$
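Iterating the Bellman optimality backup gives value iteration. A minimal sketch on a hypothetical two-state, two-action MDP (the arrays `P`, `R`, and `gamma` below are made-up numbers for illustration):

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] = transition probs, R[s, a, s'] = rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):
    # Bellman backup: Q[s, a] = sum_{s'} P(s'|s,a) [R(s,a,s') + gamma V(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)
    V_new = Q.max(axis=1)          # V(s) = max_a Q(s, a)
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new

pi = Q.argmax(axis=1)  # greedy policy w.r.t. the converged Q
```

At the fixed point, `V` satisfies the optimality equation, so the Bellman residual is (numerically) zero.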
## Policy
Deterministic: $\pi(s) = a$
Stochastic: $\pi(a \mid s) = P(\text{action}=a \mid \text{state}=s)$
Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s, a)$
## Temporal Difference Learning
Update toward bootstrapped target:
$$V(s_t) \leftarrow V(s_t) + \alpha\,[r_t + \gamma V(s_{t+1}) - V(s_t)]$$
TD error: $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
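The update above can be sketched as TD(0) on a hypothetical three-state chain (my own toy example: $s_0 \to s_1 \to s_2$ terminal, reward 1 on the final transition, so the true values are $V(s_1)=1$ and $V(s_0)=\gamma \cdot 1 = 0.9$):

```python
# Hypothetical deterministic chain: 0 -> 1 -> 2 (terminal), reward 1 on the last step.
gamma, alpha = 0.9, 0.1
V = {0: 0.0, 1: 0.0, 2: 0.0}  # terminal state keeps value 0

for _ in range(5000):  # repeat episodes until the estimates settle
    s = 0
    while s != 2:
        s_next = s + 1
        r = 1.0 if s_next == 2 else 0.0
        delta = r + gamma * V[s_next] - V[s]  # TD error delta_t
        V[s] += alpha * delta                 # move V(s_t) toward the bootstrapped target
        s = s_next
```

Because the target bootstraps from `V[s_next]`, the reward information propagates backward one state per visit.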
## Q-Learning (Off-Policy)
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\!\left[r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$
Converges to $Q^*$ in the tabular setting, given sufficient exploration and appropriately decaying step sizes.
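A minimal tabular sketch, on a hypothetical 5-state chain of my own (actions 0 = left, 1 = right; reward 1 for reaching the rightmost state, which is terminal):

```python
import random

N, gamma, alpha, eps = 5, 0.9, 0.5, 0.1
Q = [[0.0, 0.0] for _ in range(N)]  # Q[s][a], terminal state stays 0

def step(s, a):
    """Deterministic chain dynamics; returns (next_state, reward, done)."""
    s2 = max(0, s - 1) if a == 0 else min(N - 1, s + 1)
    reached_goal = (s2 == N - 1)
    return s2, (1.0 if reached_goal else 0.0), reached_goal

random.seed(0)
for _ in range(500):  # episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy behavior policy (ties broken toward action 1)
        a = random.randrange(2) if random.random() < eps else int(Q[s][1] >= Q[s][0])
        s2, r, done = step(s, a)
        # off-policy target: max over next actions, regardless of what is taken next
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
```

The learned greedy policy is "right" in every non-terminal state, and `Q[s][1]` approaches $\gamma^{3-s}$.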
## Exploration
- $\varepsilon$-greedy: with probability $\varepsilon$ take a random action, else act greedily
- UCB: upper confidence bound — optimism in face of uncertainty
- Boltzmann: $\pi(a \mid s) \propto \exp(Q(s,a)/\tau)$
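The first and third rules above can be sketched as simple action-selection functions (the function names and Q-value lists are hypothetical):

```python
import math
import random

def epsilon_greedy(q_values, eps):
    # With probability eps, take a uniformly random action; otherwise exploit the argmax.
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, tau):
    # Sample a ~ exp(Q(s,a)/tau); subtracting the max first keeps exp() from overflowing.
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights, k=1)[0]
```

As the temperature `tau` shrinks, Boltzmann selection approaches greedy; as it grows, it approaches uniform random.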