# Deep RL Algorithms
## DQN (Deep Q-Network)
Q-learning with a neural-network function approximator; minimize the TD loss:
$$L = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q_\text{target}(s', a') - Q(s, a)\right)^2\right]$$
Key stabilization tricks:
- Experience replay: replay buffer breaks correlation between consecutive samples
- Target network: separate $Q_\text{target}$ updated slowly (every $N$ steps) — prevents chasing a moving target
Double DQN: use online network to select action, target network to evaluate:
$$\text{target} = r + \gamma\, Q_\text{target}\!\left(s',\, \arg\max_{a'} Q_\text{online}(s', a')\right)$$
Dueling DQN: split the network into value and advantage streams, $Q(s,a) = V(s) + A(s,a) - \overline{A}(s)$, where $\overline{A}(s)$ is the advantage averaged over actions (subtracting the mean makes the decomposition identifiable)
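The Double DQN target above can be sketched in a few lines of numpy. This is an illustrative helper (the function name and the array-batch convention are mine, not from any library), assuming the online and target networks have already produced Q-value tables of shape `(batch, n_actions)` for the next states:

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN regression targets for a batch of transitions.

    next_q_online / next_q_target: arrays of shape (batch, n_actions)
    holding Q(s', .) from the online and target networks.
    """
    # Online network *selects* the greedy next action...
    best_actions = np.argmax(next_q_online, axis=1)
    # ...target network *evaluates* it; decoupling the two curbs overestimation.
    next_values = next_q_target[np.arange(len(rewards)), best_actions]
    # No bootstrapping past terminal states.
    return rewards + gamma * (1.0 - dones) * next_values
```

Vanilla DQN would instead take `next_q_target.max(axis=1)`, letting the same network both pick and score the action, which is where the overestimation bias comes from.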
## SAC (Soft Actor-Critic)
Off-policy actor-critic with entropy maximization:
$$J(\pi) = \mathbb{E}\!\left[\sum_t \gamma^t\bigl(r(s_t, a_t) + \alpha\, H(\pi(\cdot \mid s_t))\bigr)\right]$$
- Automatic temperature $\alpha$ tuning
- Continuous action spaces
- Sample efficient, stable
- Standard for robotics/continuous control
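The entropy term shows up concretely in SAC's critic target: the value of the next state is the twin-critic minimum minus $\alpha \log \pi$. A minimal numpy sketch, assuming the next actions have already been sampled from the policy (function name and argument layout are illustrative):

```python
import numpy as np

def sac_critic_targets(rewards, q1_next, q2_next, logp_next, dones,
                       gamma=0.99, alpha=0.2):
    """Soft Bellman backup for SAC's critics.

    q1_next, q2_next: twin target-critic values Q(s', a') for a' ~ pi(.|s')
    logp_next: log pi(a'|s') for those sampled actions
    """
    # Clipped double-Q (shared with TD3) to curb overestimation.
    min_q = np.minimum(q1_next, q2_next)
    # Subtracting alpha * log pi adds the entropy bonus in expectation.
    soft_value = min_q - alpha * logp_next
    return rewards + gamma * (1.0 - dones) * soft_value
```

With automatic temperature tuning, `alpha` would itself be updated by gradient descent toward a target entropy rather than fixed as here.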
## TD3 (Twin Delayed DDPG)
Addresses Q-value overestimation in DDPG:
- Twin critics: take minimum of two Q estimates
- Delayed policy updates: update policy less frequently than critic
- Target policy smoothing: add noise to target actions
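Two of the three tricks appear directly in the target computation (delayed policy updates live in the training loop instead). A numpy sketch under stated assumptions: `q1_fn`/`q2_fn` are stand-ins for the target critics, and the action bound is a symmetric `[-act_limit, act_limit]` box:

```python
import numpy as np

def td3_targets(rewards, next_actions, q1_fn, q2_fn, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5,
                act_limit=1.0, rng=None):
    """TD3 critic targets: target policy smoothing + twin-critic minimum.

    q1_fn / q2_fn: callables mapping a batch of next actions to Q(s', a').
    """
    rng = rng or np.random.default_rng(0)
    # Target policy smoothing: clipped Gaussian noise on the target action,
    # so the critic can't exploit sharp peaks in the Q landscape.
    noise = np.clip(rng.normal(0.0, noise_std, next_actions.shape),
                    -noise_clip, noise_clip)
    smoothed = np.clip(next_actions + noise, -act_limit, act_limit)
    # Twin critics: the minimum of two estimates bounds overestimation.
    next_q = np.minimum(q1_fn(smoothed), q2_fn(smoothed))
    return rewards + gamma * (1.0 - dones) * next_q
```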
## Algorithm Summary
| Algorithm | Action space | On/off-policy | Notes |
|---|---|---|---|
| DQN | discrete | off-policy | Atari games |
| PPO | both | on-policy | stable, general purpose |
| SAC | continuous | off-policy | robotics, efficient |
| TD3 | continuous | off-policy | continuous control |
| DDPG | continuous | off-policy | predecessor to SAC/TD3 |
| A3C/A2C | both | on-policy | parallel envs |
## Model-Based RL
Learn world model $P(s' \mid s, a)$, plan with it.
- Dyna: learn model, generate synthetic experience
- MuZero: learn model + value/policy together (AlphaZero without rules)
- Dreamer: learn latent dynamics model, optimize in imagination
More sample-efficient than model-free methods, but learning a model accurate enough to plan with is hard.
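The Dyna loop is simple enough to sketch in tabular form. A minimal single-step update, assuming a deterministic environment and dict-backed Q-table and model (all names here are illustrative, not from a library):

```python
import random

def dyna_q_update(Q, model, s, a, r, s2, actions,
                  alpha=0.5, gamma=0.9, n_planning=5, rng=None):
    """One Dyna-Q step: direct RL update, model learning, then planning."""
    rng = rng or random.Random(0)

    def td_update(s, a, r, s2):
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))

    # 1) Direct RL: ordinary Q-learning on the real transition.
    td_update(s, a, r, s2)
    # 2) Model learning: store the observed outcome (deterministic world assumed).
    model[(s, a)] = (r, s2)
    # 3) Planning: replay simulated transitions sampled from the learned model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = rng.choice(sorted(model.items()))
        td_update(ps, pa, pr, ps2)
```

Each real step thus buys `n_planning` extra "free" updates from the model, which is where the sample-efficiency gain comes from.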
## Reward Shaping
Add auxiliary rewards to guide learning:
- Potential-based: $F(s, s') = \gamma \Phi(s') - \Phi(s)$ — doesn't change optimal policy
- Curiosity (ICM): reward for prediction error of next state — intrinsic motivation
- HER (Hindsight Experience Replay): relabel failed trajectories with achieved goal as target
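HER's relabeling step can be sketched in a few lines. This shows the "final" relabeling strategy under assumed conventions (transitions as `(s, a, s_next, goal)` tuples, the final next-state standing in for the achieved goal); the function and its signature are illustrative:

```python
def her_relabel(transitions, reward_fn):
    """Relabel one failed episode with the goal it actually achieved.

    transitions: list of (s, a, s_next, goal) tuples from a single episode.
    reward_fn(s_next, goal): task reward, e.g. 0.0 on success else -1.0.
    """
    # Pretend the state reached at the end was the goal all along.
    new_goal = transitions[-1][2]
    return [(s, a, s2, new_goal, reward_fn(s2, new_goal))
            for (s, a, s2, _old_goal) in transitions]
```

A trajectory that earned only failure rewards under the original goal now contains at least one success under the relabeled goal, giving the agent a learning signal even in sparse-reward tasks.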