Deep RL Algorithms#

DQN (Deep Q-Network)#

Q-learning with a neural network as the Q-function approximator; minimize the squared TD error:

$$L = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q_\text{target}(s', a') - Q(s, a)\right)^2\right]$$

Key stabilization tricks:

  • Experience replay: replay buffer breaks correlation between consecutive samples
  • Target network: separate $Q_\text{target}$ updated slowly (every $N$ steps) — prevents chasing a moving target
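Both tricks can be shown in a minimal tabular sketch (a table stands in for the network, and the transitions are random placeholders just to populate the buffer; all sizes and rates are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma = 5, 2, 0.99
Q = rng.normal(size=(n_states, n_actions))
Q_target = Q.copy()            # separate, slowly-updated copy of Q

buffer = []                    # experience replay: (s, a, r, s', done)

def dqn_update(lr=0.1, batch_size=4, step=0, sync_every=100):
    # Sampling i.i.d. from the buffer breaks correlation between
    # consecutive transitions
    idx = rng.integers(len(buffer), size=batch_size)
    for i in idx:
        s, a, r, s2, done = buffer[i]
        # Bootstrap from the frozen target network, not the online one
        target = r + (0.0 if done else gamma * Q_target[s2].max())
        Q[s, a] += lr * (target - Q[s, a])   # step on squared TD error
    if step % sync_every == 0:
        Q_target[:] = Q                      # periodic hard sync

for t in range(200):
    s, a = rng.integers(n_states), rng.integers(n_actions)
    buffer.append((s, a, float(rng.normal()), rng.integers(n_states), False))
    if len(buffer) >= 4:
        dqn_update(step=t)
```

In practice the hard sync is often replaced by a Polyak (soft) update of the target weights.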

Double DQN: use online network to select action, target network to evaluate:

$$\text{target} = r + \gamma\, Q_\text{target}\!\left(s', \arg\max_{a'} Q_\text{online}(s', a')\right)$$
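The select/evaluate split is a two-liner in practice (hypothetical Q-values for a batch of two next states):

```python
import numpy as np

q_online_next = np.array([[1.0, 3.0], [2.0, 0.5]])   # Q_online(s', .)
q_target_next = np.array([[0.9, 2.0], [1.5, 0.4]])   # Q_target(s', .)
r, gamma = np.array([1.0, 0.0]), 0.99

a_star = q_online_next.argmax(axis=1)          # online net SELECTS the action
q_eval = q_target_next[np.arange(2), a_star]   # target net EVALUATES it
target = r + gamma * q_eval
```

Decoupling selection from evaluation removes the upward bias that comes from taking a max over noisy estimates.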

Dueling DQN: $Q(s,a) = V(s) + A(s,a) - \overline{A}(s)$, where $\overline{A}(s)$ is the mean advantage over actions (subtracting it makes the $V$/$A$ decomposition identifiable)
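The dueling aggregation, taking $\overline{A}(s)$ as the mean advantage (made-up head outputs for a batch of two states, three actions):

```python
import numpy as np

V = np.array([[1.0], [0.5]])                        # V(s), shape (batch, 1)
A = np.array([[2.0, 0.0, 1.0], [1.0, 1.0, -2.0]])   # A(s, a)

# Subtracting the per-state mean advantage pins down the decomposition
Q = V + A - A.mean(axis=1, keepdims=True)
```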

SAC (Soft Actor-Critic)#

Off-policy actor-critic with entropy maximization:

$$J(\pi) = \mathbb{E}\left[\sum_t \gamma^t\bigl(r(s_t, a_t) + \alpha\, H(\pi(\cdot \mid s_t))\bigr)\right]$$

  • Automatic temperature $\alpha$ tuning
  • Continuous action spaces
  • Sample efficient, stable
  • Standard for robotics/continuous control
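The automatic temperature tuning amounts to gradient descent on $\alpha$ so that policy entropy tracks a target (commonly $-\dim(\mathcal{A})$). A sketch with hypothetical sampled log-probs; the learning rate and values are made up:

```python
import numpy as np

target_entropy = -2.0                     # heuristic: -(action dimension)
log_alpha = 0.0                           # optimize log(alpha) so alpha stays positive

log_probs = np.array([-1.1, -0.8, -1.5])  # hypothetical log pi(a|s) samples
# Loss L(alpha) = E[-alpha * (log pi + target_entropy)];
# gradient w.r.t. log_alpha, up to a positive alpha factor:
grad = np.mean(-(log_probs + target_entropy))
log_alpha -= 1e-2 * grad                  # entropy above target -> alpha shrinks
alpha = np.exp(log_alpha)
```

Intuition: if the policy is more stochastic than the target entropy requires, the entropy bonus is weighted down, and vice versa.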

TD3 (Twin Delayed DDPG)#

Addresses Q-value overestimation (from the max over noisy estimates) with three tricks:

  • Twin critics: take minimum of two Q estimates
  • Delayed policy updates: update policy less frequently than critic
  • Target policy smoothing: add noise to target actions
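The twin-minimum and smoothing tricks both live in the target computation. A sketch with toy critic/actor functions standing in for the networks (all shapes and constants are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def q1(s, a): return -(a - 0.3) ** 2 + s      # twin critic 1
def q2(s, a): return -(a - 0.5) ** 2 + s      # twin critic 2
def target_policy(s): return 0.4 * np.ones_like(s)

s2 = np.array([1.0, 2.0])                     # batch of next states
r, gamma = np.array([0.0, 1.0]), 0.99

# Target policy smoothing: clipped noise on the target action
noise = np.clip(rng.normal(0.0, 0.2, size=s2.shape), -0.5, 0.5)
a2 = np.clip(target_policy(s2) + noise, -1.0, 1.0)

# Clipped double-Q: take the minimum of the twin critics
y = r + gamma * np.minimum(q1(s2, a2), q2(s2, a2))
# (delayed updates: the actor and target nets would be updated
#  only every d critic steps)
```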

Algorithm Summary#

| Algorithm | Action space | Type       | Notes                   |
|-----------|--------------|------------|-------------------------|
| DQN       | discrete     | off-policy | Atari games             |
| PPO       | both         | on-policy  | stable, general purpose |
| SAC       | continuous   | off-policy | robotics, efficient     |
| TD3       | continuous   | off-policy | continuous control      |
| DDPG      | continuous   | off-policy | predecessor to SAC/TD3  |
| A3C/A2C   | both         | on-policy  | parallel envs           |

Model-Based RL#

Learn a world model $P(s' \mid s, a)$, then plan with it.

  • Dyna: learn model, generate synthetic experience
  • MuZero: learn model + value/policy together (AlphaZero without rules)
  • Dreamer: learn latent dynamics model, optimize in imagination

More sample-efficient than model-free RL, but accurate models are hard to learn, and model errors compound when planning over many steps.
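The Dyna idea fits in a short tabular sketch: real steps update Q directly, and a learned (here deterministic, made-up) model replays synthetic transitions for extra planning updates. The toy chain environment and all constants are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, gamma, lr = 5, 2, 0.95, 0.5
Q = np.zeros((n_states, n_actions))
model = {}                               # (s, a) -> (r, s')

def env_step(s, a):                      # toy chain: a=1 moves right
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == n_states - 1 else 0.0), s2

s = 0
for _ in range(200):
    a = int(rng.integers(n_actions))
    r, s2 = env_step(s, a)
    Q[s, a] += lr * (r + gamma * Q[s2].max() - Q[s, a])   # direct RL
    model[(s, a)] = (r, s2)                               # model learning
    for _ in range(10):                                   # planning steps
        (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
        Q[ps, pa] += lr * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = s2 if s2 != n_states - 1 else 0                   # reset at goal
```

The planning loop is where the sample-efficiency gain comes from: each real transition is amortized over many synthetic updates.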

Reward Shaping#

Add auxiliary rewards to guide learning:

  • Potential-based: $F(s, s') = \gamma \Phi(s') - \Phi(s)$ — doesn't change optimal policy
  • Curiosity (ICM): reward for prediction error of next state — intrinsic motivation
  • HER (Hindsight Experience Replay): relabel failed trajectories with achieved goal as target
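Why potential-based shaping preserves the optimal policy: along any trajectory the shaping terms telescope, so the discounted return shifts only by terms depending on the endpoints, not on the actions chosen. A numeric check with a hypothetical potential:

```python
import numpy as np

gamma = 0.9
phi = np.array([0.0, 1.0, 2.0, 5.0])      # hypothetical potential per state

def shaped_return(states, rewards):
    G = 0.0
    for t, (s, r) in enumerate(zip(states[:-1], rewards)):
        F = gamma * phi[states[t + 1]] - phi[s]   # F(s, s')
        G += gamma ** t * (r + F)
    return G

states, rewards = [0, 1, 2, 3], [0.0, 0.0, 1.0]
plain = sum(gamma ** t * r for t, r in enumerate(rewards))
# Telescoping: shaped return = plain return + gamma^T * Phi(s_T) - Phi(s_0)
shaped = shaped_return(states, rewards)
```

Since the correction $\gamma^T \Phi(s_T) - \Phi(s_0)$ is the same for every policy reaching the same endpoints, the ordering of policies (and hence the optimum) is unchanged.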