# Deep RL Algorithms
## DQN (Deep Q-Network)
Q-learning with a neural-network function approximator; minimize the TD loss:
$$L = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q_\text{target}(s', a') - Q(s, a)\right)^2\right]$$
Key stabilization tricks:
- Experience replay: replay buffer breaks correlation between consecutive samples
- Target network: separate $Q_\text{target}$ updated slowly (every $N$ steps) — prevents chasing a moving target
Double DQN: use online network to select action, target network to evaluate:
$$\text{target} = r + \gamma\, Q_\text{target}\!\left(s',\, \arg\max_{a'} Q_\text{online}(s', a')\right)$$
Dueling DQN: split the network into value and advantage streams, $Q(s,a) = V(s) + A(s,a) - \overline{A}(s)$, where $\overline{A}(s)$ is the advantage averaged over actions (subtracting the mean makes the decomposition identifiable)
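The Double DQN target above can be sketched in a few lines of numpy. This is an illustrative helper (the function name and the array-batch convention are mine, not from any library), assuming the online and target networks have already produced Q-value tables of shape `(batch, n_actions)` for the next states:

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN regression targets for a batch of transitions.

    next_q_online / next_q_target: arrays of shape (batch, n_actions)
    holding Q(s', .) from the online and target networks.
    """
    # Online network *selects* the greedy next action...
    best_actions = np.argmax(next_q_online, axis=1)
    # ...target network *evaluates* it; decoupling the two curbs overestimation.
    next_values = next_q_target[np.arange(len(rewards)), best_actions]
    # No bootstrapping past terminal states.
    return rewards + gamma * (1.0 - dones) * next_values
```

Vanilla DQN would instead take `next_q_target.max(axis=1)`, letting the same network both pick and score the action, which is where the overestimation bias comes from.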
## SAC (Soft Actor-Critic)
Off-policy actor-critic with entropy maximization:
$$J(\pi) = \mathbb{E}\!\left[\sum_t \gamma^t\bigl(r(s_t, a_t) + \alpha\, H(\pi(\cdot \mid s_t))\bigr)\right]$$
- Automatic temperature $\alpha$ tuning
- Continuous action spaces
- Sample efficient, stable
- Standard for robotics/continuous control
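The entropy term shows up concretely in SAC's critic target: the value of the next state is the twin-critic minimum minus $\alpha \log \pi$. A minimal numpy sketch, assuming the next actions have already been sampled from the policy (function name and argument layout are illustrative):

```python
import numpy as np

def sac_critic_targets(rewards, q1_next, q2_next, logp_next, dones,
                       gamma=0.99, alpha=0.2):
    """Soft Bellman backup for SAC's critics.

    q1_next, q2_next: twin target-critic values Q(s', a') for a' ~ pi(.|s')
    logp_next: log pi(a'|s') for those sampled actions
    """
    # Clipped double-Q (shared with TD3) to curb overestimation.
    min_q = np.minimum(q1_next, q2_next)
    # Subtracting alpha * log pi adds the entropy bonus in expectation.
    soft_value = min_q - alpha * logp_next
    return rewards + gamma * (1.0 - dones) * soft_value
```

With automatic temperature tuning, `alpha` would itself be updated by gradient descent toward a target entropy rather than fixed as here.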
## TD3 (Twin Delayed DDPG)
Addresses Q-value overestimation in DDPG:
- Twin critics: take minimum of two Q estimates
- Delayed policy updates: update policy less frequently than critic
- Target policy smoothing: add noise to target actions
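Two of the three tricks appear directly in the target computation (delayed policy updates live in the training loop instead). A numpy sketch under stated assumptions: `q1_fn`/`q2_fn` are stand-ins for the target critics, and the action bound is a symmetric `[-act_limit, act_limit]` box:

```python
import numpy as np

def td3_targets(rewards, next_actions, q1_fn, q2_fn, dones,
                gamma=0.99, noise_std=0.2, noise_clip=0.5,
                act_limit=1.0, rng=None):
    """TD3 critic targets: target policy smoothing + twin-critic minimum.

    q1_fn / q2_fn: callables mapping a batch of next actions to Q(s', a').
    """
    rng = rng or np.random.default_rng(0)
    # Target policy smoothing: clipped Gaussian noise on the target action,
    # so the critic can't exploit sharp peaks in the Q landscape.
    noise = np.clip(rng.normal(0.0, noise_std, next_actions.shape),
                    -noise_clip, noise_clip)
    smoothed = np.clip(next_actions + noise, -act_limit, act_limit)
    # Twin critics: the minimum of two estimates bounds overestimation.
    next_q = np.minimum(q1_fn(smoothed), q2_fn(smoothed))
    return rewards + gamma * (1.0 - dones) * next_q
```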
## Algorithm Summary
| Algorithm | Action space | On/off-policy | Notes |
|---|---|---|---|
| DQN | discrete | off-policy | Atari games |
| PPO | both | on-policy | stable, general purpose |
| SAC | continuous | off-policy | robotics, efficient |
| TD3 | continuous | off-policy | continuous control |
| DDPG | continuous | off-policy | predecessor to SAC/TD3 |
| A3C/A2C | both | on-policy | parallel envs |
## Model-Based RL
Learn world model $P(s' \mid s, a)$, plan with it.
- Dyna: learn model, generate synthetic experience
- MuZero: learn model + value/policy together (AlphaZero without rules)
- Dreamer: learn latent dynamics model, optimize in imagination
More sample-efficient than model-free methods, but learning a model accurate enough to plan with is hard.
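The Dyna loop is simple enough to sketch in tabular form. A minimal single-step update, assuming a deterministic environment and dict-backed Q-table and model (all names here are illustrative, not from a library):

```python
import random

def dyna_q_update(Q, model, s, a, r, s2, actions,
                  alpha=0.5, gamma=0.9, n_planning=5, rng=None):
    """One Dyna-Q step: direct RL update, model learning, then planning."""
    rng = rng or random.Random(0)

    def td_update(s, a, r, s2):
        best_next = max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))

    # 1) Direct RL: ordinary Q-learning on the real transition.
    td_update(s, a, r, s2)
    # 2) Model learning: store the observed outcome (deterministic world assumed).
    model[(s, a)] = (r, s2)
    # 3) Planning: replay simulated transitions sampled from the learned model.
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = rng.choice(sorted(model.items()))
        td_update(ps, pa, pr, ps2)
```

Each real step thus buys `n_planning` extra "free" updates from the model, which is where the sample-efficiency gain comes from.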
## Reward Shaping
Add auxiliary rewards to guide learning:
- Potential-based: $F(s, s') = \gamma \Phi(s') - \Phi(s)$ — doesn't change optimal policy
- Curiosity (ICM): reward for prediction error of next state — intrinsic motivation
- HER (Hindsight Experience Replay): relabel failed trajectories with achieved goal as target
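HER's relabeling step can be sketched in a few lines. This shows the "final" relabeling strategy under assumed conventions (transitions as `(s, a, s_next, goal)` tuples, the final next-state standing in for the achieved goal); the function and its signature are illustrative:

```python
def her_relabel(transitions, reward_fn):
    """Relabel one failed episode with the goal it actually achieved.

    transitions: list of (s, a, s_next, goal) tuples from a single episode.
    reward_fn(s_next, goal): task reward, e.g. 0.0 on success else -1.0.
    """
    # Pretend the state reached at the end was the goal all along.
    new_goal = transitions[-1][2]
    return [(s, a, s2, new_goal, reward_fn(s2, new_goal))
            for (s, a, s2, _old_goal) in transitions]
```

A trajectory that earned only failure rewards under the original goal now contains at least one success under the relabeled goal, giving the agent a learning signal even in sparse-reward tasks.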