RLHF & Alignment#
Reinforcement Learning from Human Feedback#
Pipeline (InstructGPT / ChatGPT):
- SFT (Supervised Fine-Tuning): fine-tune pretrained LLM on human-written demonstrations
- Reward Model: train RM on human preferences over pairs of outputs
- PPO: optimize SFT model against RM using RL, with KL penalty to stay near SFT
$$L_\text{PPO} = \mathbb{E}_{x,\,y \sim \pi_\theta}\left[r(x,y)\right] - \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})$$
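The KL-penalized objective above can be sketched numerically. In practice the KL term is estimated per sampled token as $\log \pi_\theta(a \mid s) - \log \pi_\text{SFT}(a \mid s)$ and folded into the reward; all names here are illustrative, not from any specific library:

```python
import numpy as np

def kl_penalized_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Sequence-level RLHF reward: RM score minus beta times a
    per-token KL estimate (log pi_theta - log pi_sft), summed."""
    kl_est = logp_policy - logp_sft          # per-token KL estimate
    return rm_score - beta * kl_est.sum()    # scalar reward for the sequence

# Policy has drifted to be more confident than SFT on these tokens,
# so the KL estimate is positive and the reward is penalized.
logp_policy = np.log(np.array([0.6, 0.5]))
logp_sft = np.log(np.array([0.5, 0.4]))
r = kl_penalized_reward(2.0, logp_policy, logp_sft, beta=0.1)
```

With `beta=0` this reduces to pure reward maximization; the penalty is what anchors the policy near the SFT model.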
Reward Modeling#
Given pairs $(y_w, y_l)$ where $y_w$ is preferred:
$$L_\text{RM} = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$
Bradley-Terry preference model. RM is typically the SFT model with a linear head replacing the LM head.
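A minimal numpy sketch of the pairwise loss, assuming scalar rewards `r_w`, `r_l` already produced by the RM head:

```python
import numpy as np

def rm_pairwise_loss(r_w, r_l):
    """Bradley-Terry loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_w - r_l))))

# When the two scores tie, the loss is log 2; it shrinks as the
# preferred completion's score pulls ahead of the rejected one.
tie = rm_pairwise_loss(0.0, 0.0)
ahead = rm_pairwise_loss(2.0, 0.0)
```

Only the score *difference* matters, so the RM's absolute scale is unconstrained; PPO implementations usually normalize RM outputs.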
PPO in RLHF#
Proximal Policy Optimization — clips policy ratio to prevent large updates:
$$L_\text{CLIP} = \mathbb{E}\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$$
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \quad \varepsilon \approx 0.2$$
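The clipped surrogate can be sketched directly from log-probabilities (a toy numpy version; real implementations batch this over trajectories):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate: mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()

# A large policy update (ratio = e ~ 2.72) with positive advantage
# gets capped at (1 + eps) * A = 1.2, so the gradient incentive to
# move further vanishes.
obj = ppo_clip_objective(np.array([1.0]), np.array([0.0]), np.array([1.0]))
```

The `min` makes the bound one-sided: clipping only removes the incentive for updates that would *improve* the surrogate too much, never for ones that hurt it.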
Direct Preference Optimization (DPO)#
Eliminates separate RM and RL — directly optimize LLM on preferences:
$$L_\text{DPO} = -\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)$$
Simpler and more stable than PPO-based RLHF, with competitive results. Used in Zephyr and Llama 3.
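The DPO loss is a direct translation of the formula, given sequence log-probabilities under the policy and the frozen reference model (names are illustrative):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# If policy and reference agree, the margin is 0 and the loss is log 2;
# raising the chosen completion's log-prob relative to the reference
# lowers the loss.
base = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(1.0, 0.0, 0.0, 0.0)
```

Note the structural similarity to the RM loss: the implicit reward is $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$, which is how DPO absorbs the reward model into the policy.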
RLAIF#
Replace human labelers with an AI labeler (e.g., Claude, GPT-4). Constitutional AI (Anthropic):
- Generate critique of harmful response
- Revise based on critique
- Use revised response as SFT/RM data
KL Penalty#
The coefficient $\beta$ on $\text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})$ controls how far the policy drifts from the SFT model. Too small → the policy exploits the RM (reward hacking); too large → the policy barely moves and underfits the reward.
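A toy numpy example of the quantity being penalized, over a single next-token distribution (a sketch, not how production implementations estimate KL, which work from sampled log-prob differences):

```python
import numpy as np

def kl_divergence(p, q):
    """Exact KL(p || q) for two categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

pi_sft = np.array([0.25, 0.25, 0.25, 0.25])
pi_drifted = np.array([0.85, 0.05, 0.05, 0.05])

kl_same = kl_divergence(pi_sft, pi_sft)        # no drift: zero penalty
kl_drift = kl_divergence(pi_drifted, pi_sft)   # heavy drift: large penalty
```

A policy that collapses onto a narrow set of RM-pleasing tokens (as in reward hacking) shows up as exactly this kind of large KL, which is why the penalty acts as an early brake on it.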
Reward Hacking#
Model finds unintended ways to maximize reward (repetition, sycophancy, length inflation). Mitigated by: better RM coverage, iterative RM updates, rule-based filters.
Constitutional AI#
Principles → critiques → revisions → SL-CAI (supervised) → RL-CAI (RL with AI feedback). Reduces need for human feedback on harmful content.