RLHF & Alignment#

Reinforcement Learning from Human Feedback#

Pipeline (InstructGPT / ChatGPT):

  1. SFT (Supervised Fine-Tuning): fine-tune pretrained LLM on human-written demonstrations
  2. Reward Model: train RM on human preferences over pairs of outputs
  3. PPO: optimize SFT model against RM using RL, with KL penalty to stay near SFT

$$L_\text{PPO} = \mathbb{E}[r(x,y)] - \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})$$
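The objective above can be sketched numerically. A minimal per-sequence version, using the naive sample-based KL estimate $\sum_t (\log \pi_\theta - \log \pi_\text{SFT})$ over the generated tokens (function and variable names here are illustrative, not from any library):

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """RM reward minus a KL penalty, per sequence.

    logp_policy / logp_sft: per-token log-probs of the sampled tokens
    under the current policy and the frozen SFT model.
    beta trades reward against drift from the SFT model.
    """
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_sft))
    return reward - beta * kl_estimate

# A policy that drifts further from SFT pays a larger penalty:
r_close = kl_penalized_reward(1.0, [-1.0, -2.0], [-1.1, -2.1], beta=0.1)
r_far = kl_penalized_reward(1.0, [-0.2, -0.5], [-1.1, -2.1], beta=0.1)
```

In practice the KL term is applied per token inside the PPO reward rather than once per sequence, but the trade-off is the same.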

Reward Modeling#

Given pairs $(y_w, y_l)$ where $y_w$ is preferred:

$$L_\text{RM} = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

Bradley-Terry preference model. RM is typically the SFT model with a linear head replacing the LM head.
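The Bradley-Terry loss is just a logistic loss on the reward margin. A minimal sketch with scalar rewards standing in for the RM head's outputs (the function name is illustrative):

```python
import math

def rm_pairwise_loss(r_chosen, r_rejected):
    """-log sigma(r(x, y_w) - r(x, y_l)): small when the RM
    scores the preferred output well above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At zero margin the loss is log 2; it decays toward 0 as the RM separates the pair.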

PPO in RLHF#

Proximal Policy Optimization — clips policy ratio to prevent large updates:

$$L_\text{CLIP} = \mathbb{E}\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\; \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$$

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \quad \varepsilon \approx 0.2$$
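The clipped surrogate for a single token reduces to a few lines (a sketch; the function name is illustrative):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(r * A, clip(r, 1-eps, 1+eps) * A) for one token.

    Caps the incentive to push the policy ratio far from 1,
    in either direction, for a single update.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: gain is capped once the ratio exceeds 1 + eps.
capped = ppo_clip_term(1.5, 1.0)      # 1.2, not 1.5
# Negative advantage: the min picks the more pessimistic (clipped) value.
pessimistic = ppo_clip_term(0.5, -1.0)  # -0.8, not -0.5
```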

Direct Preference Optimization (DPO)#

Eliminates separate RM and RL — directly optimize LLM on preferences:

$$L_\text{DPO} = -\log \sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log\frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)$$

Simpler and more stable than PPO-based RLHF, and competitive in practice. Used in Zephyr and Llama 3 (LLaMA 2 used PPO-based RLHF, not DPO).
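The DPO loss above, written out with sequence log-probs (a sketch; names are illustrative, and real implementations sum token log-probs from the model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).

    Each difference is an implicit reward: how much the policy
    upweights that completion relative to the frozen reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference, both implicit rewards vanish and the loss is log 2; upweighting $y_w$ relative to the reference drives it down.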

RLAIF#

Replace human labelers with an AI labeler (e.g., Claude, GPT-4). Constitutional AI (Anthropic):

  1. Generate critique of harmful response
  2. Revise based on critique
  3. Use revised response as SFT/RM data
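The critique-revise loop above can be sketched as a data-generation step. Here `model` is a hypothetical callable standing in for an LLM API, and the prompt templates are illustrative, not Anthropic's actual ones:

```python
def constitutional_revision(model, prompt, principle):
    """Generate one (prompt, revised response) pair for SFT/RM training."""
    response = model(prompt)
    critique = model(
        f"Critique this response against the principle '{principle}':\n{response}"
    )
    revised = model(
        f"Rewrite the response to address the critique.\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return {"prompt": prompt, "chosen": revised}

# Usage with a stub model (a real pipeline would call an LLM here):
stub = lambda text: "output for: " + text[:30]
example = constitutional_revision(stub, "Write something risky.", "avoid harm")
```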

KL Penalty#

$\text{KL}(\pi_\theta \,\|\, \pi_\text{SFT})$ controls how far the policy drifts from SFT. Too small a penalty → the policy drifts freely and reward-hacks; too large → it stays pinned to SFT and underfits the reward.
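The naive per-token estimate $\log \pi_\theta - \log \pi_\text{SFT}$ is unbiased but can go negative on individual samples. A common low-variance, always-nonnegative alternative used in several RLHF codebases is the "k3" estimator (sketch; names are illustrative):

```python
import math

def kl_k3(logp_policy, logp_sft):
    """Per-token KL(pi_theta || pi_SFT) estimate: exp(-d) - 1 + d,
    where d = logp_policy - logp_sft. Nonnegative for every sample,
    unlike d itself, which halves variance at little bias cost."""
    d = logp_policy - logp_sft
    return math.exp(-d) - 1.0 + d
```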

Reward Hacking#

Model finds unintended ways to maximize reward (repetition, sycophancy, length inflation). Mitigated by: better RM coverage, iterative RM updates, rule-based filters.

Constitutional AI#

Principles → critiques → revisions → SL-CAI (supervised) → RL-CAI (RL with AI feedback). Reduces need for human feedback on harmful content.