# Language Models
## Autoregressive Language Modeling
Model joint probability as product of conditionals:
$$P(x_1, \ldots, x_n) = \prod_i P(x_i \mid x_1, \ldots, x_{i-1})$$
Training objective: minimize negative log-likelihood (cross-entropy):
$$L = -\sum_i \log P(x_i \mid x_{<i}; \theta)$$
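As a concrete check of the objective, a toy sketch (the token probabilities are made up for illustration, not from any model):

```python
import math

# Hypothetical next-token distributions: probs[i] maps candidate tokens
# to P(x_i | x_<i) under some model.
probs = [
    {"the": 0.5, "a": 0.3, "an": 0.2},
    {"cat": 0.6, "dog": 0.4},
    {"sat": 0.7, "ran": 0.3},
]
target = ["the", "cat", "sat"]

# Negative log-likelihood: sum of -log P(x_i | x_<i) over the sequence.
nll = -sum(math.log(p[t]) for p, t in zip(probs, target))
```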
## Perplexity
$$\text{PPL} = \exp(L/n) = \exp\!\left(-\frac{1}{n} \sum_i \log P(x_i \mid x_{<i})\right)$$
Lower is better. Perplexity ≈ “effective branching factor” — how uncertain the model is at each step.
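Continuing with toy numbers, perplexity is the exponentiated average NLL (values are illustrative):

```python
import math

# Per-token log-probs a hypothetical model assigns to the target sequence.
log_probs = [math.log(0.5), math.log(0.6), math.log(0.7)]
n = len(log_probs)

# Perplexity = exp of the average negative log-likelihood.
ppl = math.exp(-sum(log_probs) / n)
# Equivalently, the inverse geometric mean of the assigned probabilities;
# a uniform model over V tokens scores PPL = V.
```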
## Masked Language Modeling (BERT)
Mask ~15% of tokens; predict them from context (both directions):
$$L = -\sum_{i \in \text{masked}} \log P(x_i \mid x_{\setminus \text{masked}})$$
Bidirectional → better representations; can’t do generation.
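A minimal masking sketch (the function name and simplified recipe are mine; the full BERT recipe also leaves 10% of selected tokens unchanged and swaps 10% for random tokens):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Pick ~mask_rate of positions, hide them, and record the originals
    as prediction targets."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            masked[i] = mask_token
    return masked, targets
```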
## Scaling Laws
Loss follows power law with compute $C$:
$$L(C) \propto C^{-\alpha}, \quad \alpha \approx 0.05$$
More compute → lower loss, predictably. Enables planning training runs.
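A quick sanity check of what the power law implies (the constant `a` is an arbitrary illustrative fit, not a published value):

```python
# Power-law loss curve: L(C) = a * C**(-alpha).
def loss(compute, a=10.0, alpha=0.05):
    return a * compute ** (-alpha)

# With alpha = 0.05, 10x more compute multiplies loss by
# 10**(-0.05) ≈ 0.891, i.e. roughly an 11% reduction.
ratio = loss(1e20) / loss(1e19)
```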
## Sampling Strategies
| Method | Description | Use |
|---|---|---|
| Greedy | always pick $\arg\max$ | fast, repetitive |
| Beam search | keep top-$k$ beams | translation, summarization |
| Temperature | divide logits by $T$ before softmax | $T<1$ sharper, $T>1$ flatter |
| Top-$k$ | sample from top $k$ tokens | reduces tail risk |
| Top-$p$ (nucleus) | sample from top-$p$ probability mass | adaptive, widely used |
| Min-$p$ | filter tokens below $p \times \max\text{prob}$ | recent, avoids incoherence |
Typical for chat: temperature = 0.7, top-$p$ = 0.9.
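A sketch combining temperature and nucleus sampling over a `{token: logit}` dict (the function and representation are illustrative, not a library API):

```python
import math
import random

def sample(logits, temperature=0.7, top_p=0.9, seed=None):
    """Temperature + nucleus (top-p) sampling over a {token: logit} dict."""
    rng = random.Random(seed)
    # Temperature: divide logits by T before softmax (T < 1 sharpens).
    scaled = {t: l / temperature for t, l in logits.items()}
    # Numerically stable softmax.
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()),
                   key=lambda kv: kv[1], reverse=True)
    # Nucleus: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Sample from the renormalized nucleus.
    r = rng.random() * mass
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]
```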
## KV Cache
During inference, reuse past key/value projections:
- Without cache: $O(n^2)$ per token generated
- With cache: $O(n)$ per token (only compute for new token)
Memory (per sequence): $2 \times n_\text{layers} \times n_\text{heads} \times d_\text{head} \times \text{seq\_len} \times \text{bytes\_per\_elem}$ (the factor 2 counts keys and values).
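Plugging a Llama-2-7B-like shape into the formula (32 layers, 32 heads, head dim 128, fp16 so 2 bytes per element; the shapes are illustrative):

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, bytes_per_elem=2):
    # 2x for keys and values, per sequence in the batch.
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_elem

gib = kv_cache_bytes(32, 32, 128, 4096) / 1024**3  # → 2.0 GiB at 4k context
```

The cache grows linearly with both context length and batch size, which is why long-context serving tends to be memory-bound.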
## Context Length Extensions
- Positional interpolation: rescale position indices so a longer context maps into the RoPE position range seen during training
- YaRN: NTK-aware interpolation, better perplexity at long context
- Sliding window attention (Mistral): each token attends only to a fixed-size window of recent tokens; stacking layers grows the effective receptive field
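A sketch of the boolean mask sliding-window attention induces (`True` = may attend; the helper name is mine):

```python
def sliding_window_mask(seq_len, window):
    """Causal mask where token i attends only to positions (i - window, i]."""
    return [[max(0, i - window + 1) <= j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```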
## Instruction Tuning
Fine-tune on (instruction, response) pairs → follows natural language instructions. FLAN, InstructGPT, Alpaca, Vicuna — progressively better instruction following.
Format matters: system prompt + user turn + assistant turn.
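A hypothetical formatting sketch; the delimiter tokens below are invented for illustration, and real chat templates (ChatML, Llama-2's `[INST]` format, etc.) each use their own:

```python
def format_chat(system, user):
    # Hypothetical delimiters; real chat templates differ per model.
    return (f"<|system|>\n{system}\n"
            f"<|user|>\n{user}\n"
            f"<|assistant|>\n")  # generation continues from here
```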