Language Models#

Autoregressive Language Modeling#

Model joint probability as product of conditionals:

$$P(x_1, \ldots, x_n) = \prod_i P(x_i \mid x_1, \ldots, x_{i-1})$$

Training objective: minimize negative log-likelihood (cross-entropy):

$$L = -\sum_i \log P(x_i \mid x_{<i}; \theta)$$
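The objective is just the summed log-probability the model assigns to each observed next token. A minimal sketch (all probabilities below are made-up toy numbers, not from any real model):

```python
import numpy as np

# probs[t] = model's distribution over a 4-token vocabulary at step t,
# conditioned on tokens before t; targets are the observed tokens.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.6, 0.2],
])
targets = np.array([0, 1, 2])

# L = -sum_i log P(x_i | x_<i): pick out each target's probability, log, sum.
nll = -np.sum(np.log(probs[np.arange(len(targets)), targets]))
```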

Perplexity#

$$\text{PPL} = \exp\!\left(\frac{L}{n}\right) = \exp\!\left(-\frac{1}{n} \sum_i \log P(x_i \mid x_{<i})\right)$$

Lower is better. Perplexity ≈ “effective branching factor” — how uncertain the model is at each step.
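The "branching factor" reading can be checked directly: a model that is uniform over a $V$-token vocabulary has perplexity exactly $V$. A quick sanity check (vocabulary and sequence sizes are arbitrary):

```python
import math

V, n = 50_000, 1_000
# A uniform model assigns P = 1/V everywhere, so L = -sum log(1/V) = n*log(V).
uniform_nll = n * math.log(V)
ppl = math.exp(uniform_nll / n)  # recovers V: the model "chooses among V options"
```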

Masked Language Modeling (BERT)#

Mask ~15% of tokens; predict them from context (both directions):

$$L = -\sum_{i \in \text{masked}} \log P(x_i \mid x_{\setminus \text{masked}})$$

Bidirectional → better representations; can’t do generation.
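The corruption step can be sketched in a few lines (token ids and `MASK_ID` are illustrative; real BERT additionally replaces 10% of selected tokens with random ones and leaves 10% unchanged):

```python
import random

MASK_ID = 103  # BERT's [MASK] id; here just an assumed constant
tokens = list(range(1000, 1020))

random.seed(0)
# Select ~15% of positions to mask; these are the prediction targets.
masked_positions = sorted(random.sample(range(len(tokens)),
                                        k=max(1, int(0.15 * len(tokens)))))
corrupted = [MASK_ID if i in masked_positions else t
             for i, t in enumerate(tokens)]
```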

Scaling Laws#

Loss follows power law with compute $C$:

$$L(C) \propto C^{-\alpha}, \quad \alpha \approx 0.05$$

More compute → lower loss, predictably. Enables planning training runs.
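Because the constant of proportionality cancels in a ratio, the law gives a concrete planning rule of thumb: with $\alpha \approx 0.05$, scaling compute by $10\times$ multiplies loss by $10^{-0.05}$, i.e. roughly an 11% reduction.

```python
alpha = 0.05
# Loss ratio for a 10x compute increase: L(10C)/L(C) = 10**-alpha.
ratio = 10 ** (-alpha)
```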

Sampling Strategies#

| Method | Description | Use |
| --- | --- | --- |
| Greedy | always pick $\arg\max$ | fast, but repetitive |
| Beam search | keep top-$k$ beams | translation, summarization |
| Temperature | divide logits by $T$ before softmax | $T<1$ sharper, $T>1$ flatter |
| Top-$k$ | sample from top $k$ tokens | reduces tail risk |
| Top-$p$ (nucleus) | sample from smallest set covering top-$p$ probability mass | adaptive, widely used |
| Min-$p$ | drop tokens with prob below $p \times$ max prob | recent, avoids incoherence |

Typical for chat: temperature = 0.7, top-$p$ = 0.9.
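Temperature and nucleus sampling compose naturally: rescale logits, then truncate to the smallest token set whose probability mass reaches $p$. A self-contained sketch (function name and defaults are illustrative, not from any library):

```python
import numpy as np

def sample(logits, temperature=0.7, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())      # stable softmax
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()     # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```

With a very peaked distribution the nucleus collapses to a single token, so sampling becomes greedy; flatter logits widen the nucleus and restore diversity.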

KV Cache#

During inference, reuse past key/value projections:

  • Without cache: $O(n^2)$ per token generated
  • With cache: $O(n)$ per token (only compute for new token)

Memory: $2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes per value}$ (the leading 2 covers keys and values; the cached tensors are activations, so bytes per value is set by the inference dtype)
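Plugging in numbers makes the formula concrete. For a Llama-2-7B-like configuration (dimensions assumed for illustration: 32 layers, 32 heads, head dim 128, fp16):

```python
n_layers, n_heads, d_head = 32, 32, 128
seq_len, bytes_per_value = 4096, 2   # fp16 -> 2 bytes per cached value

# 2x for keys and values, per layer, per head, per position.
kv_bytes = 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value
gib = kv_bytes / 2**30   # cache size in GiB for a single sequence
```

At these sizes one 4k-token sequence already needs 2 GiB of cache, which is why batch size and context length dominate inference memory planning.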

Context Length Extensions#

  • Positional interpolation: scale RoPE angles to extend beyond training length
  • YaRN: NTK-aware interpolation, better perplexity at long context
  • Sliding window attention (Mistral): each token attends only to a fixed-size window of recent tokens; stacking layers lets information propagate beyond the window
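Positional interpolation is just a rescaling of the position indices fed to RoPE: compressing positions by trained_len / target_len keeps every rotation angle inside the range seen in training. A sketch (variable names and the RoPE base are illustrative, not from any specific codebase):

```python
import numpy as np

def rope_angles(positions, d_head=64, base=10000.0):
    # Standard RoPE frequencies: one inverse frequency per dimension pair.
    inv_freq = base ** (-np.arange(0, d_head, 2) / d_head)
    return np.outer(positions, inv_freq)   # shape: (len(positions), d_head // 2)

trained_len, target_len = 2048, 8192
scale = trained_len / target_len            # 0.25: compress 8k positions into 2k
positions = np.arange(target_len) * scale   # angles never exceed the trained range
angles = rope_angles(positions)
```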

Instruction Tuning#

Fine-tune on (instruction, response) pairs → follows natural language instructions. FLAN, InstructGPT, Alpaca, Vicuna — progressively better instruction following.

Format matters: system prompt + user turn + assistant turn.
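A minimal sketch of that three-part layout (role names follow the common OpenAI-style convention; the `<|role|>` delimiters are illustrative, since the exact chat template varies per model):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize RoPE in one sentence."},
    {"role": "assistant", "content": "..."},  # model's turn to fill in
]

# Flatten turns into a single prompt string with per-role delimiters.
prompt = "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)
```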