Language Models#

Autoregressive Language Modeling#

Model joint probability as product of conditionals:

$$P(x_1, \ldots, x_n) = \prod_i P(x_i \mid x_1, \ldots, x_{i-1})$$

Training objective: minimize negative log-likelihood (cross-entropy):

$$L = -\sum_i \log P(x_i \mid x_{<i}; \theta)$$
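The objective is just the summed log-probability the model assigns to each observed next token. A minimal sketch (all probabilities below are made-up toy numbers, not from any real model):

```python
import numpy as np

# probs[t] = model's distribution over a 4-token vocabulary at step t,
# conditioned on tokens before t; targets are the observed tokens.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.6, 0.2],
])
targets = np.array([0, 1, 2])

# L = -sum_i log P(x_i | x_<i): pick out each target's probability, log, sum.
nll = -np.sum(np.log(probs[np.arange(len(targets)), targets]))
```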

Perplexity#

$$\text{PPL} = \exp\!\left(\frac{L}{n}\right) = \exp\!\left(-\frac{1}{n} \sum_i \log P(x_i \mid x_{<i})\right)$$

Lower is better. Perplexity ≈ “effective branching factor” — how uncertain the model is at each step.
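The "branching factor" reading can be checked directly: a model that is uniform over a $V$-token vocabulary has perplexity exactly $V$. A quick sanity check (vocabulary and sequence sizes are arbitrary):

```python
import math

V, n = 50_000, 1_000
# A uniform model assigns P = 1/V everywhere, so L = -sum log(1/V) = n*log(V).
uniform_nll = n * math.log(V)
ppl = math.exp(uniform_nll / n)  # recovers V: the model "chooses among V options"
```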

Masked Language Modeling (BERT)#

Mask ~15% of tokens; predict them from context (both directions):

$$L = -\sum_{i \in \text{masked}} \log P(x_i \mid x_{\setminus \text{masked}})$$

Bidirectional → better representations; can’t do generation.
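The corruption step can be sketched in a few lines (token ids and `MASK_ID` are illustrative; real BERT additionally replaces 10% of selected tokens with random ones and leaves 10% unchanged):

```python
import random

MASK_ID = 103  # BERT's [MASK] id; here just an assumed constant
tokens = list(range(1000, 1020))

random.seed(0)
# Select ~15% of positions to mask; these are the prediction targets.
masked_positions = sorted(random.sample(range(len(tokens)),
                                        k=max(1, int(0.15 * len(tokens)))))
corrupted = [MASK_ID if i in masked_positions else t
             for i, t in enumerate(tokens)]
```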

Scaling Laws#

Loss follows power law with compute $C$:

$$L(C) \propto C^{-\alpha}, \quad \alpha \approx 0.05$$

More compute → lower loss, predictably. Enables planning training runs.
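Because the constant of proportionality cancels in a ratio, the law gives a concrete planning rule of thumb: with $\alpha \approx 0.05$, scaling compute by $10\times$ multiplies loss by $10^{-0.05}$, i.e. roughly an 11% reduction.

```python
alpha = 0.05
# Loss ratio for a 10x compute increase: L(10C)/L(C) = 10**-alpha.
ratio = 10 ** (-alpha)
```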

Sampling Strategies#

| Method | Description | Use |
| --- | --- | --- |
| Greedy | always pick $\arg\max$ | fast, but repetitive |
| Beam search | keep top-$k$ beams | translation, summarization |
| Temperature | divide logits by $T$ before softmax | $T<1$ sharper, $T>1$ flatter |
| Top-$k$ | sample from top $k$ tokens | reduces tail risk |
| Top-$p$ (nucleus) | sample from smallest set covering top-$p$ probability mass | adaptive, widely used |
| Min-$p$ | drop tokens with prob below $p \times$ max prob | recent, avoids incoherence |

Typical for chat: temperature = 0.7, top-$p$ = 0.9.
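Temperature and nucleus sampling compose naturally: rescale logits, then truncate to the smallest token set whose probability mass reaches $p$. A self-contained sketch (function name and defaults are illustrative, not from any library):

```python
import numpy as np

def sample(logits, temperature=0.7, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())      # stable softmax
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()     # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```

With a very peaked distribution the nucleus collapses to a single token, so sampling becomes greedy; flatter logits widen the nucleus and restore diversity.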

KV Cache#

During inference, reuse past key/value projections:

  • Without cache: $O(n^2)$ per token generated
  • With cache: $O(n)$ per token (only compute for new token)

Memory: $2 \times n_{\text{layers}} \times n_{\text{heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes per value}$ (the leading 2 covers keys and values; the cached tensors are activations, so bytes per value is set by the inference dtype)
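Plugging in numbers makes the formula concrete. For a Llama-2-7B-like configuration (dimensions assumed for illustration: 32 layers, 32 heads, head dim 128, fp16):

```python
n_layers, n_heads, d_head = 32, 32, 128
seq_len, bytes_per_value = 4096, 2   # fp16 -> 2 bytes per cached value

# 2x for keys and values, per layer, per head, per position.
kv_bytes = 2 * n_layers * n_heads * d_head * seq_len * bytes_per_value
gib = kv_bytes / 2**30   # cache size in GiB for a single sequence
```

At these sizes one 4k-token sequence already needs 2 GiB of cache, which is why batch size and context length dominate inference memory planning.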

Context Length Extensions#

  • Positional interpolation: scale RoPE angles to extend beyond training length
  • YaRN: NTK-aware interpolation, better perplexity at long context
  • Sliding window attention (Mistral): each token attends only to a fixed-size window of recent tokens; stacking layers lets information propagate beyond the window
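Positional interpolation is just a rescaling of the position indices fed to RoPE: compressing positions by trained_len / target_len keeps every rotation angle inside the range seen in training. A sketch (variable names and the RoPE base are illustrative, not from any specific codebase):

```python
import numpy as np

def rope_angles(positions, d_head=64, base=10000.0):
    # Standard RoPE frequencies: one inverse frequency per dimension pair.
    inv_freq = base ** (-np.arange(0, d_head, 2) / d_head)
    return np.outer(positions, inv_freq)   # shape: (len(positions), d_head // 2)

trained_len, target_len = 2048, 8192
scale = trained_len / target_len            # 0.25: compress 8k positions into 2k
positions = np.arange(target_len) * scale   # angles never exceed the trained range
angles = rope_angles(positions)
```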

Instruction Tuning#

Fine-tune on (instruction, response) pairs → follows natural language instructions. FLAN, InstructGPT, Alpaca, Vicuna — progressively better instruction following.

Format matters: system prompt + user turn + assistant turn.
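A minimal sketch of that three-part layout (role names follow the common OpenAI-style convention; the `<|role|>` delimiters are illustrative, since the exact chat template varies per model):

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize RoPE in one sentence."},
    {"role": "assistant", "content": "..."},  # model's turn to fill in
]

# Flatten turns into a single prompt string with per-role delimiters.
prompt = "\n".join(f"<|{m['role']}|>\n{m['content']}" for m in messages)
```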