Tokenization#
Converting raw text to discrete tokens that models can process.
Byte-Pair Encoding (BPE)#
Algorithm:
- Start with character vocabulary
- Count all adjacent pair frequencies
- Merge most frequent pair → new token
- Repeat until vocabulary size V reached
Used in: GPT-2, GPT-3, LLaMA, Mistral.
Vocabulary size: typically 32K–100K tokens. GPT-4 uses ~100K (cl100k_base).
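The merge loop above can be sketched in a few lines. This is a toy trainer over a word-frequency dictionary (real implementations work on a pre-tokenized corpus and store merge ranks); the sample words and counts follow the classic BPE illustration:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer. `words` maps word -> frequency; each word is
    represented as a tuple of symbols (initially single characters)."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequencies of all adjacent symbol pairs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Each learned merge becomes a new vocabulary entry; at inference time the merges are applied in the same order they were learned.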
WordPiece#
Similar to BPE, but instead of merging the most frequent pair, it picks the merge that most increases the likelihood of the training data (roughly, pair frequency divided by the product of the parts' frequencies). Used in BERT and DistilBERT.
Non-initial subword tokens are prefixed with ## to mark continuation of a word.
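The ## convention makes detokenization trivial; a minimal sketch of rejoining WordPiece output:

```python
def wordpiece_detokenize(tokens):
    """Rejoin WordPiece subwords: '##' marks continuation of the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue continuation onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

print(wordpiece_detokenize(["run", "##ning", "fast"]))  # running fast
```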
Unigram Language Model#
Starts with a large candidate vocabulary and iteratively removes tokens whose removal least hurts training-data likelihood. Implemented in SentencePiece (used by T5); note that LLaMA-family models use SentencePiece with BPE rather than unigram.
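At inference time, a unigram model segments text by finding the tokenization with the highest total log-probability, typically via Viterbi dynamic programming. A sketch with a hypothetical vocabulary (the log-probabilities below are illustrative, not from a trained model):

```python
import math

# Hypothetical unigram vocabulary with illustrative log-probabilities.
LOGP = {"h": -5.0, "u": -5.0, "g": -5.0, "s": -4.0,
        "hu": -3.5, "ug": -3.0, "hug": -2.5, "gs": -3.8}

def viterbi_segment(text):
    """Return the highest-likelihood segmentation under a unigram LM."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in LOGP:
                score = best[start][0] + LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

print(viterbi_segment("hugs"))  # ['hug', 's']
```

"hugs" segments as ["hug", "s"] (score -6.5) rather than ["hu", "gs"] (score -7.3), because the total log-probability is higher.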
SentencePiece#
Language-agnostic tokenizer framework that treats input as a raw character stream, encoding whitespace as a visible marker (▁), so no language-specific pre-tokenization is needed. Supports both the BPE and unigram algorithms. Handles any script and whitespace uniformly, and encoding is lossless (fully reversible).
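The ▁ whitespace convention is easy to illustrate. This toy sketch splits on the marker only, not with a real trained model, but shows why encoding is reversible and why " hello" and "hello" produce different pieces:

```python
MARKER = "\u2581"  # '▁', SentencePiece's visible whitespace marker

def to_pieces(text):
    """Toy illustration of SentencePiece's whitespace handling: spaces become
    a marker attached to the following piece (no real subword model here)."""
    text = text.replace(" ", MARKER)
    pieces, cur = [], ""
    for ch in text:
        if ch == MARKER and cur:  # each marker starts a new piece
            pieces.append(cur)
            cur = ch
        else:
            cur += ch
    if cur:
        pieces.append(cur)
    return pieces

def from_pieces(pieces):
    """Lossless inverse: concatenate and restore spaces."""
    return "".join(pieces).replace(MARKER, " ")

print(to_pieces("hello world"))   # ['hello', '▁world']
print(to_pieces(" hello"))        # ['▁hello']  — leading space is preserved
```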
Byte-Level Tokenization#
Operates on raw bytes (0–255) as base units, so unknown tokens are impossible: every string has a byte representation. GPT-2 uses byte-level BPE, learning merges over byte sequences.
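A minimal demonstration of why byte-level schemes never need an `<unk>` token: any text, in any script, maps to the same 256 base symbols:

```python
def byte_tokens(text):
    """Map text to its UTF-8 bytes: 256 base tokens cover every possible input."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    """Lossless inverse of byte_tokens."""
    return bytes(ids).decode("utf-8")

ids = byte_tokens("héllo")  # 'é' spans two bytes (195, 169)
print(ids)                  # [104, 195, 169, 108, 108, 111]
```

The cost is longer sequences: non-ASCII characters expand to 2–4 byte tokens each before any BPE merges are applied.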
Tokenization Issues#
| Issue | Example | Impact |
|---|---|---|
| Word splitting | "running" → ["run", "##ning"] | model must learn morphology |
| Rare words | "xyzzy" → ["x", "y", "z", "z", "y"] | inefficient |
| Numbers | "1234" → ["12", "34"] or ["1", "2", "3", "4"] | arithmetic is hard |
| Non-English | languages underrepresented in training data → more tokens per word | higher cost |
| Whitespace | " hello" ≠ "hello" in most tokenizers | subtle bugs |
Vocabulary Size Tradeoff#
| Smaller vocab | Larger vocab |
|---|---|
| more tokens per sequence (text consumes more of the context window) | fewer tokens per sequence (more text fits per context) |
| smaller embedding table | larger embedding table (more parameters) |
| model must compose meaning from subwords | more whole words get dedicated tokens |