Tokenization#

Converting raw text to discrete tokens that models can process.

Byte-Pair Encoding (BPE)#

Algorithm:

  1. Start with character vocabulary
  2. Count all adjacent pair frequencies
  3. Merge most frequent pair → new token
  4. Repeat until vocabulary size V reached

Used in: GPT-2, GPT-3, LLaMA, Mistral.

Vocabulary size: typically 32K–100K tokens. GPT-4 uses ~100K (cl100k_base).
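The merge loop above can be sketched in a few lines. This is a toy trainer over a list of words (real implementations pre-tokenize and count words first, and GPT-style BPE works on bytes rather than characters):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of words (toy sketch)."""
    # Each word is a tuple of symbols, starting from single characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each learned merge becomes a new vocabulary entry; applying the merges in order reproduces the tokenizer at inference time.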

WordPiece#

Similar to BPE, but merges the pair that maximizes the likelihood of the training data under the current vocabulary (used by BERT and DistilBERT). Non-initial subword tokens are prefixed with ## to mark continuation.
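At inference time, WordPiece segments a word by greedy longest-match-first lookup. A minimal sketch, assuming a toy in-memory vocabulary (real implementations also handle casing and special tokens):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation (WordPiece inference)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate until it is in the vocab
        if match is None:
            return [unk]  # no piece matches: the whole word becomes [UNK]
        tokens.append(match)
        start = end
    return tokens

vocab = {"run", "##ning", "##ner", "jump"}
wordpiece_tokenize("running", vocab)  # → ["run", "##ning"]
```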

Unigram Language Model#

Starts with a large candidate vocabulary and iteratively removes tokens whose removal least hurts the likelihood of the training data. Implemented in SentencePiece; used by T5 and ALBERT.
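Given a trained unigram vocabulary, segmentation picks the split that maximizes the sum of token log-probabilities. A sketch using Viterbi over word prefixes, with a made-up toy vocabulary (the log-probabilities are illustrative, not trained):

```python
import math

def best_segmentation(word, logp):
    """Segmentation maximizing total log-probability (Viterbi over prefixes)."""
    n = len(word)
    # best[i] = (best score for word[:i], tokens achieving it)
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] > -math.inf:
                score = best[start][0] + logp[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[n][1]

# Toy vocabulary with made-up log-probabilities.
logp = {"un": -2.0, "happy": -3.0, "u": -4.0, "n": -4.0, "unhappy": -8.0}
best_segmentation("unhappy", logp)  # → ["un", "happy"]
```

Here "un" + "happy" scores -5.0, beating the single token "unhappy" at -8.0, so the model prefers the two-piece split.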

SentencePiece#

Language-agnostic tokenizer library that treats input as a raw character stream, so no language-specific pre-tokenization is needed; whitespace is encoded as an explicit symbol (▁, U+2581). Can run either the BPE or the unigram algorithm, and handles any script uniformly.

Byte-Level Tokenization#

Operates on raw bytes (0–255) as base units, so unknown tokens are impossible: every string decomposes into bytes. GPT-2 uses byte-level BPE.
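The base-unit mapping is just UTF-8 encoding, which is why no `<unk>` token is ever needed:

```python
def byte_tokens(text):
    """Byte-level base units: any string maps to values 0-255, so no <unk> is possible."""
    return list(text.encode("utf-8"))

byte_tokens("hi")     # → [104, 105]
byte_tokens("héllo")  # 'é' expands to two bytes (0xC3, 0xA9) under UTF-8
```

The cost is sequence length: non-ASCII characters span 2–4 bytes each, so byte-level models apply BPE merges on top of the bytes to recover efficiency.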

Tokenization Issues#

| Issue | Example | Impact |
| --- | --- | --- |
| Word splitting | "running" → ["run", "##ning"] | model must learn morphology |
| Rare words | "xyzzy" → ["x", "y", "z", "z", "y"] | inefficient |
| Numbers | "1234" → ["12", "34"] or ["1", "2", "3", "4"] | arithmetic is hard |
| Non-English text | underrepresented in training data → more tokens per word | higher cost |
| Whitespace | " hello" ≠ "hello" in most tokenizers | subtle bugs |

Vocabulary Size Tradeoff#

| Smaller vocab | Larger vocab |
| --- | --- |
| more tokens per sequence (consumes more context) | fewer tokens per sequence (efficient) |
| smaller embedding table | larger embedding table |
| more morphological learning required | less morphological learning required |