Tokenization#
Converting raw text to discrete tokens that models can process.
Byte-Pair Encoding (BPE)#
Algorithm:
- Start with character vocabulary
- Count all adjacent pair frequencies
- Merge most frequent pair → new token
- Repeat until vocabulary size V reached
Used in: GPT-2, GPT-3, LLaMA, Mistral.
Vocabulary size: typically 32K–100K tokens. GPT-4 uses ~100K (cl100k_base).
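The merge loop above can be sketched in a few lines. This is a toy trainer over a word-frequency dictionary (real implementations work on a pre-tokenized corpus and store merge ranks); the sample words and counts follow the classic BPE illustration:

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer. `words` maps word -> frequency; each word is
    represented as a tuple of symbols (initially single characters)."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequencies of all adjacent symbol pairs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Each learned merge becomes a new vocabulary entry; at inference time the merges are applied in the same order they were learned.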
WordPiece#
Similar to BPE, but instead of merging the most frequent pair, it picks the merge that most increases the likelihood of the training data (roughly, pair frequency divided by the product of the parts' frequencies). Used in BERT and DistilBERT.
Non-initial subword tokens are prefixed with ## to mark continuation of a word.
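The ## convention makes detokenization trivial; a minimal sketch of rejoining WordPiece output:

```python
def wordpiece_detokenize(tokens):
    """Rejoin WordPiece subwords: '##' marks continuation of the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # glue continuation onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

print(wordpiece_detokenize(["run", "##ning", "fast"]))  # running fast
```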
Unigram Language Model#
Starts with a large candidate vocabulary and iteratively removes tokens whose removal least hurts training-data likelihood. Implemented in SentencePiece (used by T5); note that LLaMA-family models use SentencePiece with BPE rather than unigram.
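At inference time, a unigram model segments text by finding the tokenization with the highest total log-probability, typically via Viterbi dynamic programming. A sketch with a hypothetical vocabulary (the log-probabilities below are illustrative, not from a trained model):

```python
import math

# Hypothetical unigram vocabulary with illustrative log-probabilities.
LOGP = {"h": -5.0, "u": -5.0, "g": -5.0, "s": -4.0,
        "hu": -3.5, "ug": -3.0, "hug": -2.5, "gs": -3.8}

def viterbi_segment(text):
    """Return the highest-likelihood segmentation under a unigram LM."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in LOGP:
                score = best[start][0] + LOGP[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    # Backtrack to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

print(viterbi_segment("hugs"))  # ['hug', 's']
```

"hugs" segments as ["hug", "s"] (score -6.5) rather than ["hu", "gs"] (score -7.3), because the total log-probability is higher.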
SentencePiece#
Language-agnostic tokenizer framework that treats input as a raw character stream, encoding whitespace as a visible marker (▁), so no language-specific pre-tokenization is needed. Supports both the BPE and unigram algorithms. Handles any script and whitespace uniformly, and encoding is lossless (fully reversible).
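The ▁ whitespace convention is easy to illustrate. This toy sketch splits on the marker only, not with a real trained model, but shows why encoding is reversible and why " hello" and "hello" produce different pieces:

```python
MARKER = "\u2581"  # '▁', SentencePiece's visible whitespace marker

def to_pieces(text):
    """Toy illustration of SentencePiece's whitespace handling: spaces become
    a marker attached to the following piece (no real subword model here)."""
    text = text.replace(" ", MARKER)
    pieces, cur = [], ""
    for ch in text:
        if ch == MARKER and cur:  # each marker starts a new piece
            pieces.append(cur)
            cur = ch
        else:
            cur += ch
    if cur:
        pieces.append(cur)
    return pieces

def from_pieces(pieces):
    """Lossless inverse: concatenate and restore spaces."""
    return "".join(pieces).replace(MARKER, " ")

print(to_pieces("hello world"))   # ['hello', '▁world']
print(to_pieces(" hello"))        # ['▁hello']  — leading space is preserved
```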
Byte-Level Tokenization#
Operates on raw bytes (0–255) as base units, so unknown tokens are impossible: every string has a byte representation. GPT-2 uses byte-level BPE, learning merges over byte sequences.
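A minimal demonstration of why byte-level schemes never need an `<unk>` token: any text, in any script, maps to the same 256 base symbols:

```python
def byte_tokens(text):
    """Map text to its UTF-8 bytes: 256 base tokens cover every possible input."""
    return list(text.encode("utf-8"))

def from_byte_tokens(ids):
    """Lossless inverse of byte_tokens."""
    return bytes(ids).decode("utf-8")

ids = byte_tokens("héllo")  # 'é' spans two bytes (195, 169)
print(ids)                  # [104, 195, 169, 108, 108, 111]
```

The cost is longer sequences: non-ASCII characters expand to 2–4 byte tokens each before any BPE merges are applied.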
Tokenization Issues#
| Issue | Example | Impact |
|---|---|---|
| Word splitting | "running" → ["run", "##ning"] | model must learn morphology |
| Rare words | "xyzzy" → ["x", "y", "z", "z", "y"] | inefficient |
| Numbers | "1234" → ["12", "34"] or ["1", "2", "3", "4"] | arithmetic is hard |
| Non-English | languages underrepresented in training data → more tokens per word | higher cost |
| Whitespace | " hello" ≠ "hello" in most tokenizers | subtle bugs |
Vocabulary Size Tradeoff#
| Smaller vocab | Larger vocab |
|---|---|
| more tokens per sequence (text consumes more of the context window) | fewer tokens per sequence (more text fits per context) |
| smaller embedding table | larger embedding table (more parameters) |
| model must compose meaning from subwords | more whole words get dedicated tokens |