Word Embeddings#
One-Hot Encoding#
Vocabulary of size $V$: each word → $V$-dimensional binary vector. No similarity structure; dimensionality scales with vocab.
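A minimal NumPy sketch of one-hot encoding; the toy vocabulary is illustrative:

```python
import numpy as np

# Toy vocabulary; indices are arbitrary but fixed.
vocab = ["king", "queen", "man", "woman", "bank"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

def one_hot(word: str) -> np.ndarray:
    """Return the V-dimensional one-hot vector for `word`."""
    vec = np.zeros(V)
    vec[word_to_idx[word]] = 1.0
    return vec

# Any two distinct words are equally dissimilar: their dot product is 0.
print(one_hot("king") @ one_hot("queen"))  # 0.0 -- no similarity structure
```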
Word2Vec#
Predict word from context (CBOW) or context from word (Skip-gram):
Skip-gram: maximize $P(\text{context} \mid \text{center}) = \prod P(w_o \mid w_i)$
Negative sampling: instead of full softmax over $V$, sample $k$ negative examples:
$$L = \log \sigma(v_o^\top v_i) + \sum_{j=1}^{k} \mathbb{E}_{w_j \sim P_n(w)}\left[\log \sigma(-v_{w_j}^\top v_i)\right]$$
Result: similar words have similar vectors. Encodes analogy structure: king $-$ man $+$ woman $\approx$ queen.
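A minimal NumPy sketch of the negative-sampling objective for a single (center, context) pair; the embedding dimension and the sampled negatives are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 5                       # embedding dimension, number of negatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative vectors: v_i (center word), v_o (observed context), v_neg (k sampled negatives).
v_i = rng.normal(size=d)
v_o = rng.normal(size=d)
v_neg = rng.normal(size=(k, d))    # in practice drawn from a smoothed unigram noise distribution

# Negative-sampling objective for this pair (maximized; negated here to get a loss):
# log sigma(v_o . v_i) + sum_n log sigma(-v_n . v_i)
loss = -(np.log(sigmoid(v_o @ v_i)) + np.sum(np.log(sigmoid(-v_neg @ v_i))))
print(loss)   # minimize with SGD over all (center, context) pairs
```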
GloVe#
Global Vectors — factorize word co-occurrence matrix:
$$J = \sum_{i,j} f(X_{ij})\,(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$
$X_{ij}$ = co-occurrence count. $f(x) = (x/x_\max)^\alpha$ for $x < x_\max$, else 1 (down-weights rare pairs and caps the weight of very frequent ones).
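A sketch of the weighting function and one $(i, j)$ term of the objective; $x_\max = 100$ and $\alpha = 0.75$ are the values from the GloVe paper, while the vectors, biases, and count below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
x_max, alpha = 100.0, 0.75         # defaults from the GloVe paper

def weight(x):
    """Down-weights rare pairs, caps the weight of very frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

# One (i, j) term of J with placeholder parameters.
d = 50
w_i, w_j = rng.normal(size=d), rng.normal(size=d)   # word vector and context vector
b_i, b_j = 0.0, 0.0                                  # biases
X_ij = 37.0                                          # co-occurrence count for the pair (i, j)

term = weight(X_ij) * (w_i @ w_j + b_i + b_j - np.log(X_ij)) ** 2
print(term)   # J sums this over all nonzero cells of the co-occurrence matrix
```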
Contextual Embeddings#
Same word gets different embedding depending on context:
- ELMo: BiLSTM layers, combine all hidden states
- BERT: Transformer encoder, [CLS] token for classification
- GPT: Transformer decoder, last token or mean pooling
Contextual embeddings capture polysemy (“bank”: riverbank vs. financial institution).
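A sketch of pulling contextual vectors from a pretrained encoder with Hugging Face `transformers`; the checkpoint name and the "bank" example are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["I sat on the river bank.", "I deposited cash at the bank."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, hidden)

# Locate the token "bank" in each sentence and compare its contextual vectors.
bank_id = tokenizer.convert_tokens_to_ids("bank")
vecs = [hidden[i][batch["input_ids"][i] == bank_id][0] for i in range(2)]
cos = torch.nn.functional.cosine_similarity(vecs[0], vecs[1], dim=0)
print(cos.item())   # < 1.0: same surface form, different embeddings in different contexts
```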
Sentence Embeddings#
Sentence-BERT: siamese BERT fine-tuned on NLI pairs. Cosine similarity of [CLS] or mean-pooled embeddings.
SimCSE: contrastive learning; positive pairs are the same sentence encoded twice with different dropout masks, other sentences in the batch serve as negatives.
text-embedding-ada-002 / text-embedding-3: OpenAI’s dense retrieval embeddings.
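A minimal usage sketch with the `sentence-transformers` library; `all-MiniLM-L6-v2` is just one example SBERT-style checkpoint:

```python
from sentence_transformers import SentenceTransformer, util

# Example SBERT-style checkpoint; any sentence encoder is used the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["A man is playing a guitar.",
             "Someone is playing an instrument.",
             "The stock market fell today."]
emb = model.encode(sentences, convert_to_tensor=True)   # pooled sentence vectors

# Pairwise cosine similarities: semantically close sentences score higher.
print(util.cos_sim(emb[0], emb[1]).item())   # high
print(util.cos_sim(emb[0], emb[2]).item())   # low
```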
Embedding Space Properties#
- Cosine similarity: normalized dot product, range $[-1, 1]$, standard for semantic similarity
- Isotropy: ideal embeddings spread uniformly in space (not clustered in narrow cone)
- Dimensionality: typical 128–4096 dims; larger → more expressive, diminishing returns
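A short NumPy sketch of cosine similarity plus a crude isotropy proxy (average pairwise cosine); the embedding matrix here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    """Normalized dot product, in [-1, 1]."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Synthetic embedding matrix: n vectors of dimension d.
n, d = 1000, 256
E = rng.normal(size=(n, d))
print(cosine(E[0], E[1]))

# Crude isotropy proxy: average pairwise cosine similarity.
# Near 0 -> vectors spread roughly uniformly; near 1 -> collapsed into a narrow cone.
En = E / np.linalg.norm(E, axis=1, keepdims=True)
sims = En @ En.T
avg_cos = (sims.sum() - n) / (n * (n - 1))   # exclude diagonal self-similarities
print(avg_cos)
```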