Word Embeddings#

One-Hot Encoding#

Vocabulary of size $V$: each word → $V$-dimensional binary vector. No similarity structure; dimensionality scales with vocab.
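A two-line numpy sketch makes the "no similarity structure" point concrete: every pair of distinct one-hot vectors is orthogonal, so dot products carry no semantic signal (toy $V$ chosen here for illustration).

```python
import numpy as np

V = 5                       # toy vocabulary size
one_hot = np.eye(V)         # row i = one-hot vector for word i
sims = one_hot @ one_hot.T  # pairwise dot products: identity matrix,
                            # i.e. zero similarity between any two distinct words
```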

Word2Vec#

Predict word from context (CBOW) or context from word (Skip-gram):

Skip-gram: maximize $P(\text{context} \mid \text{center}) = \prod P(w_o \mid w_i)$

Negative sampling: instead of full softmax over $V$, sample $k$ negative examples:

$$L = \log \sigma(v_o^\top v_i) + \sum_{n=1}^{k} \mathbb{E}_{w_n \sim P_n(w)}\left[\log \sigma(-v_n^\top v_i)\right]$$

where $P_n(w)$ is the noise distribution (in practice the unigram distribution raised to the 3/4 power).

Result: similar words have similar vectors. Encodes analogy structure: king $-$ man $+$ woman $\approx$ queen.
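The negative-sampling objective above can be sketched in plain numpy. This is a minimal illustration, not a full trainer: vocabulary size, dimension, and the uniform negative sampler are toy assumptions (real implementations sample negatives from the unigram$^{0.75}$ distribution and update both embedding tables by gradient descent).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5               # toy vocab size, embedding dim, #negatives
W_in = rng.normal(0, 0.1, (V, d))   # center-word vectors v_i
W_out = rng.normal(0, 0.1, (V, d))  # context-word vectors v_o

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context, negatives):
    """Skip-gram negative-sampling loss for one (center, context) pair."""
    v_i = W_in[center]
    pos = np.log(sigmoid(W_out[context] @ v_i))           # pull true pair together
    neg = np.log(sigmoid(-W_out[negatives] @ v_i)).sum()  # push noise words apart
    return -(pos + neg)  # negate: maximizing the objective = minimizing this loss

negs = rng.integers(0, V, size=k)   # uniform here; unigram^0.75 in practice
loss = neg_sampling_loss(center=3, context=17, negatives=negs)
```

Only $k+1$ dot products per pair, versus $V$ for the full softmax; that is the entire point of negative sampling.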

GloVe#

Global Vectors — factorize word co-occurrence matrix:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\,(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$

$X_{ij}$ = co-occurrence count of words $i$ and $j$. Weighting $f(x) = (x/x_\max)^\alpha$ for $x < x_\max$, else 1: down-weights rare pairs and caps the influence of very frequent ones.
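A quick sketch of the weighting function and a single term of $J$, assuming the paper's usual defaults $x_\max = 100$, $\alpha = 0.75$ (the vectors and count below are toy values):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x/x_max)^alpha below x_max, capped at 1 above."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# One term of J for a single (i, j) pair, with toy vectors, biases, and count.
rng = np.random.default_rng(0)
w_i, w_tilde_j = rng.normal(size=50), rng.normal(size=50)
b_i = b_j = 0.0
X_ij = 30.0
term = glove_weight(X_ij) * (w_i @ w_tilde_j + b_i + b_j - np.log(X_ij)) ** 2
```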

Contextual Embeddings#

Same word gets different embedding depending on context:

  • ELMo: BiLSTM layers, combine all hidden states
  • BERT: Transformer encoder, [CLS] token for classification
  • GPT: Transformer decoder, last token or mean pooling

Contextual embeddings capture polysemy (“bank”: riverbank vs. financial institution).

Sentence Embeddings#

Sentence-BERT: Siamese BERT fine-tuned on NLI sentence pairs; sentences compared via cosine similarity of mean-pooled (or [CLS]) embeddings.

SimCSE: contrastive learning — two dropout-masked encodings of the same sentence form a positive pair; other sentences in the batch are pushed apart.

text-embedding-ada-002 / text-embedding-3: OpenAI’s dense retrieval embeddings.
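The mean pooling mentioned above can be sketched in numpy: average the token embeddings of a sentence while masking out padding positions (shapes and values below are toy assumptions standing in for a real encoder's output).

```python
import numpy as np

def mean_pool(token_embs, attention_mask):
    """Mean-pool token embeddings into one sentence vector, ignoring padding.

    token_embs: (seq_len, dim) array; attention_mask: 1 = real token, 0 = pad.
    """
    mask = np.asarray(attention_mask, dtype=float)[:, None]  # (seq_len, 1)
    return (token_embs * mask).sum(axis=0) / mask.sum()

# Toy "encoder output": two real tokens plus one padding position.
sent = np.array([[1., 2.], [3., 4.], [0., 0.]])
vec = mean_pool(sent, [1, 1, 0])   # averages only the first two rows
```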

Embedding Space Properties#

  • Cosine similarity: normalized dot product, range $[-1, 1]$, standard for semantic similarity
  • Isotropy: ideal embeddings spread uniformly in space (not clustered in narrow cone)
  • Dimensionality: typical 128–4096 dims; larger → more expressive, diminishing returns
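The first two bullets can be probed with a few lines of numpy: cosine similarity as a normalized dot product, and a rough isotropy check via the average cosine between random embedding pairs (near 0 for well-spread embeddings, large and positive for a narrow-cone collapse). The Gaussian matrix below is a stand-in assumption, isotropic by construction.

```python
import numpy as np

def cosine_sim(a, b):
    """Normalized dot product, in [-1, 1]."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
E = rng.normal(size=(200, 128))               # toy embeddings, isotropic
pairs = rng.integers(0, 200, size=(500, 2))   # random index pairs
avg_cos = np.mean([cosine_sim(E[i], E[j]) for i, j in pairs])
# avg_cos hovers near 0 here; anisotropic (cone-shaped) embeddings would
# give a clearly positive average.
```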