> study notes · ml fundamentals
ML Embeddings & Representations
Word2Vec, GloVe, Sentence Transformers, contrastive learning, embedding geometry, similarity metrics, PCA, and UMAP. Every section opens with core theory then moves into worked exercises that test whether you actually understand it.
- Word2Vec
- GloVe
- SBERT
- SimCLR
- InfoNCE
- Hubness
- PCA
- UMAP
- Matryoshka
- CLIP
Section 1
Word2Vec
Intuition first
Word2Vec trains a shallow neural network on a simple prediction task: given a word, predict its neighbors (Skip-Gram), or given neighbors, predict the center word (CBOW). The network never actually uses its predictions — the weights themselves become the word vectors. Every word gets two vectors: an input vector v (used at inference) and an output vector u (discarded after training). A score between a center-context pair is just their dot product: v_sat · u_cat. High dot product = model thinks they co-occur.
The full softmax over all V words is too slow (O(V) per step), so Negative Sampling replaces it with binary classification: for each real pair (sat, cat), sample K fake pairs — (sat, the), (sat, mat) — and train the model to score real pairs high and fake pairs low. Cost drops to O(K).
The gradient is the engine. For a real pair (sat→cat): ∂J/∂v_sat = (σ(score) − 1) · u_cat. If the model is wrong (score is low, σ ≈ 0.1), the error is −0.9 — a large nudge pushing v_sat toward u_cat. For a fake pair (sat→the), the gradient pushes v_sat away from u_the. Do this millions of times and vectors self-organize: "cat" and "mat" both appear next to "the" and "a" constantly, so both get nudged toward the same output vectors repeatedly — ending up geometrically close. No labels needed. Similarity emerges purely from shared context.
At inference, cosine similarity between input vectors is your similarity score. The famous result — king − man + woman ≈ queen — works because the "gender" relationship encodes as a consistent linear direction in the space (Levy & Goldberg proved Word2Vec implicitly factorizes the PMI matrix, so linear structure in co-occurrence statistics maps to linear offsets in vector space).
Two things to nail in an interview
1. You use the input matrix V, not U — the output matrix is discarded after training.
2. The noise distribution is P_n(w) = f(w)^(3/4) / Z — raising frequencies to the 3/4 power stops "the" and "a" from dominating negative samples by compressing the frequency range without flattening it entirely.
Skip-Gram Objective
Negative Sampling (NEG)
Full softmax over vocabulary is O(V) — prohibitive. NEG approximates it with a binary classification task over K noise samples. Noise distribution P_n(w) = U(w)^3/4 / Z downsamples frequent words and upsamples rare ones.
Q1: Why does Skip-Gram outperform CBOW on rare words, while CBOW tends to be faster and better on frequent words?
▸ Answer
Skip-Gram makes K separate prediction tasks per word (one per context word). Each rare word gets many gradient updates — one for every window position in which it appears. CBOW averages context vectors before predicting, smoothing the signal. That averaging helps frequent words get stable gradients, but for rare words seen only a few times, the averaging loses information. Skip-Gram preserves the individual word-context relationships, giving each rare word more discriminative signal per occurrence.
In practice: use Skip-Gram when vocabulary has a heavy tail (domain-specific text); use CBOW when speed matters and vocabulary is balanced.
Q2: What bias does Negative Sampling introduce, and how does it affect downstream tasks?
▸ Answer
NEG trains each word pair as an independent binary classification problem rather than multi-class. The learned P(w_O | w_I) is not a valid probability distribution (doesn't sum to 1 over vocab). Frequency bias: words sampled more often as negatives have embeddings pushed away from more other vectors, causing their norms to be systematically smaller — so high-frequency function words cluster near the origin.
Downstream impact: for retrieval using cosine similarity, this bias is largely harmless since cosine normalizes norms. For tasks requiring calibrated probabilities, the embeddings should not be used directly.
Q3: The analogy 'king - man + woman ≈ queen' works in Word2Vec. Derive algebraically what linguistic property the model must have learned.
▸ Answer
The analogy holds when: v(king) - v(man) + v(woman) ≈ v(queen). Rearranging: v(king) - v(queen) ≈ v(man) - v(woman). This means the vector difference between 'king' and 'queen' equals the difference between 'man' and 'woman' — the model encoded the 'gender' concept as a consistent linear direction in embedding space.
Limitations: works ~65% of the time on Google's analogy test set. Fails for irregular mappings, polysemous words, and when the analogy direction interacts with frequency biases.
Q4 (Advanced): Levy & Goldberg (2014) proved Word2Vec with NEG implicitly factorizes the shifted PMI matrix. What does this imply about when analogy arithmetic fails?
▸ Answer
With K negatives, the optimal Word2Vec solution satisfies: v_w · v_c = PMI(w,c) - log K. So Word2Vec performs a low-rank factorization of the shifted PMI matrix — the same family as GloVe. Analogy arithmetic fails when:
- →Low-frequency words: PMI estimates are noisy, so the linear direction is unreliable.
- →Polysemous words: 'bank' averages over financial/river senses — analogy fails because the vector is not in a consistent semantic subspace.
- →Non-linear relationships: antonyms appear in identical contexts ('hot'/'cold'), so they cluster together rather than being linearly separated.
- →Domain shift: the gender direction in news text differs from social media — analogies are corpus-specific.
fastText Extension
fastText extends Word2Vec with subword n-gram embeddings: v(w) = mean of all character n-gram vectors. This enables OOV handling (any unseen word decomposes into known n-grams) and improves morphologically rich languages (Turkish, Finnish, German) by sharing parameters across inflected forms.
Section 2
GloVe — Global Vectors for Word Representation
GloVe (Pennington et al., 2014) directly factorizes the global word-word co-occurrence matrix instead of using a predictive local-context model. The intuition: the ratio of co-occurrence probabilities encodes semantic relationships more reliably than raw counts.
Q5: GloVe uses weighted least-squares loss on log co-occurrence counts. Justify every design choice from first principles.
▸ Answer
- →Log transformation: Raw counts span many orders of magnitude. Log compresses this range, preventing dominant pairs from drowning out informative rare pairs.
- →Weighting f(X_ij): Without weighting, a pair co-occurring 10,000× contributes 10,000× more to the loss. f caps contribution at x_max and uses α=0.75 (empirically better than 1) to soft-cap without hard thresholding.
- →Bias terms b_i, b_j: Absorb word-specific frequency effects, freeing vector dimensions to encode purely semantic content.
- →Sum of w and w̃: The symmetry X_ij = X_ji means averaging the two embedding matrices exploits this, halving variance without adding bias.
Q6: Compare Word2Vec (Skip-Gram + NEG) and GloVe theoretically and empirically. When would you choose one over the other?
▸ Answer
| Dimension | Word2Vec (Skip-Gram + NEG) | GloVe |
|---|---|---|
| Training | Online — one pair at a time, SGD | Batch — full co-occurrence matrix, then weighted LS |
| Weighting | Implicit (sampling frequency) | Explicit via f(X) — more principled |
| Scaling | Scales better with very large corpora | Faster convergence on fixed corpora |
| Rare words | Better — noise distribution oversamples rare words | Worse — f(X_ij)→0 for rare pairs, near-zero gradient |
Choose GloVe: fixed corpus, train once quickly, corpus <10B tokens. Choose Word2Vec: streaming/growing corpus, incremental fine-tuning, very large data.
Q7 (Advanced): GloVe often trains faster yet struggles more with rare words. Explain both from the objective function.
▸ Answer
Faster training: GloVe trains on non-zero entries of the sparse co-occurrence matrix X — far fewer than T corpus tokens. Plus the weighted LS objective has better curvature than binary classification in NEG, converging in fewer epochs.
Rare word struggle: For rare pairs, X_ij is small. Plugging into f(X_ij) = (X_ij/100)^0.75 ≈ near zero — their gradient contribution is near zero. Word2Vec's noise distribution P_n(w) ∝ f(w)^3/4 deliberately oversamples rare words; a rare word appearing once still gets sampled as a negative many times.
Practical implication: on domain-specific corpora with many rare technical terms (biomedical), Word2Vec generally produces better rare-word embeddings.
Section 3
Sentence Transformers (SBERT)
SBERT (Reimers & Gurevych, 2019) adapts BERT to produce fixed-size sentence embeddings via a pooling layer and contrastive fine-tuning. Vanilla BERT is unsuitable for sentence similarity out of the box: mean-pooling BERT embeddings is outperformed by GloVe mean-pooling, because BERT's pre-training (MLM + NSP) doesn't optimize for geometric similarity.
SBERT Architecture & Training Objectives
Q8: Why is mean-pooling over BERT token embeddings a poor sentence representation without fine-tuning?
▸ Answer
BERT is trained with MLM and NSP. Neither requires sentences with similar meanings to have nearby representations. MLM pushes each token's representation to be predictable from local context — the resulting sentence embedding is dominated by high-frequency tokens like 'the'.
Geometric issue — anisotropy: BERT representations occupy a narrow cone in high-dimensional space. Nearly all sentence embeddings have high cosine similarity (~0.99) with each other regardless of semantic content, because dominant directions are captured by the most frequent tokens. After SBERT fine-tuning, the model learns to spread representations across the cone, making cosine similarity meaningful.
Q9: Build a semantic search system over 100M documents with sub-50ms P99 latency. Full system design.
▸ Answer
- →Embedding model: Distilled SBERT (all-MiniLM-L6-v2, d=384). 5× speedup over full BERT with <5% accuracy loss. Quantize to INT8 for 2× memory reduction.
- →Offline indexing: Batch encode (GPU cluster, ~1M docs/hour). Store in float16. Build HNSW index (ef_construction=200, M=48) → >95% recall@10 at <5ms.
- →Two-stage serving: HNSW returns top-1000 candidates (~5ms) → cross-encoder re-ranks (~20ms). Total <30ms P50.
- →Scaling: Shard HNSW across 10 nodes (10M vectors each). Merge top-K results per shard. Use Weaviate/Milvus if incremental updates are needed (HNSW re-indexing is expensive).
- →Failure mode: Embedding drift when model updates. Maintain versioned indexes and do zero-downtime blue-green deployment.
Q10: Bi-encoders vs. cross-encoders — derive computational complexity and explain when to use each.
▸ Answer
| Bi-Encoder (SBERT) | Cross-Encoder | |
|---|---|---|
| Encoding | Query and doc independently | Query + doc concatenated |
| Pre-compute | Yes — offline document embeddings | No — must run at query time |
| Query cost | O(1) encoder + O(N·d) dot products | O(N·L²) — N full BERT passes |
| Accuracy | Lower (no token-level interaction) | Higher (full cross-attention) |
| Use for | First-stage retrieval (top-K) | Re-ranking the top-K results |
Production pattern: bi-encoder retrieves top-100 candidates fast → cross-encoder re-ranks those 100. The recall-speed tradeoff parameter is K — larger K improves recall but increases re-ranking cost.
Matryoshka Representation Learning (MRL)
MRL (Kusupati et al., 2022) trains a single model so any prefix of the embedding is itself a good representation. Loss: L_MRL = Sum_m w_m * L(e_1:m, labels) for m in {64,128,256,512,1024}. A d=1024 MRL embedding can be truncated to d=64 at inference for fast first-stage retrieval — no separate models, no extra storage. MRL at d=128 achieves ~97-99% of the recall@10 of a dedicated d=128 model.
Section 4
Contrastive Learning — SimCLR & InfoNCE
Contrastive learning learns representations by pulling together positive pairs and pushing apart negative pairs, without hand-labeled classes. SimCLR and InfoNCE are the foundational frameworks.
Core Idea
For each anchor, define a positive (same sample, different augmentation) and negatives (different samples). The loss maximizes similarity to the positive while minimizing similarity to all negatives. Structurally identical to cross-entropy softmax, but with no fixed labels — the "correct class" is dynamically defined per anchor.
SimCLR — NT-Xent Loss
Given a batch of N samples, create 2 augmented views each → 2N total. For anchor z_i, its positive is z_j (other view of same sample). Negatives are all other 2N−2 samples in the batch. Loss is computed symmetrically (both z_i→z_j and z_j→z_i) then averaged.
- →Same encoder for both views
- →Nonlinear projection head on top of encoder before loss — key finding from the SimCLR paper
- →Requires large batches (4096+) for enough negatives
InfoNCE — as used in MoCo
Query q is contrasted against 1 positive key k⁺ and K negatives from a memory bank. Loss is computed in one direction only (query → key).
- →Asymmetric: query encoder updated by backprop, key encoder updated by momentum (moving average)
- →Memory bank stores negatives from previous batches → K can be 65 536+
- →Loss computed in one direction only (query → key)
- →Negatives more diverse but slightly stale
SimCLR vs InfoNCE / MoCo
| SimCLR | InfoNCE / MoCo | |
|---|---|---|
| Negatives | In-batch (2N−2) | Memory bank (65 536+) |
| Encoder | Symmetric, same weights | Asymmetric, momentum key encoder |
| Loss direction | Symmetric (both views) | Asymmetric (query → key only) |
| Projection head | Yes — key contribution of SimCLR paper | MoCo v2+ borrowed this from SimCLR |
| Compute requirement | Large batch / TPUs needed | Scalable; smaller batches fine |
Q11: Derive why temperature τ in the InfoNCE loss matters. What happens as τ → 0 and τ → ∞?
▸ Answer
τ acts as a sharpness parameter on the similarity softmax.
τ → 0:
Softmax → hard argmax. Only the hardest negative receives gradient. Training becomes numerically unstable, causing collapse or very slow convergence.
τ → ∞:
Softmax → uniform over all negatives. Every negative contributes equally, including easy ones. Gradient is spread so thin the model barely learns to discriminate.
Optimal τ (~0.07):
Concentrates gradient on the top few hard negatives while remaining numerically stable. Also implicitly maximizes uniformity of the embedding distribution on the hypersphere (Wang & Isola 2020).
Q12: SimCLR requires very large batch sizes (4096+). Explain why, and describe two methods to make contrastive learning work with small batches.
▸ Answer
The InfoNCE denominator sums over 2N-2 negatives. With small N, the hardest negative is far from the positive (providing weak gradient). Also false negatives become proportionally larger at small N.
- →Memory bank (MoCo): Maintain a FIFO queue of embeddings from previous batches (K=65536 negatives without GPU memory overhead). A momentum encoder (β=0.999) ensures queue consistency.
- →Hard negative mining: Maintain a FAISS ANN index and mine negatives close to the anchor. A small batch of 64 with hand-picked hard negatives can outperform a random batch of 4096. Apply a similarity threshold to filter false negatives.
Q13: What is representation collapse? How do SimCLR, BYOL, and Barlow Twins each address it?
▸ Answer
Collapse: the encoder maps all inputs to the same constant vector. The loss is minimized trivially — positive and negative pairs are all at the same point.
Q14 (Advanced): SimCLR discards the projection head g(·) at inference, keeping only f(·). Why does this consistently improve downstream task performance?
▸ Answer
The projection head g(·) must make z = g(f(x)) invariant to augmentations. If augmentations include color jitter, z must discard color information. If they include cropping, z discards spatial layout. The projection head is the layer that discards this information — but augmentation-specific features (color, spatial layout) are often task-relevant.
Information bottleneck view: g maximizes I(z; augmentation-invariant features) while minimizing I(z; augmentation-specific features). f(·) retains the richer, more linearly separable representation. Design implication: the projection head can be aggressive (small bottleneck) without harming downstream performance, since it is discarded.
Section 5
Embedding Space Geometry
Embedding spaces are Euclidean (R^d), but distances, angles, and norms behave in surprising ways at high dimension. Understanding the geometry is critical for debugging retrieval systems.
Isotropy vs. Anisotropy
Isotropic embeddings are uniformly distributed across the unit hypersphere — maximizes the expressiveness of cosine similarity. Anisotropic embeddings (raw BERT) cluster in a narrow cone: most pairs appear highly similar regardless of semantic content. Measure: average cosine similarity of random pairs. Isotropic → ~0. Anisotropic → much greater than 0.
Q15: Explain the 'hubness' phenomenon in high-dimensional spaces. Derive why it occurs and describe two mitigations.
▸ Answer
Hubness: as dimensionality d increases, a small fraction of points become the nearest neighbor of O(N^0.5) other points. This causes top-K retrieval to return the same few hub points for most queries, degrading diversity and accuracy.
Why it occurs: In R^d, for random unit vectors, Var[sim(x,y)] = 1/d. Points near the centroid of the data distribution have systematically smaller average distance to all other points, making them hubs. The effect grows with dimensionality.
- →Mitigation 1 — Mutual k-NN: Only count x_j as a neighbor of x_i if x_i is also a neighbor of x_j. Hubs can be neighbors of many, but few are neighbors of hubs. Reduces hubness but reduces recall.
- →Mitigation 2 — CSLS score: sim_CSLS(x,y) = 2·cos(x,y) - r_k(x) - r_k(y), where r_k(x) is mean similarity of x to its k nearest neighbors. Normalizes by neighborhood crowdedness, penalizing hubs. Standard in cross-lingual word translation.
Q16: Your embedding model produces L2 norms that correlate strongly with word frequency. What geometric problem does this cause and how do you fix it?
▸ Answer
If ||v_w|| correlates with frequency, then dot product s(u,v) = u^T v ∝ ||u||·||v||·cos(θ). High-frequency words get large dot products with everything, biasing retrieval toward frequent words. Cosine similarity (cosine = u^Tv / (||u||·||v||)) is norm-invariant, so it is unaffected.
- →Fix 1: L2-normalize all embeddings before indexing. Projects everything onto the unit hypersphere. Cosine and dot product become equivalent.
- →Fix 2: Whitening transform: subtract mean, multiply by Σ^{-1/2}. Projects embeddings into an isotropic space. More expensive but removes both norm bias and anisotropy (effective for post-processing BERT embeddings).
Section 6
Similarity Metrics — When to Use What
| Metric | Formula | When to Use | Watch Out For |
|---|---|---|---|
| Cosine | u·v / (||u||·||v||) | Default for normalized embeddings; norm-invariant | Hubness still present; doesn't use magnitude info |
| Dot Product | u·v | Normalized embeddings; recommendation systems (MIPS) | Biased by norm/frequency if embeddings not normalized |
| Euclidean | ||u - v||₂ | After PCA/UMAP; metric-space trained embeddings | Curse of dimensionality; bad for raw high-d embeddings |
Q17: When does dot product outperform cosine similarity? Derive the condition formally.
▸ Answer
Dot product = ||u||·||v||·cos(θ). Cosine = cos(θ). Dot product adds the magnitude term ||u||·||v||. So dot product outperforms cosine when magnitude carries task-relevant signal.
This happens in recommendation systems: item popularity (encoded in the embedding norm) is itself a signal — a popular item should rank higher all else equal. In MIPS (Maximum Inner Product Search), you explicitly want the item with the highest norm-weighted similarity.
Condition: use dot product when embeddings are L2-normalized (then dot product = cosine, no difference) OR when magnitude encodes task-relevant information. Use cosine when you want pure directional similarity independent of training-time frequency effects.
Section 7
PCA — Dimensionality Reduction for Embeddings
PCA finds orthogonal directions of maximum variance in the embedding space. The first k principal components capture the most variance, compressing the embedding while losing the least information.
Q18: When does PCA hurt retrieval quality rather than help it?
▸ Answer
PCA removes the directions of lowest variance. But low-variance dimensions can still be semantically important for retrieval — a dimension that rarely varies across your corpus might be the exact signal that distinguishes relevant from irrelevant documents for a specific query.
- →Problem 1: Top PCs may capture noise or frequency effects (e.g., document length), not semantic content. The first principal component of BERT embeddings often tracks sentence length.
- →Problem 2: PCA is linear — it cannot capture non-linear structure in the embedding manifold.
- →Problem 3: PCA truncation distorts relative distances between documents. Two documents close in the original space may be far apart after truncation if they differ mainly in low-variance dimensions.
- →Fix: Apply PCA after whitening (normalizing variance per dimension). This prevents frequency/length effects from dominating PCs. Alternatively, use PCA only as a post-processing step before HNSW indexing, not for the retrieval representation itself.
Section 8
UMAP — Non-Linear Dimensionality Reduction
UMAP (McInnes et al., 2018) preserves local topological structure using fuzzy simplicial complexes. It constructs a weighted k-NN graph in the high-dimensional space, then minimizes the cross-entropy between the high-d and low-d fuzzy sets via SGD.
| PCA | UMAP | |
|---|---|---|
| Type | Linear | Non-linear |
| New point projection | O(d·k) matrix multiply — fast | Requires re-running on full dataset (Parametric UMAP fixes this) |
| Distance meaning | Approx. preserves global Euclidean structure | Local distances meaningful; global distances are NOT |
| Deterministic | Yes (up to sign flips) | No — random initialization |
| Hyperparameters | k (explained variance threshold) | n_neighbors, min_dist — sensitive, no principled choice |
| Use for | Compression before indexing; production retrieval | Visualization; cluster diagnostics |
Q19 (Advanced): List the specific ways UMAP fails in production retrieval systems that PCA does not. When is Parametric UMAP a valid middle ground?
▸ Answer
Parametric UMAP: trains a neural network to approximate the UMAP embedding function — new points embedded via a forward pass (O(1)). Solves problems 1 and 3, but retains global distance distortion and hyperparameter sensitivity, plus adds neural approximation error.
Decision rule: PCA for production retrieval/compression. UMAP for visualization and cluster diagnostics only. Parametric UMAP for monitoring embedding drift where visual fidelity matters more than retrieval accuracy.
Quick Reference
Method Comparison Table
| Method | Key Mechanism | Weakness | When to Use |
|---|---|---|---|
| Word2Vec (SG+NEG) | Local context prediction; noise contrastive training | No subword; frequency bias in norms | Static word embeddings on large corpora |
| GloVe | Weighted factorization of log co-occurrence matrix | Requires full corpus upfront; no subword | Fixed corpus, fast training, interpretable |
| Sentence Transformers | Siamese BERT with similarity fine-tuning | Expensive encoding; fixed context window | Semantic search, clustering, retrieval |
| SimCLR / InfoNCE | Pull positives, push negatives via temperature softmax | Large batch required; augmentation design critical | Self-supervised vision/NLP pretraining |
| Cosine Similarity | Angle between vectors, norm-invariant | Hubness still present | Default for normalized embeddings |
| Dot Product | Unnormalized inner product | Biased by norm/frequency if not normalized | Normalized embeddings; recommendation systems |
| PCA | Linear projection maximizing variance | Linear only; top PCs may capture noise | Compression before indexing; visualization baseline |
| UMAP | Fuzzy topological structure preservation | No quantitative distance meaning; non-deterministic | Visualization; manifold-aware cluster diagnostics |