> study notes · ml fundamentals

ML Embeddings & Representations

Word2Vec, GloVe, Sentence Transformers, contrastive learning, embedding geometry, similarity metrics, PCA, and UMAP. Every section opens with core theory then moves into worked exercises that test whether you actually understand it.

Word2Vec
GloVe
SBERT
SimCLR
InfoNCE
Hubness
PCA
UMAP
Matryoshka
CLIP

Section 1

Word2Vec

Intuition first

Word2Vec trains a shallow neural network on a simple prediction task: given a word, predict its neighbors (Skip-Gram), or given neighbors, predict the center word (CBOW). The network never actually uses its predictions — the weights themselves become the word vectors. Every word gets two vectors: an input vector v (used at inference) and an output vector u (discarded after training). A score between a center-context pair is just their dot product: v_sat · u_cat. High dot product = model thinks they co-occur.

The full softmax over all V words is too slow (O(V) per step), so Negative Sampling replaces it with binary classification: for each real pair (sat, cat), sample K fake pairs — (sat, the), (sat, mat) — and train the model to score real pairs high and fake pairs low. Cost drops to O(K).

The gradient is the engine. For a real pair (sat→cat): ∂J/∂v_sat = (σ(score) − 1) · u_cat. If the model is wrong (score is low, σ ≈ 0.1), the error is −0.9 — a large nudge pushing v_sat toward u_cat. For a fake pair (sat→the), the gradient pushes v_sat away from u_the. Do this millions of times and vectors self-organize: "cat" and "mat" both appear next to "the" and "a" constantly, so both get nudged toward the same output vectors repeatedly — ending up geometrically close. No labels needed. Similarity emerges purely from shared context.

At inference, cosine similarity between input vectors is your similarity score. The famous result — king − man + woman ≈ queen — works because the "gender" relationship encodes as a consistent linear direction in the space (Levy & Goldberg proved Word2Vec implicitly factorizes the PMI matrix, so linear structure in co-occurrence statistics maps to linear offsets in vector space).

Two things to nail in an interview

1. You use the input matrix V, not U — the output matrix is discarded after training.

2. The noise distribution is P_n(w) = f(w)^(3/4) / Z — raising frequencies to the 3/4 power stops "the" and "a" from dominating negative samples by compressing the frequency range without flattening it entirely.

Skip-Gram Objective

J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \leq j \leq c \\ j \neq 0}} \log P(w_{t+j} \mid w_t)

P(w_O \mid w_I) = \frac{\exp({v'}_{w_O}^{\top} \, v_{w_I})}{\sum_{w} \exp({v'}_w^{\top} \, v_{w_I})}

Negative Sampling (NEG)

Full softmax over vocabulary is O(V) — prohibitive. NEG approximates it with a binary classification task over K noise samples. Noise distribution P_n(w) = U(w)^3/4 / Z downsamples frequent words and upsamples rare ones.

NEG Objective

J_\text{NEG} = \log \sigma({v'}_{w_O}^{\top} v_{w_I}) + \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n}\!\left[\log \sigma(-{v'}_{w_k}^{\top} v_{w_I})\right]

\frac{\partial J}{\partial v_c} = \underbrace{(\sigma(u_o \cdot v_c) - 1) \cdot u_o}_{\text{push toward positive}} + \underbrace{\sum_k \sigma(u_{w_k} \cdot v_c) \cdot u_{w_k}}_{\text{push away from negatives}}

Q1: Why does Skip-Gram outperform CBOW on rare words, while CBOW tends to be faster and better on frequent words?

▸ Answer

Skip-Gram makes K separate prediction tasks per word (one per context word). Each rare word gets many gradient updates — one for every window position in which it appears. CBOW averages context vectors before predicting, smoothing the signal. That averaging helps frequent words get stable gradients, but for rare words seen only a few times, the averaging loses information. Skip-Gram preserves the individual word-context relationships, giving each rare word more discriminative signal per occurrence.

In practice: use Skip-Gram when vocabulary has a heavy tail (domain-specific text); use CBOW when speed matters and vocabulary is balanced.

Q2: What bias does Negative Sampling introduce, and how does it affect downstream tasks?

▸ Answer

NEG trains each word pair as an independent binary classification problem rather than multi-class. The learned P(w_O | w_I) is not a valid probability distribution (doesn't sum to 1 over vocab). Frequency bias: words sampled more often as negatives have embeddings pushed away from more other vectors, causing their norms to be systematically smaller — so high-frequency function words cluster near the origin.

Downstream impact: for retrieval using cosine similarity, this bias is largely harmless since cosine normalizes norms. For tasks requiring calibrated probabilities, the embeddings should not be used directly.

Q3: The analogy 'king - man + woman ≈ queen' works in Word2Vec. Derive algebraically what linguistic property the model must have learned.

▸ Answer

The analogy holds when: v(king) - v(man) + v(woman) ≈ v(queen). Rearranging: v(king) - v(queen) ≈ v(man) - v(woman). This means the vector difference between 'king' and 'queen' equals the difference between 'man' and 'woman' — the model encoded the 'gender' concept as a consistent linear direction in embedding space.

Limitations: works ~65% of the time on Google's analogy test set. Fails for irregular mappings, polysemous words, and when the analogy direction interacts with frequency biases.

Q4 (Advanced): Levy & Goldberg (2014) proved Word2Vec with NEG implicitly factorizes the shifted PMI matrix. What does this imply about when analogy arithmetic fails?

▸ Answer

With K negatives, the optimal Word2Vec solution satisfies: v_w · v_c = PMI(w,c) - log K. So Word2Vec performs a low-rank factorization of the shifted PMI matrix — the same family as GloVe. Analogy arithmetic fails when:

→Low-frequency words: PMI estimates are noisy, so the linear direction is unreliable.
→Polysemous words: 'bank' averages over financial/river senses — analogy fails because the vector is not in a consistent semantic subspace.
→Non-linear relationships: antonyms appear in identical contexts ('hot'/'cold'), so they cluster together rather than being linearly separated.
→Domain shift: the gender direction in news text differs from social media — analogies are corpus-specific.

fastText Extension

fastText extends Word2Vec with subword n-gram embeddings: v(w) = mean of all character n-gram vectors. This enables OOV handling (any unseen word decomposes into known n-grams) and improves morphologically rich languages (Turkish, Finnish, German) by sharing parameters across inflected forms.

Section 2

GloVe — Global Vectors for Word Representation

GloVe (Pennington et al., 2014) directly factorizes the global word-word co-occurrence matrix instead of using a predictive local-context model. The intuition: the ratio of co-occurrence probabilities encodes semantic relationships more reliably than raw counts.

GloVe Objective

J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2

f(x) = \begin{cases} (x/x_{\max})^\alpha & x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad \alpha = 0.75,\; x_{\max} = 100

v_{\text{final}} = w + \tilde{w} \quad \text{(average of word + context vectors)}

Q5: GloVe uses weighted least-squares loss on log co-occurrence counts. Justify every design choice from first principles.

▸ Answer

→Log transformation: Raw counts span many orders of magnitude. Log compresses this range, preventing dominant pairs from drowning out informative rare pairs.
→Weighting f(X_ij): Without weighting, a pair co-occurring 10,000× contributes 10,000× more to the loss. f caps contribution at x_max and uses α=0.75 (empirically better than 1) to soft-cap without hard thresholding.
→Bias terms b_i, b_j: Absorb word-specific frequency effects, freeing vector dimensions to encode purely semantic content.
→Sum of w and w̃: The symmetry X_ij = X_ji means averaging the two embedding matrices exploits this, halving variance without adding bias.

Q6: Compare Word2Vec (Skip-Gram + NEG) and GloVe theoretically and empirically. When would you choose one over the other?

▸ Answer

Dimension	Word2Vec (Skip-Gram + NEG)	GloVe
Training	Online — one pair at a time, SGD	Batch — full co-occurrence matrix, then weighted LS
Weighting	Implicit (sampling frequency)	Explicit via f(X) — more principled
Scaling	Scales better with very large corpora	Faster convergence on fixed corpora
Rare words	Better — noise distribution oversamples rare words	Worse — f(X_ij)→0 for rare pairs, near-zero gradient

Choose GloVe: fixed corpus, train once quickly, corpus <10B tokens. Choose Word2Vec: streaming/growing corpus, incremental fine-tuning, very large data.

Q7 (Advanced): GloVe often trains faster yet struggles more with rare words. Explain both from the objective function.

▸ Answer

Faster training: GloVe trains on non-zero entries of the sparse co-occurrence matrix X — far fewer than T corpus tokens. Plus the weighted LS objective has better curvature than binary classification in NEG, converging in fewer epochs.

Rare word struggle: For rare pairs, X_ij is small. Plugging into f(X_ij) = (X_ij/100)^0.75 ≈ near zero — their gradient contribution is near zero. Word2Vec's noise distribution P_n(w) ∝ f(w)^3/4 deliberately oversamples rare words; a rare word appearing once still gets sampled as a negative many times.

Practical implication: on domain-specific corpora with many rare technical terms (biomedical), Word2Vec generally produces better rare-word embeddings.

Section 3

Sentence Transformers (SBERT)

SBERT (Reimers & Gurevych, 2019) adapts BERT to produce fixed-size sentence embeddings via a pooling layer and contrastive fine-tuning. Vanilla BERT is unsuitable for sentence similarity out of the box: mean-pooling BERT embeddings is outperformed by GloVe mean-pooling, because BERT's pre-training (MLM + NSP) doesn't optimize for geometric similarity.

SBERT Architecture & Training Objectives

SBERT Training Objectives

\text{Classification:} \quad J = -\log \operatorname{softmax}\!\bigl(W \cdot [u;\, v;\, |u-v|]\bigr)

\text{Cosine loss:} \quad J = \operatorname{MSE}\!\bigl(\cos(u,v),\; \text{gold\_score}\bigr)

\text{Triplet loss:} \quad J = \max\!\bigl(0,\; \|u - v_+\| - \|u - v_-\| + \varepsilon\bigr)

Q8: Why is mean-pooling over BERT token embeddings a poor sentence representation without fine-tuning?

▸ Answer

BERT is trained with MLM and NSP. Neither requires sentences with similar meanings to have nearby representations. MLM pushes each token's representation to be predictable from local context — the resulting sentence embedding is dominated by high-frequency tokens like 'the'.

Geometric issue — anisotropy: BERT representations occupy a narrow cone in high-dimensional space. Nearly all sentence embeddings have high cosine similarity (~0.99) with each other regardless of semantic content, because dominant directions are captured by the most frequent tokens. After SBERT fine-tuning, the model learns to spread representations across the cone, making cosine similarity meaningful.

Q9: Build a semantic search system over 100M documents with sub-50ms P99 latency. Full system design.

▸ Answer

→Embedding model: Distilled SBERT (all-MiniLM-L6-v2, d=384). 5× speedup over full BERT with <5% accuracy loss. Quantize to INT8 for 2× memory reduction.
→Offline indexing: Batch encode (GPU cluster, ~1M docs/hour). Store in float16. Build HNSW index (ef_construction=200, M=48) → >95% recall@10 at <5ms.
→Two-stage serving: HNSW returns top-1000 candidates (~5ms) → cross-encoder re-ranks (~20ms). Total <30ms P50.
→Scaling: Shard HNSW across 10 nodes (10M vectors each). Merge top-K results per shard. Use Weaviate/Milvus if incremental updates are needed (HNSW re-indexing is expensive).
→Failure mode: Embedding drift when model updates. Maintain versioned indexes and do zero-downtime blue-green deployment.

Q10: Bi-encoders vs. cross-encoders — derive computational complexity and explain when to use each.

▸ Answer

	Bi-Encoder (SBERT)	Cross-Encoder
Encoding	Query and doc independently	Query + doc concatenated
Pre-compute	Yes — offline document embeddings	No — must run at query time
Query cost	O(1) encoder + O(N·d) dot products	O(N·L²) — N full BERT passes
Accuracy	Lower (no token-level interaction)	Higher (full cross-attention)
Use for	First-stage retrieval (top-K)	Re-ranking the top-K results

Production pattern: bi-encoder retrieves top-100 candidates fast → cross-encoder re-ranks those 100. The recall-speed tradeoff parameter is K — larger K improves recall but increases re-ranking cost.

Matryoshka Representation Learning (MRL)

MRL (Kusupati et al., 2022) trains a single model so any prefix of the embedding is itself a good representation. Loss: L_MRL = Sum_m w_m * L(e_1:m, labels) for m in {64,128,256,512,1024}. A d=1024 MRL embedding can be truncated to d=64 at inference for fast first-stage retrieval — no separate models, no extra storage. MRL at d=128 achieves ~97-99% of the recall@10 of a dedicated d=128 model.

Section 4

Contrastive Learning — SimCLR & InfoNCE

Contrastive learning learns representations by pulling together positive pairs and pushing apart negative pairs, without hand-labeled classes. SimCLR and InfoNCE are the foundational frameworks.

Core Idea

For each anchor, define a positive (same sample, different augmentation) and negatives (different samples). The loss maximizes similarity to the positive while minimizing similarity to all negatives. Structurally identical to cross-entropy softmax, but with no fixed labels — the "correct class" is dynamically defined per anchor.

SimCLR — NT-Xent Loss

Given a batch of N samples, create 2 augmented views each → 2N total. For anchor z_i, its positive is z_j (other view of same sample). Negatives are all other 2N−2 samples in the batch. Loss is computed symmetrically (both z_i→z_j and z_j→z_i) then averaged.

SimCLR NT-Xent Loss

\mathcal{L}_{i,j} = -\log \frac{\exp\!\bigl(\operatorname{sim}(z_i, z_j)/\tau\bigr)}{\displaystyle\sum_{k=1,\, k\neq i}^{2N} \exp\!\bigl(\operatorname{sim}(z_i, z_k)/\tau\bigr)}

\operatorname{sim}(u,v) = \frac{u^\top v}{\|u\|\,\|v\|}, \quad \tau \approx 0.07

→Same encoder for both views
→Nonlinear projection head on top of encoder before loss — key finding from the SimCLR paper
→Requires large batches (4096+) for enough negatives

InfoNCE — as used in MoCo

Query q is contrasted against 1 positive key k⁺ and K negatives from a memory bank. Loss is computed in one direction only (query → key).

InfoNCE Loss (MoCo)

\mathcal{L} = -\log \frac{\exp\!\bigl(\operatorname{sim}(q, k^+)/\tau\bigr)}{\exp\!\bigl(\operatorname{sim}(q, k^+)/\tau\bigr) + \displaystyle\sum_{i=1}^{K}\exp\!\bigl(\operatorname{sim}(q, k_i)/\tau\bigr)}

→Asymmetric: query encoder updated by backprop, key encoder updated by momentum (moving average)
→Memory bank stores negatives from previous batches → K can be 65 536+
→Loss computed in one direction only (query → key)
→Negatives more diverse but slightly stale

SimCLR vs InfoNCE / MoCo

	SimCLR	InfoNCE / MoCo
Negatives	In-batch (2N−2)	Memory bank (65 536+)
Encoder	Symmetric, same weights	Asymmetric, momentum key encoder
Loss direction	Symmetric (both views)	Asymmetric (query → key only)
Projection head	Yes — key contribution of SimCLR paper	MoCo v2+ borrowed this from SimCLR
Compute requirement	Large batch / TPUs needed	Scalable; smaller batches fine

Q11: Derive why temperature τ in the InfoNCE loss matters. What happens as τ → 0 and τ → ∞?

▸ Answer

τ acts as a sharpness parameter on the similarity softmax.

τ → 0:

Softmax → hard argmax. Only the hardest negative receives gradient. Training becomes numerically unstable, causing collapse or very slow convergence.

τ → ∞:

Softmax → uniform over all negatives. Every negative contributes equally, including easy ones. Gradient is spread so thin the model barely learns to discriminate.

Optimal τ (~0.07):

Concentrates gradient on the top few hard negatives while remaining numerically stable. Also implicitly maximizes uniformity of the embedding distribution on the hypersphere (Wang & Isola 2020).

Q12: SimCLR requires very large batch sizes (4096+). Explain why, and describe two methods to make contrastive learning work with small batches.

▸ Answer

The InfoNCE denominator sums over 2N-2 negatives. With small N, the hardest negative is far from the positive (providing weak gradient). Also false negatives become proportionally larger at small N.

→Memory bank (MoCo): Maintain a FIFO queue of embeddings from previous batches (K=65536 negatives without GPU memory overhead). A momentum encoder (β=0.999) ensures queue consistency.
→Hard negative mining: Maintain a FAISS ANN index and mine negatives close to the anchor. A small batch of 64 with hand-picked hard negatives can outperform a random batch of 4096. Apply a similarity threshold to filter false negatives.

Q13: What is representation collapse? How do SimCLR, BYOL, and Barlow Twins each address it?

▸ Answer

Collapse: the encoder maps all inputs to the same constant vector. The loss is minimized trivially — positive and negative pairs are all at the same point.

SimCLRExplicit negatives penalize any two identical embeddings unless they are augmentations of each other. Requires large batches.

BYOLRemoves negatives entirely. Uses online + target (momentum-updated) networks with a predictor. Batch normalization implicitly prevents collapse by whitening the representation distribution.

Barlow TwinsCross-correlation matrix of embeddings from two views should be the identity: diagonal (invariance) = 1, off-diagonal (redundancy) = 0. Directly penalizes dimensional collapse.

Q14 (Advanced): SimCLR discards the projection head g(·) at inference, keeping only f(·). Why does this consistently improve downstream task performance?

▸ Answer

The projection head g(·) must make z = g(f(x)) invariant to augmentations. If augmentations include color jitter, z must discard color information. If they include cropping, z discards spatial layout. The projection head is the layer that discards this information — but augmentation-specific features (color, spatial layout) are often task-relevant.

Information bottleneck view: g maximizes I(z; augmentation-invariant features) while minimizing I(z; augmentation-specific features). f(·) retains the richer, more linearly separable representation. Design implication: the projection head can be aggressive (small bottleneck) without harming downstream performance, since it is discarded.

Section 5

Embedding Space Geometry

Embedding spaces are Euclidean (R^d), but distances, angles, and norms behave in surprising ways at high dimension. Understanding the geometry is critical for debugging retrieval systems.

Isotropy vs. Anisotropy

Isotropic embeddings are uniformly distributed across the unit hypersphere — maximizes the expressiveness of cosine similarity. Anisotropic embeddings (raw BERT) cluster in a narrow cone: most pairs appear highly similar regardless of semantic content. Measure: average cosine similarity of random pairs. Isotropic → ~0. Anisotropic → much greater than 0.

Q15: Explain the 'hubness' phenomenon in high-dimensional spaces. Derive why it occurs and describe two mitigations.

▸ Answer

Hubness: as dimensionality d increases, a small fraction of points become the nearest neighbor of O(N^0.5) other points. This causes top-K retrieval to return the same few hub points for most queries, degrading diversity and accuracy.

Why it occurs: In R^d, for random unit vectors, Var[sim(x,y)] = 1/d. Points near the centroid of the data distribution have systematically smaller average distance to all other points, making them hubs. The effect grows with dimensionality.

→Mitigation 1 — Mutual k-NN: Only count x_j as a neighbor of x_i if x_i is also a neighbor of x_j. Hubs can be neighbors of many, but few are neighbors of hubs. Reduces hubness but reduces recall.
→Mitigation 2 — CSLS score: sim_CSLS(x,y) = 2·cos(x,y) - r_k(x) - r_k(y), where r_k(x) is mean similarity of x to its k nearest neighbors. Normalizes by neighborhood crowdedness, penalizing hubs. Standard in cross-lingual word translation.

Q16: Your embedding model produces L2 norms that correlate strongly with word frequency. What geometric problem does this cause and how do you fix it?

▸ Answer

If ||v_w|| correlates with frequency, then dot product s(u,v) = u^T v ∝ ||u||·||v||·cos(θ). High-frequency words get large dot products with everything, biasing retrieval toward frequent words. Cosine similarity (cosine = u^Tv / (||u||·||v||)) is norm-invariant, so it is unaffected.

→Fix 1: L2-normalize all embeddings before indexing. Projects everything onto the unit hypersphere. Cosine and dot product become equivalent.
→Fix 2: Whitening transform: subtract mean, multiply by Σ^{-1/2}. Projects embeddings into an isotropic space. More expensive but removes both norm bias and anisotropy (effective for post-processing BERT embeddings).

Section 6

Similarity Metrics — When to Use What

Metric	Formula	When to Use	Watch Out For
Cosine	u·v / (\|\|u\|\|·\|\|v\|\|)	Default for normalized embeddings; norm-invariant	Hubness still present; doesn't use magnitude info
Dot Product	u·v	Normalized embeddings; recommendation systems (MIPS)	Biased by norm/frequency if embeddings not normalized
Euclidean	\|\|u - v\|\|₂	After PCA/UMAP; metric-space trained embeddings	Curse of dimensionality; bad for raw high-d embeddings

Q17: When does dot product outperform cosine similarity? Derive the condition formally.

▸ Answer

Dot product = ||u||·||v||·cos(θ). Cosine = cos(θ). Dot product adds the magnitude term ||u||·||v||. So dot product outperforms cosine when magnitude carries task-relevant signal.

This happens in recommendation systems: item popularity (encoded in the embedding norm) is itself a signal — a popular item should rank higher all else equal. In MIPS (Maximum Inner Product Search), you explicitly want the item with the highest norm-weighted similarity.

Condition: use dot product when embeddings are L2-normalized (then dot product = cosine, no difference) OR when magnitude encodes task-relevant information. Use cosine when you want pure directional similarity independent of training-time frequency effects.

Section 7

PCA — Dimensionality Reduction for Embeddings

PCA finds orthogonal directions of maximum variance in the embedding space. The first k principal components capture the most variance, compressing the embedding while losing the least information.

PCA

\Sigma = \frac{1}{N} X^\top X \quad [d \times d]

\Sigma = V \Lambda V^\top \quad \text{(eigendecomposition)}

x_k = V_k^\top x \quad \text{(k-dimensional projection)}

\text{Explained variance ratio:} \quad \frac{\lambda_i}{\sum_j \lambda_j}

Q18: When does PCA hurt retrieval quality rather than help it?

▸ Answer

PCA removes the directions of lowest variance. But low-variance dimensions can still be semantically important for retrieval — a dimension that rarely varies across your corpus might be the exact signal that distinguishes relevant from irrelevant documents for a specific query.

→Problem 1: Top PCs may capture noise or frequency effects (e.g., document length), not semantic content. The first principal component of BERT embeddings often tracks sentence length.
→Problem 2: PCA is linear — it cannot capture non-linear structure in the embedding manifold.
→Problem 3: PCA truncation distorts relative distances between documents. Two documents close in the original space may be far apart after truncation if they differ mainly in low-variance dimensions.
→Fix: Apply PCA after whitening (normalizing variance per dimension). This prevents frequency/length effects from dominating PCs. Alternatively, use PCA only as a post-processing step before HNSW indexing, not for the retrieval representation itself.

Section 8

UMAP — Non-Linear Dimensionality Reduction

UMAP (McInnes et al., 2018) preserves local topological structure using fuzzy simplicial complexes. It constructs a weighted k-NN graph in the high-dimensional space, then minimizes the cross-entropy between the high-d and low-d fuzzy sets via SGD.

	PCA	UMAP
Type	Linear	Non-linear
New point projection	O(d·k) matrix multiply — fast	Requires re-running on full dataset (Parametric UMAP fixes this)
Distance meaning	Approx. preserves global Euclidean structure	Local distances meaningful; global distances are NOT
Deterministic	Yes (up to sign flips)	No — random initialization
Hyperparameters	k (explained variance threshold)	n_neighbors, min_dist — sensitive, no principled choice
Use for	Compression before indexing; production retrieval	Visualization; cluster diagnostics

Q19 (Advanced): List the specific ways UMAP fails in production retrieval systems that PCA does not. When is Parametric UMAP a valid middle ground?

▸ Answer

1.Not parametric by default: You cannot embed a new point without re-running the full algorithm (O(N log N) per batch). For a streaming system ingesting 1M new embeddings/day, completely infeasible. PCA: new point projection is a single O(d·k) matrix multiply.

2.Global distances are meaningless: UMAP optimizes local neighborhood preservation. Two semantically close clusters can appear far apart in UMAP space. Any retrieval using UMAP distances for cross-cluster queries will silently fail. PCA preserves global Euclidean structure (approximately).

3.Non-deterministic: Random initialization produces different layouts each run. Breaks reproducibility requirements in regulated industries (finance, healthcare).

4.Hyperparameter sensitivity: n_neighbors and min_dist dramatically alter cluster structure with no principled way to choose them for retrieval quality.

Parametric UMAP: trains a neural network to approximate the UMAP embedding function — new points embedded via a forward pass (O(1)). Solves problems 1 and 3, but retains global distance distortion and hyperparameter sensitivity, plus adds neural approximation error.

Decision rule: PCA for production retrieval/compression. UMAP for visualization and cluster diagnostics only. Parametric UMAP for monitoring embedding drift where visual fidelity matters more than retrieval accuracy.

Quick Reference

Method Comparison Table

Method	Key Mechanism	Weakness	When to Use
Word2Vec (SG+NEG)	Local context prediction; noise contrastive training	No subword; frequency bias in norms	Static word embeddings on large corpora
GloVe	Weighted factorization of log co-occurrence matrix	Requires full corpus upfront; no subword	Fixed corpus, fast training, interpretable
Sentence Transformers	Siamese BERT with similarity fine-tuning	Expensive encoding; fixed context window	Semantic search, clustering, retrieval
SimCLR / InfoNCE	Pull positives, push negatives via temperature softmax	Large batch required; augmentation design critical	Self-supervised vision/NLP pretraining
Cosine Similarity	Angle between vectors, norm-invariant	Hubness still present	Default for normalized embeddings
Dot Product	Unnormalized inner product	Biased by norm/frequency if not normalized	Normalized embeddings; recommendation systems
PCA	Linear projection maximizing variance	Linear only; top PCs may capture noise	Compression before indexing; visualization baseline
UMAP	Fuzzy topological structure preservation	No quantitative distance meaning; non-deterministic	Visualization; manifold-aware cluster diagnostics