$ kuntal_pal

> study notes · amazon applied scientist ii

RAG Architecture & Evaluation

Comprehensive notes covering retrieval-augmented generation end-to-end: pipeline architecture, chunking strategies, embedding models, vector databases, retrieval methods, evaluation metrics, failure modes, and production considerations.

  • RAG
  • Embeddings
  • Vector DBs
  • BM25
  • NDCG
  • RAGAS
  • Agentic RAG
  • Production ML

Background

Why LLMs Hallucinate: 3-Point Summary

01

Next-Token Prediction Without Ground Truth or Labels

LLMs work by predicting the next word based on patterns in pretraining data — but crucially, there are no true/false labels. The model only learns to mimic the language distribution of fluent text, not to distinguish valid facts from false ones. This means hallucinations are built in: when uncertain, it generates whatever sounds most plausible statistically, with no mechanism to verify against reality.

02

Training Data Contains Noise and Incomplete Coverage

Models learn from training data that includes misinformation, outdated facts, rare knowledge gaps, and conflicting sources. They can't distinguish between "common in my training data" and "true", so they fill gaps with statistically plausible but incorrect information.

03

Evaluation Metrics Reward Guessing Over Uncertainty

Many standard benchmarks score accuracy only, which incentivizes models to guess rather than admit uncertainty: guessing a birthday has a 1-in-365 chance of being right, while saying "I don't know" scores zero. This creates perverse incentives during training, where confident hallucinations are rewarded over honest abstention.

RAG addresses the gap these failures expose.

By grounding the model in externally retrieved documents at inference time, RAG replaces statistical guessing with evidence-backed generation — and makes answers auditable.

Section 1

What is RAG and Why It Exists

Retrieval-Augmented Generation (RAG) is a framework that enhances LLMs by grounding their outputs in externally retrieved documents. Instead of relying purely on parametric knowledge baked into model weights during training, RAG retrieves relevant context at inference time and injects it into the prompt.

The Core Problem RAG Solves

  • Knowledge cutoff: LLMs have a training cutoff date and cannot answer questions about recent events without retrieval.
  • Hallucination: LLMs generate plausible-sounding but factually incorrect text. Retrieved documents ground the answer in real sources.
  • Domain specificity: General-purpose LLMs lack proprietary or niche domain knowledge (e.g., sustainability emissions factors, internal policy documents).
  • Auditability: RAG outputs can be traced back to source documents — critical for sustainability claims verification.

RAG Generations

Generation | Key Characteristics
Naive RAG | Index → Retrieve → Generate. Simple pipeline, brittle on complex queries.
Advanced RAG | Pre-retrieval (query rewriting) + post-retrieval (re-ranking, filtering). Better precision.
Modular RAG | Plug-and-play components. Routing, fusion, self-correction loops. Most production systems.

Interview Answer

RAG combines the parametric knowledge of an LLM with non-parametric retrieval from an external corpus. This lets the model answer questions about recent, proprietary, or domain-specific information while reducing hallucination by grounding responses in retrieved evidence.

Decision Framework

RAG vs Fine-tuning

One of the most common interview questions in applied LLM roles. The answer isn't either/or — but knowing when each is the right tool matters.

✓ Choose RAG when

  • Knowledge changes frequently — re-indexing is cheap; retraining is expensive.
  • You need source attribution — answers can be traced back to specific documents. Critical for claims verification.
  • Domain knowledge is large — fine-tuning can't realistically encode millions of documents into weights.
  • You need to reduce hallucination — grounding in retrieved documents is more reliable than parametric memory.
  • Limited compute budget — no GPU training required, just inference + vector search.

✓ Choose Fine-tuning when

  • You need a specific style or output format — RAG can't change how the model writes.
  • Teaching a new task or behavior — structured JSON extraction, domain-specific reasoning, custom instruction formats.
  • Specialized vocabulary — fine-tuning embeds domain terms into weights that the base model doesn't understand.
  • Latency is critical — no retrieval step means faster inference.
  • Knowledge is stable and bounded — if facts don't change, baking them into weights is fine.

In practice — the answer is usually both

Fine-tune the model to behave correctly in your domain — output format, reasoning style, instruction following. Then add RAG on top to supply current, specific factual knowledge. This is the production pattern at most mature ML teams.

Section 2

RAG Pipeline Architecture — End to End

A production RAG pipeline has two distinct phases: offline indexing and online retrieval-generation.

Phase 1 — Offline Indexing

01 Document ingestion: Load raw documents (PDFs, HTML, JSON, CSVs, internal databases).
02 Preprocessing: Clean text by removing boilerplate, normalizing whitespace, and handling special characters.
03 Chunking: Split documents into smaller, semantically coherent pieces.
04 Embedding: Encode each chunk into a dense vector using an embedding model.
05 Indexing: Store vectors in a vector database with metadata (source, date, document ID, chunk ID).

Phase 2 — Online Retrieval-Generation

01 Query processing: Optionally rewrite or expand the user query.
02 Query embedding: Encode the query using the same embedding model used at index time.
03 Retrieval: Perform ANN search to find top-K relevant chunks.
04 Re-ranking: Apply a cross-encoder to re-score top-K candidates more precisely.
05 Context assembly: Select and format retrieved chunks into a prompt context window.
06 Generation: Pass query + context to the LLM. The model generates a grounded answer.
07 Post-processing: Citation extraction, answer validation, faithfulness checking.

Key Insight

The embedding model used at query time MUST be identical to the one used at indexing time. Switching embedding models requires re-indexing the entire corpus.
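The two phases can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, a sorted list stands in for the vector database, and `build_prompt` is an assumed prompt format. The one property it preserves is that the same `embed` function is used at index time and at query time.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def build_index(chunks):
    # Offline phase: embed every chunk once and store it next to its text.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query, k=2):
    # Online phase: embed the query with the SAME function, then rank chunks.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

index = build_index([
    "solar panels convert sunlight into electricity",
    "wind turbines convert wind into electricity",
    "the cafeteria opens at nine",
])
top = retrieve(index, "how do solar panels make electricity", k=1)
prompt = build_prompt("how do solar panels make electricity", top)
```

In a real system the generation step would send `prompt` to an LLM; that call is omitted here because it depends on the provider.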

Section 3

Chunking Strategies

Chunking is one of the highest-impact decisions in RAG. Too large = diluted relevance signal, context overflow. Too small = loses context, incomplete answers.

Fixed-Size Chunking

Split document every N tokens (e.g., 512) with optional overlap (e.g., 50 tokens). Simple, fast, predictable chunk sizes — but cuts mid-sentence or mid-paragraph, breaking semantic units. Use when documents are uniform and structured (database records, short product descriptions).
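A minimal sketch of fixed-size chunking with overlap. It assumes the text is already tokenized; the whitespace split below is only a stand-in for a real tokenizer, and `chunk_fixed` is a hypothetical helper name.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Whitespace splitting stands in for a real tokenizer here.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
# Each chunk repeats the last token of the previous one.
```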

Sentence-Based Chunking

Use sentence boundary detection (spaCy, NLTK) to split at natural breaks. Preserves semantic units better than fixed-size, but chunks vary widely in size and very short sentences create noisy small chunks.

Semantic / Embedding-Based Chunking

Embed each sentence and group consecutive sentences whose embeddings are similar (cosine similarity above threshold). Split when similarity drops sharply. Produces semantically coherent chunks but is computationally expensive. Best for long heterogeneous documents (research papers, sustainability reports, legal docs).
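One way to sketch the grouping step, assuming sentence embeddings have already been computed elsewhere. The two-dimensional vectors and the 0.7 threshold are illustrative only, and this version compares each sentence with its immediate predecessor rather than with a running chunk average.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embeddings, threshold=0.7):
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return chunks

sentences = ["Scope 1 covers direct emissions.", "These come from owned sources.",
             "The office closes at six.", "Parking is behind the building."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy vectors
groups = semantic_chunks(sentences, embeddings)
```

The topic shift between the second and third sentence is where the similarity drops, so the split lands there.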

Hierarchical / Parent-Child Chunking

Index small child chunks (e.g., 128 tokens) for retrieval precision but return larger parent chunks (e.g., 512 tokens) to the LLM for context richness. Best of both worlds — precise retrieval + rich context for generation. Store parent_id as metadata on each child chunk; fetch parent on retrieval.
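The "fetch parent on retrieval" step can be sketched as a lookup from child hits to de-duplicated parents. The dictionaries below are hypothetical stand-ins for chunk metadata stored in the vector database.

```python
def expand_to_parents(child_hits, child_to_parent, parent_text):
    """Map ranked child-chunk hits to parent chunks, de-duplicated in rank order."""
    seen, parents = set(), []
    for child_id in child_hits:
        parent_id = child_to_parent[child_id]
        if parent_id not in seen:  # two children may share one parent
            seen.add(parent_id)
            parents.append(parent_text[parent_id])
    return parents

child_to_parent = {"c1": "p1", "c2": "p1", "c3": "p2"}
parent_text = {
    "p1": "Full section on Scope 1 emissions.",
    "p2": "Full section on Scope 2 emissions.",
}
context = expand_to_parents(["c2", "c3", "c1"], child_to_parent, parent_text)
```

De-duplicating matters because several small children often point at the same parent, and sending it twice wastes context window.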

Document-Structure-Aware Chunking

Use document structure signals — headings, sections, tables, figures — to define chunk boundaries. Markdown: split at ## heading boundaries. PDF: use layout analysis to detect section breaks. HTML: use tag structure (div, section, article) as split points.

Strategy Comparison

Strategy | Best For | Watch Out For
Fixed-size | Uniform short docs | Mid-sentence cuts
Sentence-based | Narrative text | Size variance
Semantic | Research papers, reports | Compute cost
Hierarchical | Production RAG | Complex retrieval logic
Structure-aware | PDFs, HTML, Markdown | Parser fragility

Chunk Size Matters More Than People Think

Too small (64–128 tokens)

Each chunk lacks context. "It increased 15% last quarter" means nothing without knowing what "it" refers to.

Too large (2048+ tokens)

Each chunk covers multiple topics, diluting relevance. When you search for revenue data, you get a chunk that's 10% about revenue and 90% about headcount.

Sweet spot (256–512 tokens)

Enough context to be self-contained, focused enough to be relevant. Most production RAG systems use 256–512 token chunks with 50-token overlap. Anthropic's RAG guidelines recommend this range.

Interview Answer — Chunking Trade-off

Smaller chunks improve retrieval precision because the embedding focuses on a narrow topic. But smaller chunks give the LLM less context, risking incomplete answers. Hierarchical chunking is the production best practice: retrieve small, generate with large.

Section 4

Embedding Models

Embedding models convert text into dense vectors. The quality of embeddings directly determines retrieval quality — garbage embeddings mean garbage retrieval, no matter how good your LLM is.

Evolution of Text Embeddings

  • TF-IDF / BM25 (sparse): Bag-of-words representation. High-dimensional sparse vectors. No semantic understanding — 'car' and 'automobile' are unrelated.
  • Word2Vec / GloVe (dense, word-level): Each word gets a fixed vector learned from co-occurrence statistics. Cannot handle polysemy ('bank' = riverbank or financial bank?). No context-sensitivity.
  • BERT-based (contextual, token-level): Each token gets a vector that depends on surrounding context. Handles polysemy. But designed for classification, not retrieval.
  • Sentence Transformers / SBERT: Fine-tuned BERT with siamese/triplet network to produce sentence-level embeddings optimized for semantic similarity. A strong, widely used baseline for RAG retrieval.
  • OpenAI text-embedding-3: API-based, 1536 or 3072 dimensions, strong performance. No control over training, cost per token.

Bi-Encoder vs. Cross-Encoder

Bi-Encoder | Cross-Encoder
Encodes query and document independently | Encodes query and document together (concatenated)
Can pre-compute document embeddings offline | Must run at query time; cannot pre-compute
Sub-linear query time with an ANN index (roughly O(log N) for HNSW) | O(N) model forward passes: must score every candidate
Less accurate: query and document never attend to each other | More accurate: attends to query-doc interaction
Use for: first-stage retrieval (top-K) | Use for: re-ranking the top-K results

Key Interview Point

Bi-encoders and cross-encoders are almost always used together. Bi-encoder retrieves top-100 candidates fast. Cross-encoder re-ranks those 100 to find the best 5. This is the standard two-stage retrieval pipeline.

Contrastive Learning Objectives

  • InfoNCE loss: For a positive pair (query, relevant doc), maximize similarity relative to N negative pairs. Softmax over all pairs. Used in DPR, E5.
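  • Triplet loss: Anchor + positive + negative. Minimize max(0, distance(anchor, positive) − distance(anchor, negative) + margin), so the positive ends up closer to the anchor than the negative by at least the margin.
  • SimCLR: Self-supervised contrastive learning, originally proposed for images. Augmentations of the same input are positive pairs. The same idea carries over to text and multimodal embeddings.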
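The two supervised objectives can be written out for a single training example. This is a sketch on precomputed similarity and distance values; the temperature of 0.05 and the margin of 0.2 are illustrative assumptions, not values from any particular model.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.05):
    """Negative log-softmax of the positive pair against the negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

def triplet_loss(d_pos, d_neg, margin=0.2):
    """Zero once the negative is farther than the positive by at least `margin`."""
    return max(0.0, d_pos - d_neg + margin)

easy = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.2])   # positive clearly wins
hard = info_nce(pos_sim=0.3, neg_sims=[0.1, 0.2])   # positive barely wins
```

Both losses shrink as the positive pair separates from the negatives; InfoNCE keeps pushing against all negatives at once, while triplet loss stops once the margin is satisfied.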

Similarity Metrics

Metric | When to Use
Cosine similarity | When magnitude doesn't matter, only direction. Works well for normalized embeddings. Most common for text retrieval.
Dot product | When magnitude matters. Faster than cosine (no normalization). Used in MIPS (Maximum Inner Product Search).
Euclidean distance | When absolute position in space matters. Less common for text. Used in image embeddings.
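A small worked example makes the difference concrete. The two vectors below point in the same direction but differ in length, so cosine similarity treats them as identical while dot product and Euclidean distance do not.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [3.0, 4.0], [6.0, 8.0]  # same direction, v is twice as long

cos_uv = cosine(u, v)      # direction only
dot_uv = dot(u, v)         # grows with magnitude
dist_uv = euclidean(u, v)  # nonzero even though the direction matches
```

For unit-normalized embeddings the three agree on ranking, since the dot product equals the cosine and the squared Euclidean distance is 2 − 2·cosine.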

Dimensionality Reduction

  • PCA: Linear projection onto principal components. Preserves global variance. Fast but loses non-linear structure.
  • UMAP: Non-linear. Preserves local neighborhood structure and typically retains more global structure than t-SNE. Good for visualization and approximate compression.
  • Matryoshka embeddings: Train embeddings so that the first D dimensions are a good representation — allows truncating at inference time for speed/storage trade-offs.
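Truncating a Matryoshka-style embedding amounts to keeping the first D dimensions and re-normalizing. The sketch below assumes the embedding came from a model trained with a Matryoshka objective; applying it to an ordinary embedding would generally hurt quality. `truncate_embedding` is a hypothetical helper name.

```python
import math

def truncate_embedding(vec, d):
    """Keep the first `d` dimensions and rescale to unit length."""
    head = vec[:d]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [3.0, 4.0, 12.0]             # toy 3-dimensional embedding
short = truncate_embedding(full, 2)  # 2-dimensional, unit length
```

Re-normalizing keeps cosine and dot-product scores comparable across truncated vectors.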

Section 5

Vector Databases — Trade-offs

Vector databases are purpose-built for Approximate Nearest Neighbor (ANN) search over dense embeddings. The key trade-off in all of them: recall vs. speed vs. memory.

Core ANN Index Types

  • HNSW (Hierarchical Navigable Small World): Graph-based index. High recall, fast queries, but high memory usage. Default in Weaviate, Qdrant, Milvus. Best general-purpose choice.
  • IVF (Inverted File Index): Cluster-based. Divides space into Voronoi cells. Fast build, lower memory. Must tune nprobe (how many clusters to search). Used in FAISS.
  • PQ (Product Quantization): Compresses vectors by quantizing sub-spaces. 8–32x compression. Combined with IVF as IVF+PQ for large-scale deployment. Lower recall.
  • Flat Index: Exact brute-force search. Perfect recall. Only viable for small corpora (< 1M vectors).

Vector Database Comparison

DB | Best For | Key Trade-off
FAISS | Research, offline batch search, when you own infra | No built-in persistence/filtering; library not a server
Pinecone | Managed cloud, fast setup, production SaaS | Cost at scale; less control over index tuning
Weaviate | Hybrid search (dense + sparse), filtering, GraphQL API | Higher ops complexity; self-hosted by default
Chroma | Local dev, prototyping, small-scale apps | Not production-grade at large scale
pgvector | Already using PostgreSQL; want SQL + vector in one place | Slower than dedicated vector DBs at scale
Qdrant | High-performance, Rust-based, great filtering | Newer, smaller ecosystem

Key Parameters to Tune

  • ef_construction (HNSW): Controls index build quality. Higher = better recall, slower build.
  • ef (HNSW query): Controls search quality at query time. Higher = better recall, slower query.
  • nprobe (IVF): Number of clusters searched per query. Higher = better recall, slower.
  • m (HNSW): Number of connections per node. Higher = better recall, more memory.

Interview Answer — Vector DB Choice

For prototyping: Chroma or FAISS. For production with filtering and hybrid search: Weaviate or Qdrant. If the team is already on AWS/GCP with managed infra preferences: Pinecone. The key question is: do you need metadata filtering alongside vector search? If yes, avoid FAISS (no filtering) and use a server-based DB.

Section 6

ANN Algorithms — How Vector Search Actually Works

Every vector database under the hood uses an Approximate Nearest Neighbor (ANN) index. Understanding these algorithms is essential for production RAG — they control the recall/speed/memory triangle that governs retrieval performance.

Why ANN Exists — The Scaling Problem

The naive approach — compare the query against every stored vector — is called exact nearest neighbor search.

Exact Search Complexity

N = number of vectors · D = embedding dimension

Cost per query = O(N × D)

OpenAI text-embedding-3-large: N = 1M docs, D = 3,072

Operations per query = 1,000,000 × 3,072 = 3,072,000,000 (≈ 3.1 billion) multiply-adds, which is impractical at scale
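Exact search is just a full scan. The sketch below scores every stored vector against the query with a dot product and keeps the best k, which is why the cost is N × D per query. `exact_top_k` is a hypothetical helper, and the vectors are toy data.

```python
import heapq

def exact_top_k(query, vectors, k):
    """Score every vector (N dot products of length D) and keep the k best."""
    scored = ((sum(q * x for q, x in zip(query, vec)), i)
              for i, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (score, index), best first

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
hits = exact_top_k([1.0, 0.0], vectors, k=2)

# Cost for the example in the text: one multiply-add per dimension per vector.
ops_per_query = 1_000_000 * 3072
```

This flat scan is still the right tool for building a ground-truth set when measuring the recall of an ANN index.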

The ANN Solution

Accept a small, controlled drop in recall (e.g., 95–99% instead of 100%) in exchange for 10–1000× speed improvement. Build data structures that skip the vast majority of vectors and only check a small promising subset.

The Core Trade-off Triangle

Dimension | What It Means | How to Tune
Recall | % of true nearest neighbors found. 100% = exact search. | Increase ef (HNSW), nprobe (IVF), L (LSH)
Speed | Query latency: how fast you get an answer. | Decrease ef, nprobe, or L
Memory | RAM needed to store the index. | Use PQ compression, reduce M (HNSW)

KD-Trees — Space Partitioning

Recursively partitions the vector space by splitting on one dimension at a time, building a binary tree. At query time, traverse the tree and prune branches that cannot possibly contain a nearer neighbor.

  • Building: At each node, split on the dimension with highest variance. Recursively partition left/right subtrees until leaf size is small.
  • Pruning rule: Skip a subtree if the distance from the query to the split hyperplane is greater than the current best distance.

Curse of Dimensionality

In high dimensions, points are spread so far apart that the distance to any split hyperplane is almost always smaller than the best distance found — so almost no branches get pruned. The tree degenerates to brute force.

D < 20 → KD-tree works well

D ≈ 50 → starts degrading

D > 100 → essentially brute force

Modern embeddings: D = 384–3072 → useless

Interview Answer — KD-Trees

KD-trees work great for GPS coordinates or simple feature vectors (D<20). Modern text embeddings have D=384–3072. The curse of dimensionality means almost no branches get pruned, so the tree degenerates to brute force. HNSW or IVF are the right choices for embedding search.

Ball Trees — Hypersphere Partitioning

Like KD-trees but partition space into hyperspheres instead of hyperplanes. Handles non-axis-aligned clusters better and works in slightly higher dimensions.

  • Building: Each node defines a ball: (center=centroid, radius=max distance to any point). Split by picking two furthest seeds, assign points to nearest seed, recurse.
  • Pruning rule: Skip a ball if: dist(query, center) - radius > best_dist_so_far. I.e., even the closest possible point in the ball can't beat the current best.

Ball Trees | KD-Trees
Partition into hyperspheres | Partition into axis-aligned hyperrectangles
Better for non-axis-aligned clusters | Better for axis-aligned data
Works up to D ≈ 50 | Works up to D ≈ 20
Slower to build | Faster to build

Both fail above D ≈ 50–100; use HNSW instead.

LSH — Locality Sensitive Hashing

Design hash functions where similar vectors are more likely to produce the same hash value. Opposite of cryptographic hashes which deliberately make similar inputs produce different hashes.

  • Random Projection LSH: Project vectors onto random unit vectors. Similar vectors land near each other on the line. Divide into buckets — similar vectors fall into the same bucket.
  • K hash functions per table: Use K hash functions and require all K to agree. Higher K → fewer false positives but more false negatives (similar vectors split across bucket boundaries).
  • L tables: Repeat with L independent sets of K hash functions. At query time, look up in all L tables and union candidates. Higher L → better recall, more memory.
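The K-hashes-per-table and L-tables structure can be sketched as follows. One assumption to flag: this version hashes with the sign of each random projection (the random-hyperplane variant used for cosine similarity) rather than bucketing positions along a line. `LSHIndex` and its defaults are hypothetical.

```python
import random

def make_hyperplanes(dim, k, rng):
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

def lsh_key(vec, hyperplanes):
    """K-bit bucket key: one sign bit per random hyperplane."""
    return tuple(int(sum(h_i * x for h_i, x in zip(h, vec)) >= 0)
                 for h in hyperplanes)

class LSHIndex:
    def __init__(self, dim, k=4, L=8, seed=0):
        rng = random.Random(seed)
        # L independent tables, each with its own K hyperplanes.
        self.tables = [(make_hyperplanes(dim, k, rng), {}) for _ in range(L)]

    def add(self, item_id, vec):
        for planes, buckets in self.tables:
            buckets.setdefault(lsh_key(vec, planes), []).append(item_id)

    def candidates(self, vec):
        """Union of the query's bucket across all L tables."""
        found = set()
        for planes, buckets in self.tables:
            found.update(buckets.get(lsh_key(vec, planes), []))
        return found

idx = LSHIndex(dim=3)
idx.add("a", [1.0, 2.0, 3.0])
same_direction = idx.candidates([2.0, 4.0, 6.0])     # identical angle to "a"
opposite = idx.candidates([-1.0, -2.0, -3.0])        # every sign bit flips
```

Candidates returned this way still need an exact distance check; the hash only narrows the set.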

LSH Variants

Random Projection LSH: For cosine similarity. Project onto random lines.

MinHash LSH: For Jaccard similarity (set overlap). Near-duplicate document detection.

SimHash: Google's near-duplicate web page detection. Compact binary signature, small Hamming distance = near-duplicate.

Interview Answer — LSH

LSH uses random projections to create hash functions where similar vectors land in the same bucket. Multiple hash tables (L tables, K hashes each) improve recall. HNSW generally outperforms LSH on recall-speed for dense embeddings, but LSH remains the standard for near-duplicate detection and set-similarity tasks (MinHash).

IVF — Inverted File Index (KMeans-Based)

Divide the vector space into k clusters using KMeans. Each vector is assigned to its nearest centroid. At query time, find the nearest cluster centroids and only search vectors within those clusters.

  • Building: Run KMeans with k clusters on all N vectors. Build inverted lists: inverted_list[i] = all vectors assigned to cluster i. Build time: O(N × k × D × iterations).
  • nprobe: The main recall-speed knob. Controls how many clusters you search per query. nprobe=1 → fastest, ~80-90% recall. nprobe=k → exact search (same as brute force).

nprobe | Recall (illustrative; depends on data and k) | Speed
1 | ~80–90% (misses cluster boundary vectors) | Fastest
4 | ~95% (covers most boundary cases) | Fast
16 | ~99% (near-exact) | Moderate
k (all clusters) | 100% (exact search) | Same as brute force
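A toy IVF shows why low nprobe misses neighbors near a cluster boundary. To keep it short, the centroids are given by hand instead of being trained with KMeans, which is an assumption of this sketch; `build_ivf` and `ivf_search` are hypothetical helpers.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def build_ivf(vectors, centroids):
    """Inverted lists: cluster id -> indices of the vectors assigned to it."""
    lists = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1, k=1):
    # Probe only the nprobe clusters whose centroids are closest to the query.
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]

centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[1.0, 0.0], [3.0, 0.0], [5.1, 0.0], [9.0, 0.0]]
lists = build_ivf(vectors, centroids)

query = [4.9, 0.0]  # falls in cluster 0, but its true neighbor is in cluster 1
approx = ivf_search(query, vectors, centroids, lists, nprobe=1)
exact = ivf_search(query, vectors, centroids, lists, nprobe=2)
```

With nprobe=1 the search returns the best vector inside cluster 0 and never sees the closer vector just across the boundary; probing both clusters recovers it.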

IVF + PQ — Scaling to Billions

Product Quantization (PQ) compresses vectors to save memory. Split each D-dim vector into M sub-vectors, run KMeans (256 centroids) on each sub-space, replace each sub-vector with its nearest centroid ID (1 byte).

D=128, M=8: Original = 128×4 = 512 bytes → PQ = 8 bytes → 64× compression

Trade-off: 64× memory savings at the cost of some recall drop (IVF+PQ: 90–95% vs IVF: 95–99%).
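The compression arithmetic generalizes to other dimensions and sub-vector counts. The sketch below assumes float32 storage and ignores the fixed cost of storing the codebooks; `pq_sizes` is a hypothetical helper.

```python
def pq_sizes(dim, num_subvectors, bytes_per_float=4, centroids_per_subspace=256):
    """Return (original bytes, PQ code bytes, compression ratio) per vector."""
    original = dim * bytes_per_float
    bits_per_code = (centroids_per_subspace - 1).bit_length()  # 256 centroids -> 8 bits
    compressed = num_subvectors * bits_per_code / 8
    return original, compressed, original / compressed

example = pq_sizes(dim=128, num_subvectors=8)     # the case worked in the text
larger = pq_sizes(dim=1536, num_subvectors=96)    # same ratio at higher dimension
```

The ratio depends only on how many dimensions each 1-byte code replaces: 16 float32 dimensions per code gives 64×.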

Interview Answer — IVF

IVF partitions the vector space into k clusters using KMeans. At query time, you only search the nprobe nearest clusters, skipping the rest. nprobe is the recall-speed knob — higher nprobe = better recall but slower. IVF+PQ adds compression for billion-scale search. I'd use IVF when memory is constrained and HNSW when I need the best recall-speed trade-off.

HNSW — Hierarchical Navigable Small World

The best general-purpose ANN algorithm for production. Used by Weaviate, Qdrant, Milvus, and FAISS. Builds a multi-layer graph where upper layers have sparse long-range connections for fast navigation, lower layers have dense short-range connections for precise search.

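  • Layer assignment: When inserting a vector, randomly assign its max layer using exponential decay: max_layer = floor(-ln(random()) × mL). The fraction of vectors that land only in Layer 0 is 1 − e^(−1/mL): about 63% for mL = 1, and 1 − 1/M (about 94% for M = 16) with the commonly used mL = 1/ln(M). A vector appears in ALL layers 0 to max_layer.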
  • Upper layers: Express highways — sparse, long-range connections. Lets you navigate to the right region of the space in a few hops.
  • Layer 0: Local streets — dense, short-range connections. Precise search within the right neighborhood.
  • ef parameter: Controls how many candidates you track during search. ef=50 is typical production starting point. ef must be ≥ k. Tune this first — it's the main production knob.
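The layer-assignment rule is easy to simulate. The sketch below samples max_layer = floor(-ln(u) × mL) and measures how many vectors stay in Layer 0 only, for two choices of mL; the sample size and seed are arbitrary.

```python
import math
import random

def sample_max_layer(mL, rng):
    u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
    return math.floor(-math.log(u) * mL)

def layer0_only_fraction(mL, n=100_000, seed=0):
    rng = random.Random(seed)
    return sum(sample_max_layer(mL, rng) == 0 for _ in range(n)) / n

M = 16
frac_default = layer0_only_fraction(1.0 / math.log(M))  # expected 1 - 1/M
frac_unit = layer0_only_fraction(1.0)                    # expected 1 - 1/e
```

Analytically the Layer-0-only fraction is 1 − e^(−1/mL), so the hierarchy gets sparser at the top as M grows when mL = 1/ln(M).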

HNSW Parameters — Complete Reference

Parameter | Typical value | Effect
M | 16 | Connections per node. Higher = better recall, more memory, slower build.
ef_construction | 200 | Build-time search width. Higher = better index quality, slower build.
ef (query) | 50 | Query-time search width. THE main production tuning knob.
Mmax | 32 (2×M) | Max connections at Layer 0. Usually left at 2×M.

Interview Answer — HNSW

HNSW builds a hierarchical graph. Upper layers are sparse with long-range connections: express highways to navigate to the right region quickly. Lower layers are dense with local connections for precise search. Layer assignment is random with exponential decay, so most vectors are only in Layer 0. At query time you navigate greedily top-to-bottom, computing distances to only a small fraction of the stored vectors (typically hundreds to a few thousand out of millions, depending on ef and M). The recall-speed trade-off is controlled by ef: tune it at inference time against a recall benchmark until you hit the latency budget.

Algorithm Comparison

Algorithm | Recall | Speed | Memory | Best Use Case
Exact / Flat | 100% | O(N×D), slowest | Low | Corpus <100K, eval baseline
KD-Tree | 100% | Fast (D<20), brute force (D>50) | Low | GPS, low-D geometric data
Ball Tree | 100% | Fast (D<50), slow (D>100) | Low | Non-axis-aligned clusters, D<50
LSH | ~85–95% | Fast | Medium | Near-duplicate detection, set similarity
IVF | ~95–99% | Fast | Medium | Large corpus, memory-constrained
IVF+PQ | ~90–95% | Very fast | Very low (64×) | Billion-scale, extreme memory constraints
HNSW | ~97–99% | Very fast | High | General production RAG; best default

Decision Guide

D < 20? KD-Tree or Ball Tree

D > 20 and corpus < 100K? Exact search (FAISS flat index)

Corpus > 100K and memory constrained? IVF or IVF+PQ

Corpus > 100K and memory available? HNSW (best default for production RAG)

Near-duplicate detection? LSH (MinHash for text)

Corpus > 1 billion vectors? IVF+PQ or HNSW+PQ

Section 7

Retrieval Strategies

Sparse Retrieval — TF-IDF and BM25

Sparse retrieval represents documents as high-dimensional sparse vectors (one dimension per vocabulary term). No ML required.

  • TF-IDF: Term Frequency × Inverse Document Frequency. Upweights rare terms that are specific to a document. Downweights common words. Simple but no length normalization.
  • BM25: The dominant sparse retrieval model. Extends TF-IDF with saturation (TF gains diminish as term frequency increases) and length normalization. Parameters: k1 (TF saturation, typically 1.2–2.0) and b (length normalization, typically 0.75).
BM25 Formula

$$
\text{score}(D,Q) = \sum_i \text{IDF}(q_i) \cdot \frac{f(q_i,D)\cdot(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\cdot\dfrac{|D|}{\text{avgdl}}\right)}
$$

where f(q_i, D) is the frequency of query term q_i in document D, |D| is the length of D in tokens, and avgdl is the average document length in the corpus.

Strength: Exact keyword matching — critical when domain has specialized terminology (emissions factor codes, regulatory identifiers, chemical names). Weakness: No semantic understanding. “CO₂ emissions” and “carbon footprint” are unrelated to BM25.
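The BM25 score can be computed directly from term counts. In this sketch the IDF term uses the smoothed form ln(1 + (N − n + 0.5) / (n + 0.5)), which is one common choice and an assumption here since the notes do not specify the IDF variant; k1 = 1.5 and b = 0.75 sit inside the typical ranges given above. Documents are pre-tokenized lists.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "emissions factor for diesel".split(),
    "carbon emissions report".split(),
    "office lunch menu".split(),
]
scores = bm25_scores(["emissions", "factor"], docs)
```

The first document matches both query terms, including the rarer "factor", so it scores highest; the third matches nothing and scores zero.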

Dense Retrieval

Embed both query and documents into the same dense vector space. Retrieve by nearest neighbor search. Captures semantic similarity that sparse methods miss.

  • DPR (Dense Passage Retrieval): Two separate BERT encoders — one for questions, one for passages — trained with contrastive loss on QA pairs.
  • ColBERT: Late interaction model. Stores one vector per token. At query time, computes MaxSim: for each query token, take its maximum similarity over all document tokens, then sum over query tokens. Higher recall than single-vector methods but larger index.

Hybrid Retrieval

Combine sparse (BM25) and dense (embedding) retrieval for best of both worlds.

  • Reciprocal Rank Fusion (RRF): Combine rankings from multiple retrieval systems without needing score normalization. score(d) = Σ 1/(k + rankᵢ(d)) where k=60 is a smoothing constant. Simple and effective.
  • Linear combination: Weighted sum of normalized BM25 and cosine similarity scores. Requires careful normalization and weight tuning.
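Reciprocal Rank Fusion needs only the ranked ID lists from each retriever. A minimal sketch with k = 60, using two hypothetical rankings from a BM25 and a dense retriever:

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "d"]
fused = rrf([bm25_ranking, dense_ranking])
```

Documents that appear in both lists ("b" and "c") rise above documents that only one retriever found, without ever comparing raw BM25 and cosine scores.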

Interview Answer — When to Use Hybrid

In production I almost always default to hybrid retrieval. Dense retrieval is great for semantic queries but misses exact matches. BM25 is great for technical terms, product codes, or regulatory IDs that must match exactly. RRF fusion is my default because it requires no score normalization.

Re-ranking

  • Cross-encoder re-ranking: Concatenate query + document, run through BERT. Produces a single relevance score. O(K) inference calls at query time. High quality but slow.
  • ColBERT re-ranking: More efficient late-interaction. Precompute token embeddings offline, do MaxSim at query time.
  • LLM re-ranking: Use the LLM itself to judge relevance. Highest quality, highest latency.

Query Expansion

  • HyDE (Hypothetical Document Embeddings): Use the LLM to generate a hypothetical answer to the query, then retrieve using that answer. The hypothesis is often closer to real documents in embedding space than the original question.
  • Multi-query: Generate multiple paraphrases of the query. Retrieve for each. Union the results. Improves recall.
  • Step-back prompting: Generalize the specific query to a broader question before retrieval. Useful when the question is too specific to match documents.

Section 8

Information Retrieval Evaluation Metrics

Be able to compute NDCG by hand. These metrics are standard in IR and search evaluation, and directly apply to measuring RAG retrieval quality.

Precision@K and Recall@K

  • Precision@K: Of the K retrieved documents, what fraction are relevant? P@K = (# relevant in top K) / K
  • Recall@K: Of all relevant documents in the corpus, what fraction appear in the top K? R@K = (# relevant in top K) / (total relevant)

Example

Corpus has 10 relevant docs. You retrieve 5. 3 of those 5 are relevant. P@5 = 3/5 = 0.6. R@5 = 3/10 = 0.3.
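The same example in code, with hypothetical document IDs: ten relevant documents in the corpus, five retrieved, three of them relevant.

```python
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

relevant = {f"d{i}" for i in range(1, 11)}    # 10 relevant docs in the corpus
retrieved = ["d1", "x1", "d2", "x2", "d3"]    # 3 of the 5 retrieved are relevant

p5 = precision_at_k(retrieved, relevant, 5)
r5 = recall_at_k(retrieved, relevant, 5)
```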

Mean Reciprocal Rank (MRR)

For queries where only the first relevant result matters. RR = 1 / rank_of_first_relevant_doc. MRR = mean over all queries.

Example

Query 1: first relevant doc at rank 2 → RR = 0.5. Query 2: first relevant doc at rank 1 → RR = 1.0. MRR = (0.5 + 1.0) / 2 = 0.75.

Use MRR when: You only care about the single best answer (e.g., factoid QA).
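A short sketch of MRR over the two queries from the example. Each run pairs a ranked result list with the set of relevant IDs; the IDs are placeholders.

```python
def reciprocal_rank(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0  # no relevant document was retrieved

def mean_reciprocal_rank(runs):
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["x", "a", "y"], {"a"}),  # first relevant doc at rank 2
    (["b", "x", "y"], {"b"}),  # first relevant doc at rank 1
]
mrr = mean_reciprocal_rank(runs)
```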

Mean Average Precision (MAP)

Extends Precision@K to consider the order of all relevant documents. AP = mean of P@K at each rank where a relevant doc appears.

Example

5 retrieved docs. Relevant ones at ranks 1, 3, 5. P@1 = 1.0, P@3 = 0.67, P@5 = 0.6. AP = (1.0 + 0.67 + 0.6) / 3 = 0.76. MAP = mean AP across queries.

Use MAP when: You want to reward systems that rank ALL relevant documents highly, not just the first one.
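Average Precision for the example above can be computed as follows. This sketch divides by the total number of relevant documents, which matches the hand calculation because all three relevant documents were retrieved.

```python
def average_precision(retrieved, relevant):
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / len(relevant) if relevant else 0.0

retrieved = ["r1", "x", "r2", "y", "r3"]  # relevant at ranks 1, 3, 5
ap = average_precision(retrieved, {"r1", "r2", "r3"})
```

MAP is then the mean of `average_precision` over all evaluation queries.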

Normalized Discounted Cumulative Gain (NDCG)

The most comprehensive IR metric. Handles graded relevance (highly relevant, somewhat relevant, irrelevant — not just binary).

  • DCG: Σ (2^relᵢ − 1) / log₂(i + 1) where relᵢ is the relevance score at rank i. Results nearer the top of the list (smaller i) get more weight.
  • IDCG: Ideal DCG — the DCG you'd get if results were in perfect order (most relevant first).
  • NDCG: DCG / IDCG. Normalized to [0, 1]. 1.0 = perfect ranking.

Hand Calculation Example

Relevance scores at ranks 1–4: [3, 2, 3, 0]

DCG = 7/log₂(2) + 3/log₂(3) + 7/log₂(4) + 0 = 7 + 1.89 + 3.5 = 12.39

Ideal order [3,3,2,0]: IDCG = 7 + 7/1.585 + 3/2 + 0 = 12.92

NDCG = 12.39 / 12.92 = 0.96

Use NDCG when: Relevance is graded — the standard metric for search and recommendation systems.
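The hand calculation can be checked in code. This sketch uses the exponential-gain form of DCG given above and, as a simplifying assumption, builds the ideal ordering from the retrieved list only.

```python
import math

def dcg(relevances):
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

ranked = [3, 2, 3, 0]                        # graded relevance at ranks 1 to 4
dcg_value = dcg(ranked)                      # 7 + 3/log2(3) + 7/2 + 0
idcg_value = dcg(sorted(ranked, reverse=True))
ndcg_value = ndcg(ranked)
```

If the evaluation set contains relevant documents that were not retrieved at all, the ideal ordering should be built from the full judged set instead, which lowers NDCG.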

Section 9

RAG Evaluation — RAGAS Framework

Evaluating RAG requires assessing both retrieval quality AND generation quality. RAGAS (Retrieval Augmented Generation Assessment) is a widely used framework for end-to-end evaluation.

The Four Core RAGAS Metrics

Metric | What It Measures | What It Catches
Faithfulness | Do generated claims appear in retrieved context? Score = (# claims supported by context) / (total claims in answer). | Hallucination: model adding facts not in context
Answer Relevance | Does the answer address the question asked? Measured by generating reverse-questions from the answer. | Off-topic answers: model answering a different question
Context Precision | Of retrieved chunks, what proportion are actually relevant to answering the question? | Noisy retrieval: irrelevant chunks confusing the model
Context Recall | Of the claims in the ground-truth answer, what fraction can be attributed to retrieved context? | Incomplete retrieval: missing chunks containing needed info

Answer Correctness (vs. Ground Truth)

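When you have labeled ground-truth answers, compare the generated answer with the reference at the semantic level rather than by exact string match. RAGAS's answer correctness metric combines a claim-level F1 (facts shared with, missing from, or added to the reference) with embedding similarity between the two answers.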

Component-Level Evaluation

  • Retrieval-only eval: Evaluate using NDCG, MAP, MRR against labeled relevant docs. Isolates retrieval quality from generation quality.
  • Generation-only eval: Feed gold-standard retrieved context, evaluate output only. Isolates LLM quality from retrieval.
  • End-to-end eval: Full pipeline evaluation. Harder to debug when it fails.

Interview Answer — How to Evaluate a RAG System

I evaluate RAG at two levels. First, retrieval quality using NDCG@10 against a labeled eval set — this tells me if the right chunks are being retrieved. Second, end-to-end using RAGAS metrics: faithfulness to catch hallucination, context precision to catch noisy retrieval, and answer relevance to catch topic drift. If faithfulness drops, I look at the generation prompt. If context precision drops, I look at chunking or retrieval.

Section 10

RAG Failure Modes and How to Fix Them

1

Retrieval Failure — Wrong Chunks Retrieved

The most common failure. The model generates a confident-sounding answer from irrelevant context.

Diagnosis: Low Context Precision or Context Recall in RAGAS. Check retrieval logs — are the right source documents appearing?

Fix: Fine-tune embedding model on domain data; use hybrid retrieval (BM25 + dense); improve chunking; add metadata filtering to narrow the search space.

2

Hallucination — Model Adds Facts Not in Context

The LLM ignores retrieved context and generates from its parametric memory, or blends retrieved facts with invented ones.

Diagnosis: Low Faithfulness score in RAGAS. Cross-reference each claim in the answer against retrieved chunks.

Fix: Stronger system prompt ('Answer ONLY using the provided context. If unsure, say I don't know.'); add citation enforcement — require the model to cite chunk IDs for each claim.

3

Context Window Overflow

Retrieved chunks exceed the LLM's context window. Later chunks are silently truncated.

Diagnosis: Long documents, high K. Check token counts before sending to LLM.

Fix: Reduce K; use smaller chunks; re-rank and keep only the top 3; use a model with a longer context window; account for the 'Lost in the Middle' effect by placing the most important context at the start or end of the prompt.
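Both fixes (token counting and position-aware ordering) fit in a few lines. A minimal sketch, assuming chunks arrive as `(score, text)` pairs; the whitespace word count is only a stand-in for the target model's real tokenizer, and both function names are illustrative.

```python
def pack_chunks(scored_chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the highest-scoring chunks that fit within max_tokens."""
    kept, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue  # skip chunks that do not fit instead of truncating them
        kept.append((score, text))
        used += cost
    return kept

def order_for_long_context(ranked_chunks):
    """Place the strongest chunks at the start and end, the weakest in the middle.

    ranked_chunks must be sorted best-first.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With five chunks ranked a to e, `order_for_long_context` yields a, c, e, d, b: the best chunk opens the context and the second best closes it.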

4

Semantic Drift / Query-Document Mismatch

The query is semantically distant from how information is expressed in documents. E.g., user asks a question but documents contain statements.

Diagnosis: Low context recall on queries that should have answers in the corpus.

Fix: HyDE — generate a hypothetical answer and use that as the retrieval query; multi-query retrieval with paraphrases.
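HyDE changes only what gets sent to the retriever. A minimal sketch, assuming `generate` is any callable that maps a prompt to text and `retrieve` is any callable that maps `(text, k)` to chunks; both are placeholders for your own LLM client and retriever, and the prompt wording is hypothetical.

```python
def hyde_retrieve(question, generate, retrieve, k=5):
    """Retrieve with a hypothetical answer passage instead of the raw question."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {question}\nPassage:"
    )
    hypothetical_doc = generate(prompt)
    # The hypothetical passage is statement-shaped, like the indexed documents,
    # so it tends to land closer to real answers in embedding space.
    return retrieve(hypothetical_doc, k)
```

The hypothetical passage may contain wrong facts. That is acceptable here because it is used only as a search query and never shown to the user.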

5

Outdated or Conflicting Information

Multiple retrieved chunks contain contradictory information (e.g., different emission factors from different years).

Diagnosis: Answer inconsistencies or explicit contradictions across retrieved chunks.

Fix: Add date metadata to chunks and filter by recency; add source reliability scoring; instruct the LLM to prefer the most recent source when conflicts arise.
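Recency filtering is straightforward once each chunk carries a date. A minimal sketch, assuming chunks are dicts with a hypothetical `published` field holding a `datetime.date`.

```python
from datetime import date

def prefer_recent(chunks, not_before=None):
    """Drop chunks older than not_before, then sort newest first."""
    fresh = [c for c in chunks if not_before is None or c["published"] >= not_before]
    return sorted(fresh, key=lambda c: c["published"], reverse=True)
```

Sorting newest first also gives the LLM a simple rule for conflicts: when two chunks disagree, the earlier one in the context is the more recent source.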

6

Over-Retrieval / Distraction

Retrieving too many chunks dilutes the signal. Irrelevant chunks distract the model.

Diagnosis: Low Context Precision, or low faithfulness even though the relevant chunks were retrieved.

Fix: Reduce K; apply re-ranking to filter aggressively; use relevance score thresholds to drop low-scoring chunks, even if that leaves fewer than K chunks.
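A score threshold combines naturally with top-K. A minimal sketch, assuming `(score, text)` pairs where higher scores mean more relevant; the default `min_score` of 0.5 is an arbitrary placeholder that has to be tuned per retriever or re-ranker.

```python
def filter_by_score(scored_chunks, k=10, min_score=0.5):
    """Keep at most k chunks, dropping any whose score is below min_score."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [(score, text) for score, text in ranked[:k] if score >= min_score]
```

K becomes an upper bound rather than a target: a query with only two good chunks sends two chunks to the LLM.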

Production Rule of Thumb

Always log: (1) what was retrieved for each query, (2) the relevance scores, and (3) whether the answer was faithful. Without this, debugging production failures is nearly impossible.
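One way to capture all three items is a JSON line per query. A minimal sketch; the field names are hypothetical and the faithfulness score is assumed to come from whatever judge you already run.

```python
import json
import time

def make_log_record(query, retrieved, faithfulness):
    """Serialize one query's retrieval trace as a JSON line.

    retrieved: list of (chunk_id, relevance_score) in ranked order.
    faithfulness: score from a judge, or None if not yet scored.
    """
    return json.dumps({
        "timestamp": time.time(),
        "query": query,
        "retrieved": [{"chunk_id": cid, "score": score} for cid, score in retrieved],
        "faithfulness": faithfulness,
    })
```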

Section 11

Advanced RAG Techniques

Selective

Self-RAG

The model learns to decide when to retrieve (not every query needs retrieval), critiques its own outputs, and selects the best generation among candidates. Uses special tokens: [Retrieve], [IsRel], [IsSup], [IsUse]. Key insight: unconditional retrieval hurts when the query doesn't need external knowledge.

Robust

Corrective RAG (CRAG)

After initial retrieval, a lightweight evaluator scores retrieval quality. If confidence is ambiguous, it augments the retrieved context with a web search. If confidence is low, it discards the retrieved context entirely and relies on the web search results instead. Key insight: gracefully handles the case where the vector DB doesn't have the answer.

Agentic

Agentic RAG

An LLM agent orchestrates multi-step retrieval. It can decompose complex questions into sub-questions, iteratively retrieve, check if it has enough information, and decide to search again if not. ReAct pattern: Reason → Act → Observe loop.

Multi-hop

Multi-Hop RAG

Some questions require chaining multiple retrievals. E.g., 'What is the emissions intensity of the top supplier of product X?' requires: (1) find top supplier, (2) retrieve emissions data for that supplier. IRCoT (Interleaved Retrieval + Chain of Thought) interleaves reasoning steps with retrieval steps.

Fusion

RAG Fusion

Generate multiple query variants → retrieve for each → fuse results using Reciprocal Rank Fusion → generate from fused context. Higher recall than single-query RAG.
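The fusion step is small enough to write out. A minimal sketch of Reciprocal Rank Fusion, assuming each query variant returns a ranked list of document IDs; `k=60` is the commonly used smoothing constant, and ties are broken alphabetically here only to keep the output deterministic.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because RRF uses ranks rather than raw scores, the same function also fuses BM25 and dense results without score normalization.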

Section 12

Advanced RAG Concepts — Deep Dives

Three techniques that separate production-grade RAG from toy demos: multi-hop retrieval for questions that span multiple documents, structured knowledge handling for tables and relationships, and adaptive strategies that route queries to the right pipeline.

1 · Multi-Hop and Cross-Document Retrieval

Standard RAG works when the answer lives in one chunk. Multi-hop handles questions that require chaining facts across multiple documents — each retrieval step informs the next query.

Single-hop (standard RAG)

"What are Amazon's Scope 1 emissions?"
→ Retrieve one chunk from the sustainability report → done.

Multi-hop (chained retrieval)

"What is the emissions intensity of Amazon's top supplier in electronics?"
→ Hop 1: find top supplier → Hop 2: get their emissions → Hop 3: get their revenue → calculate.

Multi-hop retrieval — step by step
Query: "What is the emissions intensity of Amazon's top electronics supplier?"

Step 1: Retrieve for original query
  → "Foxconn is Amazon's largest electronics supplier"

Step 2: Reformulate with what was found
  → new query: "Foxconn annual emissions"
  → "Foxconn emitted 9.2M tonnes CO2e in 2023"

Step 3: Reformulate again
  → new query: "Foxconn annual revenue 2023"
  → "Foxconn revenue was $214B in 2023"

Step 4: Generate final answer
  → emissions intensity = 9.2M tonnes / $214,000M ≈ 43 tonnes CO2e per $M revenue

Implementation patterns
IRCoT (Interleaved Retrieval + Chain of Thought):
  Interleave reasoning steps with retrieval steps.
  Reason: "I need the top supplier first"
  Retrieve: "Foxconn is top supplier"
  Reason: "Now I need Foxconn's emissions"
  Retrieve: "9.2M tonnes CO2e"
  Reason: "Now I need revenue to compute intensity"
  Retrieve: "$214B revenue"
  Generate: "Emissions intensity = 43 tonnes per $M"

ReAct pattern:
  Thought → Action (retrieve) → Observation → Thought → ...
  Loop until sufficient information to answer.
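The defining property of multi-hop retrieval is that each hop's query depends on the previous hop's result. A minimal sketch of that chaining, assuming a toy fact store in place of a real retriever plus LLM extraction step; the keys are hypothetical and the figures are the illustrative numbers from the walkthrough above, not verified data.

```python
# Toy stand-in for "retrieve a chunk, then extract the value from it".
FACTS = {
    ("Amazon", "top_electronics_supplier"): "Foxconn",
    ("Foxconn", "emissions_tonnes_2023"): 9.2e6,
    ("Foxconn", "revenue_usd_2023"): 214e9,
}

def retrieve(entity, attribute):
    """Look up one fact; a real system would run a retrieval query here."""
    return FACTS[(entity, attribute)]

def emissions_intensity(company):
    """Chain three hops and return (supplier, tonnes CO2e per $M revenue)."""
    supplier = retrieve(company, "top_electronics_supplier")  # hop 1
    emissions = retrieve(supplier, "emissions_tonnes_2023")   # hop 2 needs hop 1's answer
    revenue = retrieve(supplier, "revenue_usd_2023")          # hop 3
    return supplier, emissions / (revenue / 1e6)
```

In IRCoT or ReAct the hop sequence is not hard-coded: the model decides after each observation which query to issue next and when to stop.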

Why this matters for sustainability

Sustainability questions are inherently multi-hop. "Which of our top 10 suppliers has the highest emissions per dollar of spend?" requires supplier data, emissions data, and spend data — all from different sources. Cross-document retrieval is the core challenge: facts are spread across supplier lists, emissions disclosures, and financial reports.

2 · Table Extraction and Graph RAG

Standard RAG embeds text chunks. But huge amounts of sustainability data live in tables and relationship graphs — structures that text embeddings can't represent well.

Table Extraction

The problem — raw table as text loses structure
Raw table embedded as text (bad):
"Scope 1 2021 45000 Scope 2 2021 123000 Scope 3 2021 890000
 Scope 1 2022 42000 Scope 2 2022 118000..."

The embedding has no idea this is a 2D structure.
Rows and columns that have meaning are flattened away.

  • Table-to-text verbalization: Convert each cell to natural language before embedding. "In 2021, Scope 1 emissions were 45,000 tonnes. In 2022, Scope 1 decreased to 42,000 tonnes." The embedding now captures meaning.
  • Row-per-chunk: Each row becomes its own chunk with column headers prepended. "Year: 2021 | Category: Scope 1 | Value: 45,000 tonnes CO2e"
  • SQL / structured retrieval: Parse the table into a database. Convert natural language questions to SQL at query time. Most reliable for numerical questions and comparisons.
  • Multimodal embedding: Treat the table as an image rather than as text. Vision-language models like GPT-4V can reason about table structure visually, which is useful when tables have complex merged cells or visual formatting.
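The first two options are simple string transforms applied before embedding. A minimal sketch using the Scope 1 rows from the example above; the function names and the fixed `(year, category, value)` row shape are assumptions for illustration.

```python
def rows_to_chunks(headers, rows):
    """Row-per-chunk: prepend the column header to every cell value."""
    return [" | ".join(f"{h}: {v}" for h, v in zip(headers, row)) for row in rows]

def verbalize(row):
    """Table-to-text: turn one (year, category, value) row into a sentence."""
    year, category, value = row
    return f"In {year}, {category} emissions were {value:,} tonnes CO2e."
```

For instance, `verbalize((2021, "Scope 1", 45000))` produces "In 2021, Scope 1 emissions were 45,000 tonnes CO2e.", which embeds far better than the flattened cell sequence.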

Graph RAG

Standard RAG treats documents as isolated chunks with no explicit connections. Graph RAG builds a knowledge graph where entities are nodes and relationships are edges — then uses graph traversal for retrieval.

Standard RAG vs Graph RAG
Standard RAG index (disconnected chunks):
  Chunk 1: "Amazon partners with Rivian for electric delivery vans"
  Chunk 2: "Rivian produces zero-emission vehicles"
  Chunk 3: "Amazon's Scope 3 transport emissions fell 8% in 2023"
  → No explicit link between them.

Graph RAG index (explicit relationships):
  [Amazon] --PARTNERS_WITH--> [Rivian]
  [Rivian] --PRODUCES--> [Electric Delivery Vans]
  [Electric Delivery Vans] --REDUCES--> [Scope 3 Emissions]
  [Amazon] --HAS--> [Scope 3 Emissions]

Query: "How is Amazon reducing transport emissions?"
  → Graph traversal: Amazon → partners → Rivian
    → produces → electric vans → reduces → Scope 3
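The traversal above can be reproduced with a breadth-first search over the four example edges. A minimal sketch, assuming the graph is a plain list of `(source, relation, target)` triples; the `skip_relations` parameter is a hypothetical knob for ignoring uninformative shortcut edges.

```python
from collections import deque

EDGES = [
    ("Amazon", "PARTNERS_WITH", "Rivian"),
    ("Rivian", "PRODUCES", "Electric Delivery Vans"),
    ("Electric Delivery Vans", "REDUCES", "Scope 3 Emissions"),
    ("Amazon", "HAS", "Scope 3 Emissions"),
]

def find_path(start, goal, edges=EDGES, skip_relations=frozenset()):
    """Return the shortest list of triples linking start to goal, or None."""
    graph = {}
    for src, rel, dst in edges:
        if rel not in skip_relations:
            graph.setdefault(src, []).append((rel, dst))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, dst in graph.get(node, []):
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(node, rel, dst)]))
    return None
```

Without `skip_relations`, the shortest path is the direct HAS edge, which explains nothing. Skipping it yields the three-step chain through Rivian, and each triple on that path points to source chunks that can be handed to the LLM as evidence.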

Microsoft GraphRAG (2024)

  • Extracts entities and relationships from all documents using an LLM
  • Builds a community structure — clusters of related entities
  • Generates summaries for each community
  • At query time uses both graph traversal and vector search

Graph RAG wins when… | Standard RAG wins when…
Questions about relationships between entities | Specific factual lookups
Corpus-wide summarization ("main themes across all reports") | Small corpus where graph construction overhead isn't worth it
Questions requiring global understanding | Keyword-based queries without relational structure

3 · Adaptive Retrieval Strategies

A fixed pipeline treats every query the same way. Adaptive retrieval detects the query type and routes it to the right strategy — avoiding wasted retrieval for simple questions and under-powered retrieval for complex ones.

Why a fixed strategy fails
Query A: "What is CO2?"
→ General knowledge. LLM already knows this.
  Retrieval wastes time and adds noise. Don't retrieve.

Query B: "What were our Scope 3 emissions in Q3 2024?"
→ Specific factual lookup. Needs retrieval.
  Dense retrieval is fine — straightforward semantic match.

Query C: "Compare emissions reduction strategies across
          all suppliers in Southeast Asia"
→ Multi-document synthesis. Needs multi-hop retrieval.

Query D: "Verify this supplier's claimed 30% reduction"
→ Targeted evidence retrieval + citation checking.

A fixed pipeline applies the same strategy to all four.
Adaptive retrieval — routing logic
Step 1: Query classification
  - Does this need retrieval at all?
  - Is it simple (single-hop) or complex (multi-hop)?
  - Is it a lookup, comparison, or verification task?

Step 2: Route to the right strategy
  No retrieval needed  → straight to LLM
  Simple factual       → dense retrieval, top-3 chunks
  Complex multi-part   → multi-hop retrieval pipeline
  Comparison           → retrieve from multiple sources, synthesize
  Verification         → targeted retrieval + citation extraction

Step 3: Adjust K dynamically
  Simple query  → K=3 (less context needed)
  Complex query → K=10 (more context needed)

Step 4: Confidence-based fallback
  Retrieval confidence low → trigger web search (CRAG)
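Steps 1 to 3 reduce to a classifier plus a lookup table. A minimal sketch, assuming a toy keyword classifier where a real system would use an LLM or a trained model; the strategy names are hypothetical, and the K values for comparison and verification are assumptions (the notes only give K for simple and complex queries).

```python
import re

ROUTES = {
    "none": {"strategy": "llm_only", "k": 0},
    "simple": {"strategy": "dense", "k": 3},
    "complex": {"strategy": "multi_hop", "k": 10},
    "comparison": {"strategy": "multi_source_synthesis", "k": 10},   # K assumed
    "verification": {"strategy": "targeted_with_citations", "k": 3},  # K assumed
}

def classify(query):
    """Toy keyword heuristics standing in for a learned query classifier."""
    q = query.lower()
    if "verify" in q or "claimed" in q:
        return "verification"
    if "compare" in q:
        return "comparison"
    if "which of" in q:
        return "complex"
    if re.search(r"\b(our|scope|supplier\w*)\b", q):
        return "simple"  # mentions corpus-specific entities, so retrieve
    return "none"        # looks like general knowledge, skip retrieval

def route(query):
    """Map a query to its retrieval strategy and K."""
    return ROUTES[classify(query)]
```

Applied to the four example queries, this routes Query A to `llm_only`, Query B to `dense` with K=3, Query C to `multi_source_synthesis`, and Query D to `targeted_with_citations`. The confidence-based fallback in Step 4 would run after retrieval and is not covered here.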

Self-RAG — the model decides when to retrieve

Self-RAG special tokens
The model learns to emit special tokens:
  [Retrieve]    → "I need to retrieve for this"
  [No Retrieve] → "I already know this, skip retrieval"
  [IsRel]       → "Is this retrieved chunk relevant?"
  [IsSup]       → "Does this chunk support my answer?"
  [IsUse]       → "Is the generated answer useful for the query?"

The model scores its own retrievals and decides whether
to use them, retry, or generate without retrieval.

Interview One-Liner — All Three

Multi-hop handles questions where the answer requires chaining facts across multiple documents — each retrieval informs the next query. Table extraction and Graph RAG handle structured knowledge that text embeddings can't represent well — tables need verbalization or SQL, and relationship-heavy corpora benefit from a knowledge graph. Adaptive retrieval routes different query types to different strategies — simple factual queries get basic dense retrieval, complex analytical queries get multi-hop pipelines, and queries the model already knows skip retrieval entirely. In a sustainability context all three are critical because the data is heterogeneous, highly relational, and spans very different query types.

Section 13

Production RAG Considerations

Latency vs. Throughput Trade-offs

Component | Latency Optimization
Embedding (query) | Cache frequent queries; use smaller/distilled embedding model; GPU inference
ANN Search | Tune HNSW ef parameter; use IVF for larger corpora; horizontal scaling
Re-ranking | Limit K for re-ranking (top-20, not top-100); use distilled cross-encoder
LLM generation | Use streaming; smaller model for simple queries; cache common answers

Caching

  • Query embedding cache: Cache query embeddings for repeated queries. LRU cache with TTL.
  • Semantic cache: Cache (query_embedding → answer) pairs. On new query, check if a semantically similar query was answered recently. Hit rate depends on query distribution.
  • Document embedding cache: Document embeddings are computed offline — no runtime cost.
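A semantic cache is a nearest-neighbor lookup with a similarity cutoff and a TTL. A minimal sketch, assuming embeddings are plain lists of floats; the class name, the 0.95 threshold, and the one-hour TTL are placeholder choices, and the linear scan would be replaced by an ANN index at any real scale.

```python
import math
import time

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.95, ttl_seconds=3600.0, clock=time.monotonic):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = []  # (embedding, answer, stored_at)

    def put(self, embedding, answer):
        self.entries.append((embedding, answer, self.clock()))

    def get(self, embedding):
        """Return the cached answer of the most similar live entry, or None."""
        now = self.clock()
        self.entries = [e for e in self.entries if now - e[2] <= self.ttl]
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None
```

The threshold is the main risk: too low and users get answers to a different question, too high and the hit rate collapses. The TTL bounds how long a stale answer can survive a corpus update.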

Monitoring and Drift Detection

  • Retrieval drift: Monitor Context Precision/Recall over time. Drop = embedding drift or corpus change.
  • Answer quality: Monitor faithfulness and user feedback signals (thumbs up/down, follow-up questions).
  • Latency SLAs: P50, P95, P99 latency for each pipeline stage. Alert on P99 spikes.
  • Data freshness: Track when documents were last updated. Stale corpus = stale answers.

MLOps for RAG

  • Index versioning: Treat the vector index like a model artifact. Version it with DVC or a custom registry.
  • Embedding model registry: Never change the embedding model without re-indexing. Track model version as metadata.
  • Eval-driven development: Run RAGAS eval suite on every pipeline change before deploying. Gate on faithfulness > 0.85 and NDCG > 0.7 (example thresholds).
  • A/B testing RAG: Test chunking strategies, K values, or embedding models by routing traffic. Use NDCG and answer correctness as guardrail metrics.
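The eval gate can be a small function in CI. A minimal sketch using the example thresholds from the list above; the metric names and the shape of the `metrics` dict are assumptions about what your eval suite emits.

```python
EXAMPLE_THRESHOLDS = {"faithfulness": 0.85, "ndcg": 0.7}

def check_gate(metrics, thresholds=EXAMPLE_THRESHOLDS):
    """Return (passed, failures); a metric must be strictly above its threshold.

    A metric that is missing from the eval output counts as a failure.
    """
    failures = {
        name: metrics.get(name)
        for name, minimum in thresholds.items()
        if metrics.get(name) is None or metrics[name] <= minimum
    }
    return not failures, failures
```

Treating a missing metric as a failure keeps a broken eval job from silently passing the gate.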

Section 14

Interview Q&A — Quick Reference

Practice answering each in under 90 seconds.

Q1

What are the biggest failure modes in RAG and how do you detect them?

The three I've seen most: (1) Retrieval failure — wrong chunks retrieved, detected via low Context Recall in RAGAS or by logging what was retrieved. Fix: hybrid retrieval or fine-tune embeddings. (2) Hallucination — model adds facts not in context, detected via Faithfulness score. Fix: stronger system prompt, citation enforcement. (3) Context overflow — chunks silently truncated. Fix: token counting before LLM call, reduce K, apply re-ranking.

Q2

How do you choose between dense and sparse retrieval?

Default to hybrid. Dense retrieval excels at semantic queries where the user's words differ from document vocabulary. Sparse (BM25) excels at exact keyword matching: technical codes, proper nouns, and domain-specific terms that must match exactly. In a sustainability context, supplier IDs and emissions factor codes must match exactly, so BM25 is essential there. Reciprocal Rank Fusion (RRF) combines both rankings without needing score normalization.

Q3

How do you evaluate a RAG system without ground-truth labels?

Two approaches. First, LLM-as-judge: use a separate LLM to score faithfulness (is every claim in the answer supported by the context?) and relevance (does the answer address the question?). This is the basis of RAGAS. Second, generate synthetic eval sets: use an LLM to generate question-answer pairs from your corpus, then evaluate retrieval against those. Neither is perfect but both are far better than no evaluation.

Q4

How does chunking affect RAG quality?

Chunking is one of the highest-impact decisions. Small chunks improve retrieval precision but lose context for generation. Large chunks preserve context but dilute the retrieval signal. In production I use hierarchical chunking: index small child chunks for retrieval, but return the parent chunk to the LLM. For sustainability reports, I'd also use structure-aware chunking — split at section boundaries since each section covers a distinct topic.
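The hierarchical pattern needs only a parent ID on every child chunk. A minimal sketch, assuming sections are already split at structural boundaries and passed in as a dict of `section_id -> text`; word-based child sizing and both function names are illustrative.

```python
def build_hierarchy(sections, child_size=50):
    """Split each parent section into child chunks of up to child_size words."""
    children = []
    for parent_id, text in sections.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append({
                "parent_id": parent_id,
                "text": " ".join(words[start:start + child_size]),
            })
    return children

def expand_to_parents(matched_children, sections):
    """Return each matched child's parent section once, in match order."""
    seen, parents = set(), []
    for child in matched_children:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(sections[parent_id])
    return parents
```

Only the child texts are embedded and searched. Deduplicating parents matters because several children of the same section often match the same query.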

Q5

What's the difference between faithfulness and answer correctness?

Faithfulness measures whether the answer is grounded in the retrieved context — it can be faithful but still wrong if the retrieved context itself was wrong or incomplete. Answer correctness measures whether the answer matches a ground-truth reference. You need both: a faithful answer built on wrong retrieved information is still a wrong answer.

Q6

How would you build a RAG system for sustainability claim verification?

Pipeline: ingest sustainability reports, regulatory filings, and emissions databases → chunk by document section → embed with a domain-adapted model → index with hybrid retrieval. For a claim like 'Supplier X reduced emissions by 30%', retrieve relevant sections → use a cross-encoder to rank by relevance → feed to LLM with instruction to verify the claim and cite the exact passage. Output includes: verified/not-verified/insufficient-evidence, plus source citations with page numbers.

Section 15

Quick Reference Cheat Sheet

Chunking

  • Fixed-size → simple, cuts mid-sentence
  • Semantic → coherent but expensive
  • Hierarchical → best for production (retrieve small, generate with large)

Retrieval

  • BM25 → exact keyword match, no semantics
  • Dense (bi-encoder) → semantic, pre-computable
  • Hybrid + RRF → default production choice
  • Cross-encoder → re-ranking only, too slow for first-stage

Metrics

  • NDCG → graded relevance, order-aware, standard for search
  • MRR → only first relevant result matters
  • MAP → all relevant results matter, binary relevance
  • Faithfulness → hallucination detection
  • Context Precision → noisy retrieval detection

Vector DBs

  • FAISS → research/offline, no server
  • Chroma → dev/prototyping
  • Weaviate/Qdrant → production with filtering
  • Pinecone → managed cloud SaaS

Failure Modes → Fix

  • Wrong chunks → hybrid retrieval, fine-tune embeddings
  • Hallucination → stronger system prompt, citation enforcement
  • Context overflow → reduce K, re-rank, token count before LLM
  • Query mismatch → HyDE, multi-query

Advanced RAG

  • Self-RAG → selective retrieval
  • CRAG → fallback to web search
  • Agentic RAG → multi-step ReAct retrieval
  • HyDE → generate hypothesis, retrieve on hypothesis