> study notes · amazon applied scientist ii
Probability & Statistics
The mathematical foundations everything in ML is built on — Bayes' theorem, key distributions, MLE, hypothesis testing, confidence intervals, A/B testing, and Bayesian inference.
- Bayes' Theorem
- Distributions
- MLE
- Hypothesis Testing
- Confidence Intervals
- A/B Testing
- Bayesian Inference
- MAP
> Table of Contents
Section 1
Conditional Probability
A and B are two events (outcomes) of a random experiment. Conditional probability asks: given that B happened, how likely is A?
The Inverse Problem — P(B|A) from P(A|B)
Given P(A|B), how do you get P(B|A)? Apply the multiplication law to both orderings of the joint probability P(A∩B):
Law of Total Probability
If events B₁, B₂, …, Bₙ partition the sample space Ω (they are mutually exclusive and exhaustive: Ω = ∪Bᵢ, Bᵢ ∩ Bⱼ = ∅ for i ≠ j), then:
Why this matters for Bayes' theorem
The law of total probability shows where the denominator P(B) in Bayes' theorem comes from — it's not mysterious, it's just the sum over all ways B could have happened. It acts as a normalizing constant ensuring the posterior probabilities sum to 1.
Section 2
Bayes' Theorem
Describes how to update a belief (probability) when new evidence arrives. The mathematical foundation of Bayesian inference, spam filters, medical diagnosis, and RAG fact verification.
Worked Example — Supplier Fraud Detection
Known:
P(Fraud) = 0.05 ← 5% of suppliers commit fraud (prior)
P(Flag | Fraud) = 0.90 ← 90% of fraud cases get flagged
P(Flag | Legit) = 0.10 ← 10% of legit cases also get flagged
Step 1: P(Flag) = P(Flag|Fraud)·P(Fraud) + P(Flag|Legit)·P(Legit)
= 0.90×0.05 + 0.10×0.95 = 0.045 + 0.095 = 0.14
Step 2: P(Fraud | Flag) = 0.90 × 0.05 / 0.14 = 0.321
Even with a flag, only 32% chance of actual fraud.
Why? Fraud is rare (5%) — most flags are false positives.
This is the base rate fallacy — ignoring the prior destroys inference.The Base Rate Fallacy
P(Disease) = 0.001 (1 in 1000)
P(Positive | Disease) = 0.99
P(Positive | Healthy) = 0.01
P(Disease | Positive) = 0.99×0.001 / (0.99×0.001 + 0.01×0.999)
= 0.00099 / 0.01098 = 0.09 = 9%
A 99% accurate test on a rare disease → positive is only 9% likely true.
The rarity of the disease dominates.▸ Interview Answer
Bayes' theorem updates a prior belief P(A) with new evidence B to get the posterior P(A|B). The key insight: you cannot ignore the prior — rare events remain unlikely even after positive tests (base rate fallacy). In sustainability this applies to supplier fraud detection, emissions claim verification, and any Bayesian risk model.
Section 3
Independence
Two events A and B are independent if knowing one occurred gives no extra information about the other.
Concrete Example — Cards
P(A) = 13/52 = 1/4 (13 spades in 52 cards)
P(B) = 4/52 = 1/13 (4 queens in 52 cards)
Are they independent? Check:
P(A ∩ B) = P(queen of spades) = 1/52
P(A)·P(B) = 1/4 × 1/13 = 1/52 ✓
They match → A and B are independent.
Knowing a card is a queen doesn't change the probability it's a spade.Independence vs Mutual Exclusivity
Independent: P(A∩B) = P(A)·P(B)
Knowing A happened tells you nothing about B. Both can happen simultaneously.
Mutually exclusive: P(A∩B) = 0
They cannot both happen. Knowing A happened tells you B did NOT happen — the most extreme dependence.
Reliability — Independence in Practice
When components fail independently, the product rule gives exact reliability calculations.
SERIES — all must succeed (chain fails if any link fails):
P(series success) = (1-p)ⁿ = (0.95)¹⁰ ≈ 0.60
P(series failure) = 1 - (1-p)ⁿ = 1 - 0.95¹⁰ ≈ 0.40
Even with 95% reliable components, a chain of 10
has only 60% chance of working.
PARALLEL — any one succeeding is enough:
P(parallel failure) = pⁿ = (0.05)¹⁰ ≈ 10⁻¹³
P(parallel success) = 1 - pⁿ ≈ 1.0
Redundancy drives failure probability to near zero.ML Connection
Independence assumptions power most of ML. Naive Bayes classifies text by assuming words are independent given the class. MLE treats training examples as independent draws. Bootstrapping in Random Forests relies on approximate independence between trees. When independence is violated (e.g. time series), standard methods break.
Section 4
PDF and CDF
Two fundamental ways to describe a probability distribution. Every distribution you encounter in ML can be described by both.
Discrete vs Continuous Random Variables
| Discrete | Continuous |
|---|---|
| Takes countable values: 0, 1, 2, … | Takes any value in an interval: x ∈ ℝ |
| Described by PMF — Probability Mass Function | Described by PDF — Probability Density Function |
| P(X=k) gives exact probability of each value | P(X=x) = 0 for any single point — only intervals have probability |
| Example: coin flips, audit failure counts | Example: emissions values, model scores, wait times |
PMF — Probability Mass Function (discrete)
PDF — Probability Density Function (continuous)
CDF — Cumulative Distribution Function
CDF in Practice
Why PDF > 1 is not a problem
A PDF value of 3.0 at some point x just means the distribution is very concentrated there. It's a density, not a probability. The probability of a tiny interval [x, x+dx] is f(x)·dx — and for small enough dx this is always ≤ 1. Only areas (integrals) are probabilities.
Section 5
Key Probability Distributions
Gaussian (Normal) — N(μ, σ²)
Gaussian in ML
- →Linear/Logistic Regression: assumes Gaussian errors (residuals)
- →Gaussian Discriminant Analysis: models P(x|y) as Gaussian per class
- →Bayesian inference: Gaussian prior + Gaussian likelihood = Gaussian posterior (conjugate)
- →Neural network weights: initialized from N(0, σ²) by default
Bernoulli — Bernoulli(p)
Binomial — Binomial(N, p)
Poisson — Poisson(λ)
Geometric — Geometric(p)
Memoryless property
P(X > m+n | X > m) = P(X > n). If fraud hasn't been caught in the first m audits, the remaining wait has the same distribution as starting fresh — each trial is independent so the past gives no information about the future.
Exponential — Exp(λ)
The continuous counterpart of the geometric — models the time until the first event in a Poisson process. Where geometric counts discrete trials, exponential measures continuous waiting time.
Memoryless property — unique among continuous distributions
P(X > s+t | X > s) = P(X > t). Having already waited s seconds tells you nothing about how much longer you'll wait. The exponential is the only continuous distribution with this property — the continuous analogue of geometric's memorylessness.
Geometric ↔ Exponential Bridge
| Geometric(p) | Exponential(λ) |
|---|---|
| Discrete — counts trials | Continuous — measures time |
| PMF: (1-p)^(k-1)·p | PDF: λe^(-λx) |
| E[X] = 1/p | E[X] = 1/λ |
| Var(X) = (1-p)/p² | Var(X) = 1/λ² |
| Memoryless: past trials irrelevant | Memoryless: past wait time irrelevant |
| Use: trials until first success | Use: wait time between Poisson events |
Gamma — Gamma(α, β)
Generalises the exponential — models the waiting time until the α-th event in a Poisson process. Where Exponential(λ) is the wait for the 1st event, Gamma(α, λ) is the wait for the α-th event.
Special cases
α = 1: Gamma(1, β) = Exponential(β)
← wait for 1st event = exponential
α = k/2, β = 1/2: Gamma(k/2, 1/2) = Chi-squared(k)
← sum of k squared standard normals
α integer: Erlang distribution
← used in queueing theory (call centre wait times)
As α → ∞: Gamma → Gaussian (by CLT — sum of α exponentials)Supplier has audit failures at rate β = 2 per year → Poisson(2)
X = time until the 3rd failure → X ~ Gamma(α=3, β=2)
E[X] = 3/2 = 1.5 years until the 3rd failure
Var(X) = 3/4 = 0.75, Std = 0.87 years
Compare to Exponential(2) — wait for 1st failure:
E[X] = 1/2 = 0.5 years
Each additional event adds 1/β = 0.5 years on averageConjugate prior for Poisson and Exponential
Gamma is the conjugate prior for both the Poisson rate λ and the Exponential rate λ. Prior Gamma(α, β) + n Poisson observations summing to Σxᵢ → Posterior Gamma(α + Σxᵢ, β + n). This is why Gamma appears throughout Bayesian inference.
Chi-Squared — χ²(k)
The distribution of the sum of squares of k independent standard normal random variables. It is a special case of Gamma and appears everywhere hypothesis testing touches variance.
Start with Z ~ N(0,1) — a standard normal.
Step 1: What is the distribution of Z²?
Z is symmetric around 0, so Z² is always non-negative.
For z > 0, the CDF of Z² is:
P(Z² ≤ z) = P(-√z ≤ Z ≤ √z) = 2Φ(√z) - 1
Differentiate to get the PDF of Z²:
f(z) = (1/√z) · φ(√z) = (1/√(2πz)) · e^(-z/2)
This is Gamma(1/2, 1/2) — also written χ²(1).
Step 2: Sum k independent Z²s.
Let Z₁, Z₂, ..., Zₖ ~ N(0,1) independently.
Define: X = Z₁² + Z₂² + ... + Zₖ²
Each Zᵢ² ~ Gamma(1/2, 1/2)
Sum of k independent Gammas with same β:
Gamma(α₁,β) + Gamma(α₂,β) = Gamma(α₁+α₂, β)
Therefore: X ~ Gamma(k/2, 1/2) = χ²(k) ✓
PDF: f(x) = x^(k/2-1) · e^(-x/2) / (2^(k/2) · Γ(k/2)) for x ≥ 0k = degrees of freedom
E[X] = k
Var(X) = 2k
Shape:
k=1,2: right-skewed, peak near 0
k=3-10: moderate right skew
k→∞: approaches N(k, 2k) by CLT
Additivity: χ²(m) + χ²(n) = χ²(m+n) ← independent sums
Special cases:
k=1: χ²(1) = Z² (square of a standard normal)
k=2: χ²(2) = Exponential(1/2) (appears in survival analysis)Where Chi-squared appears in statistics
- →Chi-squared test: H₀: no association between categorical variables. Test statistic χ² = Σ(O-E)²/E follows χ²(df) under H₀. Used for contingency tables, goodness-of-fit.
- →Sample variance: (n-1)S²/σ² ~ χ²(n-1). This is why hypothesis tests on variance use the chi-squared distribution.
- →Confidence interval for σ²: CI for population variance: [(n-1)S²/χ²(α/2), (n-1)S²/χ²(1-α/2)].
- →Likelihood ratio tests: -2·log(likelihood ratio) → χ²(df) asymptotically. Standard test for model comparison in ML.
Are supplier risk categories uniformly distributed?
Observed: Low=45, Medium=30, High=25 (n=100)
Expected under H₀ (uniform): 33.3, 33.3, 33.3
χ² = (45-33.3)²/33.3 + (30-33.3)²/33.3 + (25-33.3)²/33.3
= 4.11 + 0.33 + 2.07
= 6.51
df = categories - 1 = 2
Critical value χ²(0.05, df=2) = 5.99
6.51 > 5.99 → reject H₀ → distribution is not uniformDistribution Comparison
| Distribution | Models | E[X] | Var(X) |
|---|---|---|---|
| Bernoulli(p) | Single binary outcome | p | p(1-p) |
| Binomial(N,p) | Count of successes in N trials | Np | Np(1-p) |
| Geometric(p) | Trials until first success (discrete) | 1/p | (1-p)/p² |
| Poisson(λ) | Count of rare events in interval | λ | λ |
| Exponential(λ) | Wait time until first event (continuous) | 1/λ | 1/λ² |
| Gamma(α,β) | Wait time until α-th event | α/β | α/β² |
| Chi-squared χ²(k) | Sum of k squared standard normals | k | 2k |
| Gaussian(μ,σ²) | Continuous symmetric outcomes | μ | σ² |
Section 6
Concentration Inequalities: Markov & Chebyshev
Concentration inequalities answer: how likely is a random variable to deviate far from its mean? Markov and Chebyshev form the base of a chain that leads all the way to the CLT — each step uses more information about the distribution to get a tighter bound.
Markov's Inequality
For any non-negative random variable X and any constant a > 0:
Proof
Define the indicator 1{X ≥ a} = 1 if X ≥ a, else 0. Since X ≥ 0 and a > 0:
Geometric intuition
Think of E[X] as the total probability mass spread across the real line. Markov asks: how much of that mass can possibly live at or beyond position a? At most E[X]/a of it. If the mean is 10 and you ask about a = 100, at most 10% of mass can be there. Deliberately loose — it knows only the mean, nothing about the shape. That is the point.
ML applications
- →Generalization bounds: bounding P(empirical risk deviates from expected risk) — the starting point for PAC learning theory
- →Union bound arguments: combine Markov with union bound to bound any bad event across multiple hypotheses
- →Any tail bound where only the mean of a non-negative quantity (loss, error) is known
Chebyshev's Inequality
For any random variable X with finite mean μ and variance σ², and any k > 0:
Proof — Chebyshev is Markov applied to (X − μ)²
Quadratic improvement over Markov
Markov decays as 1/k (linear). Chebyshev decays as 1/k² (quadratic) — each standard deviation of distance you move out, the probability bound drops as the square. Chebyshev also applies to variables that can be negative, unlike Markov. It holds for ANY distribution with finite mean and variance — Gaussian, heavy-tailed, discrete, skewed. The cost of this generality is looseness: for N(0,1), Chebyshev gives P(|X|≥3) ≤ 1/9 ≈ 0.11 versus the true value 0.003.
ML applications
- →Confidence intervals without distributional assumptions: any estimator with known variance satisfies Chebyshev bounds
- →Bias-variance tradeoff: Var(estimator) controls how concentrated estimates are around the mean
- →Dropout as variance reduction: Chebyshev explains why reducing prediction variance improves worst-case guarantees
- →Online learning: bounding regret in terms of variance of loss functions
Markov vs Chebyshev — Head-to-Head
| Property | Markov | Chebyshev |
|---|---|---|
| Requires | X ≥ 0 and E[X] | Finite μ and σ² (any sign) |
| Bound form | E[X] / a | σ² / k² |
| Decay rate | 1/k (linear) | 1/k² (quadratic) |
| Sign restriction | X must be non-negative | No restriction |
| Generalizes to | Chernoff bounds (via MGF) | LLN (applied to X̄) |
Section 7
Law of Large Numbers
The formal guarantee that more data always helps. As the sample size grows, the sample mean concentrates around the true mean — this is the theoretical foundation of every supervised learning algorithm.
Statement
Let X₁, X₂, …, Xₙ be i.i.d. with mean μ and finite variance σ². Define the sample mean:
WLLN: convergence in probability. SLLN: almost sure convergence (strictly stronger). For interviews, WLLN is the one to prove.
Proof of WLLN via Chebyshev
The key structural insight
Averaging n i.i.d. variables does two things simultaneously: (1) preserves the mean — E[X̄ₙ] = μ always; (2) shrinks the variance at rate 1/n. This is why more data always helps — not a heuristic, a theorem. The distribution of X̄ₙ does not shift; it collapses onto a single point μ.
ML Applications
- →Empirical risk minimization: E_empirical[loss] → E_true[loss] as dataset size grows. The entire foundation of supervised learning.
- →Monte Carlo estimation: Averaging n samples of f(X) converges to E[f(X)]. Used everywhere in RL, variational inference, MCMC.
- →Mini-batch gradient descent: The mini-batch gradient is an unbiased estimator of the full-batch gradient. LLN says it concentrates as batch size grows.
- →Why validation sets work: Empirical accuracy on held-out data converges to true accuracy. LLN is the formal justification.
Section 8
Moment Generating Functions (MGF)
The MGF is the expectation E[e^{tX}], a single function whose n-th derivative at t=0 returns the n-th moment. Because it turns sums of independent variables into products, it drives both the CLT proof and exponentially tight Chernoff bounds.
Definition
Key Properties
- →Uniqueness: If two distributions have the same MGF in a neighborhood of 0, they are identical. MGF uniquely identifies a distribution.
- →Independence → multiplication: If X and Y are independent:Convolution in distribution space becomes multiplication in MGF space.
- →Scaling:
MGFs of Common Distributions
| Distribution | MGF M_X(t) |
|---|---|
| Bernoulli(p) | 1 − p + p·eᵗ |
| Binomial(n, p) | (1 − p + p·eᵗ)ⁿ |
| Poisson(λ) | exp(λ(eᵗ − 1)) |
| Normal(μ, σ²) | exp(μt + σ²t²/2) |
| Exponential(λ) | λ/(λ − t), t < λ |
Chernoff Bounds — Exponential Tails
Apply Markov to e^{tX} rather than X itself — since e^{tX} is non-negative, Markov applies, and optimizing over t gives exponentially tight tail bounds:
Chernoff bounds decay exponentially in a, vs Chebyshev's polynomial 1/a². Tightness ladder: Markov (1/a) ≪ Chebyshev (1/a²) ≪ Chernoff (e^{−ca}).
ML applications
- →PAC learning and boosting theory (AdaBoost proof): Chernoff bounds give exponentially tight generalization guarantees
- →Log partition function in exponential family models: log M_X(t) — the cumulant generating function — has derivatives that give the mean and variance directly
- →Variational inference: KL divergence between exponential family distributions derives from MGF structure
Section 9
Central Limit Theorem
One of the most powerful results in probability theory. It explains why the Gaussian distribution appears everywhere — and why statistical inference works even when you don't know the underlying distribution.
Why It Matters
- →Hypothesis tests (Z-test, t-test) use normal/t distributions even when the data is not normal — valid for large n by CLT
- →Confidence intervals are symmetric around the mean — justified by CLT
- →MLE estimates are approximately Gaussian for large n — enables asymptotic inference
- →With n ≥ 30, you can almost always use the normal approximation in practice
Binomial → Gaussian
A Binomial(n, p) is the sum of n independent Bernoulli(p) random variables. By CLT, this sum converges to a Gaussian as n grows.
X ~ Binomial(n, p)
E[X] = np
Var(X) = np(1-p)
CLT applies because X = X₁ + X₂ + ... + Xₙ, each Xᵢ ~ Bernoulli(p)
As n → ∞:
X ~ N(np, np(1-p)) approximately
Standardized:
Z = (X - np) / √(np(1-p)) → N(0,1)
When does the approximation hold?
Rule of thumb: np ≥ 5 AND n(1-p) ≥ 5
Examples:
n=10, p=0.5: np=5 ← borderline, still skewed
n=30, p=0.5: np=15 ← good approximation
n=100, p=0.1: np=10 ← good (both conditions met)
n=100, p=0.01: np=1 ← BAD — use Poisson insteadVisualising the convergence — Binomial(n, 0.5)
n = 5
discrete, lumpy
n = 20
smoother bell
n = 100
nearly Gaussian
As n ↑: discrete mass concentrates into a continuous bell curve
Poisson as a Limit of Binomial
Poisson is not just related to Binomial by analogy — it is exactly the Binomial in the limit n→∞, p→0 with the product np = λ held fixed. Here is the derivation.
X ~ Binomial(n, p)
P(X = k) = C(n,k) · pᵏ · (1-p)^(n-k)
Let p = λ/n (so that np = λ is fixed as n → ∞)
Substitute:
P(X = k) = C(n,k) · (λ/n)ᵏ · (1 - λ/n)^(n-k)C(n,k) · (λ/n)ᵏ = [n! / (k!(n-k)!)] · λᵏ/nᵏ
= λᵏ/k! · [n(n-1)(n-2)···(n-k+1)] / nᵏ
= λᵏ/k! · [1 · (1-1/n) · (1-2/n) ··· (1-(k-1)/n)]
As n → ∞, each factor (1 - j/n) → 1 for fixed j, so:
→ λᵏ / k!(1 - λ/n)^(n-k) = (1 - λ/n)ⁿ · (1 - λ/n)^(-k)
As n → ∞:
(1 - λ/n)ⁿ → e^(-λ) ← definition of e (or limit of compound interest)
(1 - λ/n)^(-k) → 1 ← k is fixed, λ/n → 0
Together: (1 - λ/n)^(n-k) → e^(-λ)P(X = k) = C(n,k) · (λ/n)ᵏ · (1 - λ/n)^(n-k)
→ (λᵏ / k!) · e^(-λ)
= e^(-λ) λᵏ / k! ← Poisson(λ) PMF ✓
The Binomial(n, λ/n) converges to Poisson(λ) as n → ∞.The physical interpretation
Divide a time interval into n tiny sub-intervals. In each, an event can happen with probability p = λ/n (small). Sub-intervals are independent. The total count of events is Binomial(n, λ/n). As n→∞ (continuous time, infinitesimally small intervals) with average rate λ fixed, the count converges to Poisson(λ). This is why Poisson models counts of rare independent events — server requests per second, audit violations per year, radioactive decays per minute.
Poisson → Gaussian
A Poisson(λ) can be thought of as the sum of many independent rare events. As λ grows, CLT kicks in and the distribution becomes Gaussian.
X ~ Poisson(λ)
E[X] = λ, Var(X) = λ (both equal λ)
A Poisson(λ) is the limit of Binomial(n, p) as n→∞, p→0, np=λ.
Applying CLT to that Binomial:
As λ → ∞:
X ~ N(λ, λ) approximately
Standardized:
Z = (X - λ) / √λ → N(0, 1)
When does the approximation hold?
Rule of thumb: λ ≥ 10
Examples:
λ = 1: very skewed, right tail dominant — CLT has not kicked in
λ = 5: still noticeably skewed
λ = 10: reasonable approximation
λ = 30: excellent approximation, nearly symmetric bellWhy Poisson is right-skewed for small λ
Poisson(λ=1): P(X=0) = 0.368, P(X=1) = 0.368, P(X=2) = 0.184, P(X=3) = 0.061 … Most of the probability mass is near 0, with a long right tail. Mean = Variance = 1. As λ grows, the distribution can spread out symmetrically on both sides — the left tail has room to develop — and the bell shape emerges.
The Three Limits — Summary
| Distribution | Limit condition | Approaches | Approximation rule |
|---|---|---|---|
| Binomial(n, p) | n → ∞, p fixed | N(np, np(1-p)) | np ≥ 5 and n(1-p) ≥ 5 |
| Binomial(n, p) | n → ∞, p → 0, np = λ | Poisson(λ) | n ≥ 20, p ≤ 0.05 |
| Poisson(λ) | λ → ∞ | N(λ, λ) | λ ≥ 10 |
Proof via MGFs
This is the cleanest proof. It assumes the MGF exists in a neighborhood of 0.
What the CLT Does NOT Say
- →It does NOT say the distribution of individual Xᵢ is Gaussian — inputs can be anything
- →It does NOT say convergence is fast — heavy-tailed distributions may need very large n
- →It does NOT apply without finite variance — Cauchy distribution (no finite variance) does not satisfy CLT
- →The Berry-Esseen theorem quantifies the finite-n error: sup_z |P(Zₙ ≤ z) − Φ(z)| ≤ C·E[|Y|³]/√n ≈ 0.48/√n
The Full Chain: Markov → CLT
All five results are answers to the same question: how concentrated is a random variable around its mean?
| Result | Role in the Chain |
|---|---|
| Markov | Tail bound from mean alone. Requires only E[X] ≥ 0. Most primitive. |
| Chebyshev | Markov applied to (X−μ)². Quadratic improvement using variance. |
| Law of Large Numbers | Chebyshev on X̄ₙ, whose variance is σ²/n → 0. Shows X̄ₙ → μ. |
| MGF | Encodes all moments. Converts distributional convergence into pointwise function convergence. |
| CLT | MGF of standardized sum → e^{t²/2}. Fluctuations around μ are always Gaussian at rate 1/√n. |
▸ Interview Answer
CLT states that the sample mean of n iid variables with mean μ and variance σ² converges to N(μ, σ²/n) regardless of the original distribution. This is why normal approximations work: Binomial(n,p) → N(np, np(1-p)) when np≥5 and n(1-p)≥5, and Poisson(λ) → N(λ,λ) when λ≥10. In both cases the discrete distribution becomes a continuous bell curve because the sum of many small independent contributions — however distributed — always converges to Gaussian. The MGF proof shows WHY: the MGF of the standardized sum converges pointwise to e^{t²/2}, the unique MGF of the standard normal.
Section 10
Random Sampling & Parameter Estimation
Random sampling is the process of selecting a subset from a population such that every sample has a known, non-zero probability of being selected. It is the mechanism that makes statistical inference possible — allowing you to draw conclusions about a population from a subset.
Why It Matters
Without random sampling, estimates are biased. The classic failure: the 1936 Literary Digest poll predicted Landon over Roosevelt by sampling phone book and car registration owners — systematically missing poor voters. The sample size was 2.4 million. Still wrong. Bias cannot be fixed by more data — only by better sampling.
Simple Random Sampling (SRS)
Every subset of size n from a population of size N is equally likely. Each individual has selection probability n/N. The sample mean is an unbiased estimator of the population mean:
The factor (1 − n/N) is the finite population correction — it vanishes when n ≪ N, recovering the standard σ²/n from the LLN.
Sample Mean — Estimating μ
x̄ is an unbiased estimator of μ — no systematic over- or under-estimate.
Sample Variance — Estimating σ² (Bessel's Correction)
The naive estimator dividing by n is biased because x̄ is computed from the same sample — it is closer to each xᵢ than μ is, systematically underestimating spread.
Dividing by n−1 instead of n corrects for the lost degree of freedom. Once x̄ is fixed, only n−1 of the deviations (xᵢ−x̄) are free — the last is determined by Σ(xᵢ−x̄) = 0.
Geometric intuition: n data points live in ℝⁿ. Computing x̄ projects onto a 1-dimensional subspace — the constraint Σ(xᵢ−x̄) = 0 confines residuals to an (n−1)-dimensional subspace. Variance is computed in this subspace, so we divide by its true dimension n−1.
Standard Error of the Mean
The variance of x̄ as an estimator — how much it fluctuates across different samples:
For small n the t-distribution accounts for extra uncertainty from estimating σ with s — heavier tails than Gaussian, converging to Gaussian as n grows.
Summary
| Quantity | Population (unknown) | Sample estimator | Biased? |
|---|---|---|---|
| Mean | No | ||
| Variance | No | ||
| Std dev | Slightly yes* | ||
| Std error | No |
*s is slightly biased for σ because E[√s²] ≠ √E[s²] by Jensen's inequality — bias is O(1/n) and negligible in practice.
▸ Interview Answer
Bias cannot be fixed with more data — only with better sampling. x̄ is unbiased for μ. s² divides by n−1 (Bessel's correction) because x̄ is estimated from the same data, consuming one degree of freedom. The SE = s/√n measures how precisely x̄ estimates μ; divided by SE, x̄ follows a t-distribution for small n and converges to standard normal as n→∞ by CLT.
Section 11
Information Theory for ML
The mathematical backbone of cross-entropy loss, KL divergence, decision tree splits, and feature selection. Understanding these concepts lets you derive why standard ML losses are the right ones, not just memorize them.
1. Information Content (Surprise)
Three axioms uniquely determine this formula: certain events carry zero information I(1)=0; rarer events carry more; independent events are additive I(p₁·p₂) = I(p₁)+I(p₂). Only −log p satisfies all three. Log base 2 → bits. Natural log → nats.
| Event | p | Surprise (bits) |
|---|---|---|
| Fair coin heads | 0.5 | 1.0 |
| Rolling a 6 | 0.167 | 2.58 |
| 1-in-1000 event | 0.001 | 9.97 |
| Certain event | 1.0 | 0.0 |
2. Entropy — Average Surprise
- →H(P) ≥ 0 always
- →H(P) = 0 when outcome is certain
- →H(P) maximized when distribution is uniform — maximum uncertainty
- →Irreducible: you cannot compress data below H(P) (Shannon's source coding theorem)
3. Cross-Entropy — Cost of Being Wrong
Measures average surprise when model Q encodes events from true distribution P. Equal to H(P) only when Q=P exactly. Three equivalent justifications for using it as the classification loss:
- →Information theory: Minimizes bits wasted encoding reality with your model.
- →Maximum likelihood: Minimizing cross-entropy = maximizing likelihood — same objective, two framings.
- →Gradient: Gradient of cross-entropy w.r.t. logits after softmax is simply (ŷ − y) — clean and stable.
4. NLL — Negative Log-Likelihood
- →NLL = Cross-Entropy for classification: Identical formula — MLE framing vs information theory framing.
- →NLL = MSE for regression: Under Gaussian noise assumption: taking the log of the Gaussian PDF and negating yields MSE plus a constant. If residuals aren't Gaussian, MSE is the wrong loss.
5. KL Divergence — Distance Between Distributions
Extra surprise from using Q instead of P. Not symmetric: DKL(P‖Q) ≠ DKL(Q‖P) — not a true distance metric.
- →min D_KL(P‖Q): Q covers all modes of P — mean seeking, spreads mass everywhere P has mass.
- →min D_KL(Q‖P): Q collapses onto one mode of P — mode seeking, ignores regions where Q is near zero. VAEs minimize this direction.
6. Conditional Entropy & Information Gain
Decision trees: at each node pick the feature X that maximizes IG — maximum label uncertainty removed. H(Y|X)=0 when X fully determines Y; H(Y|X)=H(Y) when X tells you nothing.
7. Mutual Information
- →I(X;Y) ≥ 0 always
- →I(X;Y) = 0 iff X and Y are independent — joint factorizes into product of marginals
- →I(X;Y) = I(Y;X) — symmetric, unlike KL
- →I(X;X) = H(X) — a variable shares all information with itself
Pearson Correlation
Measures linear relationship between X and Y:
- →Range: [−1, 1] — ρ=1 perfect positive linear, ρ=0 no linear relationship (nonlinear dependence can still exist)
- →Assumes continuous variables; sensitive to outliers
Spearman Correlation
Pearson applied to ranks instead of raw values:
- →Catches any monotonic relationship — if X goes up, Y consistently goes up or down, regardless of linearity
- →Robust to outliers (ranks are bounded); works on ordinal data
- →Example: Y = X³ — perfectly monotonic, ρₛ = 1.0, but Pearson ρ < 1 because the relationship is nonlinear
Feature selection comparison
| Method | Formula | Detects | Categorical? | Outlier Robust? |
|---|---|---|---|---|
| Pearson | Cov(X,Y) / σₓσᵧ | Linear only | No | No |
| Spearman | Pearson on ranks | Monotonic | Ordinal only | Yes |
| Mutual Information | Σ p(x,y) log p(x,y)/p(x)p(y) | Any dependency | Yes | Yes |
8. Label Smoothing — Entropy as Regularization
Hard one-hot targets have H(P)=0 — cross-entropy pushes logits toward infinity, causing overconfidence. Label smoothing (ε=0.1) softens targets: [0,0,1,0] → [0.025, 0.025, 0.925, 0.025]. The second term directly penalizes overconfident predictions — regularization through entropy. Prevents logits growing unbounded, improves calibration, reduces train/inference gap.
Everything connected in one line
Cross-entropy = irreducible uncertainty + cost of model being wrong. Training minimizes the second term. H(P) is the floor you can never beat.
Section 12
Maximum Likelihood Estimation (MLE)
Finds the parameter values that make the observed data most probable. The foundation of most ML training — log loss, MSE, and cross-entropy all derive from MLE.
MLE for Gaussian — Deriving Mean and Variance
MLE for Bernoulli — Why Log Loss
Properties of MLE
- →Consistency: As n→∞, θ_MLE → θ_true. Converges to truth with more data.
- →Asymptotic normality: For large n, θ_MLE ~ N(θ_true, I(θ)⁻¹) where I is Fisher information.
- →Efficiency: Achieves the Cramér-Rao lower bound — lowest possible variance among unbiased estimators.
- →Invariance: If θ_MLE is MLE of θ, then g(θ_MLE) is MLE of g(θ).
▸ Interview Answer
MLE finds parameters that maximize the probability of observing the training data. It connects directly to loss functions: MLE on Gaussian data gives MSE loss, MLE on Bernoulli data gives log loss. Log-likelihood is used in practice because products of probabilities underflow numerically and logs turn products into sums. MLE is consistent and asymptotically efficient but can overfit on small datasets — adding a prior gives MAP estimation.
Section 13
Hypothesis Testing
A formal framework for deciding whether observed data provides enough evidence to reject a default assumption. The foundation of A/B testing and scientific inference.
Step 1: State hypotheses
H₀ (null): default assumption — no effect, no difference
H₁ (alternative): what you want to prove
Step 2: Choose significance level α
α = P(reject H₀ | H₀ is true) = false positive rate
Standard: α = 0.05
Step 3: Compute test statistic
Summarize data into a single number.
Under H₀ this statistic follows a known distribution.
Step 4: Compute p-value
p-value = P(observing data this extreme | H₀ is true)
Small p-value → data unlikely under H₀ → evidence against H₀
Step 5: Decision
p-value < α → reject H₀ (statistically significant)
p-value ≥ α → fail to reject H₀ (not enough evidence)
IMPORTANT: failing to reject H₀ ≠ proving H₀Type I and Type II Errors
| Type I Error (α) | Type II Error (β) | |
|---|---|---|
| Also called | False Positive | False Negative |
| Definition | Reject H₀ when H₀ is true | Fail to reject H₀ when H₁ is true |
| Probability | α (you choose this) | β (depends on effect size, n) |
| Fix | Lower α | Increase sample size |
| Example | Flag clean supplier as fraudulent | Miss fraudulent supplier |
Power = 1 − β
P(correctly rejecting H₀ when H₁ is true). Increases with larger n, larger true effect size, lower data variance. Standard: aim for power ≥ 0.80.
Common Test Statistics
p-value Misconception
p-value is NOT the probability that H₀ is true. It is P(data this extreme | H₀ is true). A p-value of 0.03 does not mean 97% chance the effect is real — it means the data would occur 3% of the time if H₀ were true.
Section 14
Confidence Intervals
A confidence interval is a random interval constructed from sample data that contains the true population parameter with a specified probability — the confidence level 1−α.
Correct vs Wrong Interpretation
✗ Wrong
"There is a 95% probability that μ is in this interval." — μ is fixed (not random). This interval either contains it or it doesn't.
✓ Correct
If you repeated the experiment infinitely many times, 100(1−α)% of the constructed intervals would contain the true μ. The interval is random, not the parameter.
Construction from CLT
For large n, by CLT the standardized sample mean is approximately standard normal. We find zᵉ⁄₂ such that the central probability equals 1−α, then rearrange to isolate μ:
Worked Example — NFL Punter
A punter historically averages 41 yards/punt with SD 8 yards. What is the probability their next 40 punts average at least 45 yards?
The same standardization step — divide by SE = σ/√n, subtract μ — is exactly what CI construction inverts: instead of computing a probability from a fixed x̄, the CI finds all μ values consistent with the observed x̄.
Critical values for common levels
| Confidence Level | α | zᵉ⁄₂ |
|---|---|---|
| 90% | 0.10 | 1.645 |
| 95% | 0.05 | 1.960 |
| 99% | 0.01 | 2.576 |
Small Samples — t-Distribution
When n is small, σ is unknown and estimated by s — adding uncertainty. The statistic follows a t-distribution with n−1 degrees of freedom:
tₙ₋₁ has heavier tails than N(0,1) — reflecting the extra uncertainty from estimating σ with s. As n→∞, tₙ₋₁ → N(0,1) and the two intervals converge.
Width of the CI — Three Levers
- →Confidence level ↑ → z_{α/2} larger → wider interval. More confidence costs precision.
- →Sample size n ↑ → width shrinks at 1/√n. To halve the width, quadruple n.
- →Population variance σ² ↑ → wider interval. More variable population needs more data.
Connection to Hypothesis Testing — Duality
CI and hypothesis testing are dual formulations of the same question. A two-sided test of H₀: μ = μ₀ at level α rejects iff μ₀ falls outside the 1−α CI:
The CI is the set of all μ₀ values you would fail to reject at level α. They encode identical information — different presentations of the same inference.
Bootstrap Confidence Intervals
When CLT is unreliable — small n, heavy tails, complex statistics like AUC or F1 — use the bootstrap:
- →Resample B datasets of size n with replacement from your sample.
- →Compute the statistic θ̂⁽ᵇ⁾ on each resample.
- →The empirical distribution of {θ̂⁽ᵇ⁾} approximates the sampling distribution.
Take the α/2 and 1−α/2 quantiles of the bootstrap distribution. Justification: by the Glivenko-Cantelli theorem, the empirical CDF F̂ₙ → F uniformly as n→∞, so the bootstrap distribution converges to the true sampling distribution.
Frequentist CI vs Bayesian Credible Interval
| Frequentist CI | Bayesian Credible Interval | |
|---|---|---|
| What is random | The interval | The parameter |
| Interpretation | 95% of such intervals contain μ | P(μ ∈ interval | data) = 0.95 |
| Requires prior | No | Yes |
| Correct statement | "The procedure captures μ 95% of the time" | "μ is in this interval with 95% probability" |
The Bayesian credible interval is what most people intuitively think a frequentist CI means — but it requires a prior and gives a genuinely probabilistic statement about the parameter.
▸ Interview Answer
A CI is a random interval — the parameter is fixed, the interval varies across samples. 95% CI means the construction procedure captures μ 95% of the time, not that this specific interval has 95% probability. Use z-intervals for large n; t-intervals for small n or unknown σ (heavier tails, n−1 df). Width = 2z·s/√n — to halve it, quadruple n. CIs and two-sided hypothesis tests are duals: reject H₀ at α ↔ μ₀ outside 1−α CI. Bootstrap CIs work for any statistic when CLT fails.
Q&A
Section 15
A/B Testing in Production
The gold standard for measuring causal effects of changes. Connects directly to hypothesis testing and is the primary tool for evaluating ML model changes in production.
Step 1: Define hypothesis
H₀: new model has same conversion rate as old
H₁: new model has different conversion rate
Step 2: Choose metrics
Primary: conversion rate, CTR, revenue (one only)
Guardrail: latency, error rate (must not degrade)
Step 3: Calculate required sample size
n = 2·(z_α + z_β)²·p̄(1-p̄) / δ²
where p̄ = baseline rate, δ = minimum detectable effect
Example: p̄=0.10, δ=0.01, z_α=1.96, z_β=0.84
n = 2 × 7.84 × 0.09 / 0.0001 ≈ 14,112 per group
Step 4: Run experiment
Randomly assign users; same user always gets same variant
Run for full business cycles (at least 2 weeks)
Step 5: Analyze — compute p-value, CI, check guardrails
Step 6: Ship if p < α AND guardrails okCommon Pitfalls
- →Peeking problem: Checking results before planned sample size inflates false positive rate. Checking 5 times at α=0.05 → true FPR ~23%. Fix: pre-register stopping rule, or use sequential testing.
- →Multiple testing: Testing 20 metrics at α=0.05 → expect 1 false positive by chance. Fix: Bonferroni correction (α/k) or use one primary metric.
- →Novelty effect: Users engage more with anything new. Treatment looks better in week 1, returns to baseline by week 3. Fix: run at least 2 full business cycles.
- →Network effects: Users in control and treatment interact. Randomizing individuals violates independence. Fix: cluster randomization.
- →Simpson's paradox: Aggregate results can reverse when segmented due to unequal group sizes. Fix: stratified randomization.
A/B Testing ML Models Specifically
- →Shadow mode first: Run new model in parallel, log both outputs, compare offline — no user impact, catches bugs before live traffic.
- →Canary deployment: Route 5% traffic to new model. Monitor latency, error rate, prediction distribution. Expand if guardrails hold.
- →Holdout groups: Keep permanent 10% holdout on old model to measure long-term impact after full rollout.
- →Offline ≠ online: Offline AUC/F1 on test set doesn't always predict online impact. Always run the A/B test even if offline metrics look great.
▸ Interview Answer
A/B testing requires: pre-specifying hypotheses and metrics before looking at data, calculating sample size for desired power (80%) at chosen significance (α=0.05), randomizing correctly (same user always same variant), running for full business cycles to avoid novelty effects, and reporting confidence intervals not just p-values. The peeking problem is the most common mistake — checking early inflates false positives because you're running multiple implicit tests.
Section 16
Bayesian Inference
A framework for updating beliefs using data. Unlike frequentist statistics (fixed estimates), Bayesian inference gives a full distribution over parameters — capturing uncertainty explicitly.
Frequentist vs Bayesian
| Frequentist | Bayesian |
|---|---|
| Parameters are fixed, unknown constants | Parameters are random variables with distributions |
| Probability = long-run frequency of events | Probability = degree of belief |
| Point estimates + confidence intervals | Full posterior distribution |
| P(data | θ) — likelihood only | P(θ | data) ∝ P(data | θ) · P(θ) |
| No prior knowledge incorporated | Prior encodes existing knowledge |
| 95% CI: 95% of intervals contain θ | 95% credible interval: P(θ in interval | data) = 0.95 |
Posterior ∝ Likelihood × Prior
P(θ | X) = P(X | θ) · P(θ) / P(X) ← P(X) is normalizing constant
Sequential update:
Day 1: P(θ | X₁) ∝ P(X₁|θ) · P(θ)
Day 2: P(θ | X₁,X₂) ∝ P(X₂|θ) · P(θ|X₁)
Bayesian inference is naturally online and sequential.Conjugate Priors
A conjugate prior gives a posterior with the same functional form — enabling closed-form updates without numerical integration.
BETA–BERNOULLI (most important for A/B testing):
Prior: θ ~ Beta(α, β)
Posterior: θ|X ~ Beta(α + successes, β + failures)
Example: Prior Beta(2,20) → 3 frauds in 50 audits
Posterior: Beta(5, 67), mean = 5/72 = 6.9%
GAMMA–POISSON:
Prior: λ ~ Gamma(α, β)
Posterior: λ|X ~ Gamma(α + Σxᵢ, β + n)
GAUSSIAN–GAUSSIAN:
Prior: μ ~ N(μ₀, σ₀²)
Posterior: μ|X ~ N(μₙ, σₙ²)
μₙ = weighted average of prior mean and data meanBayesian A/B Testing
After observing data:
A: 1000 visitors, 100 conversions → θ_A ~ Beta(101, 901)
B: 1000 visitors, 120 conversions → θ_B ~ Beta(121, 881)
Compute P(θ_B > θ_A | data) by Monte Carlo:
samples_A = np.random.beta(101, 901, size=100000)
samples_B = np.random.beta(121, 881, size=100000)
P_B_better = (samples_B > samples_A).mean()
# ≈ 0.87 → "B is better with 87% probability"
Also compute:
Expected lift = (samples_B - samples_A).mean()
P(lift > 1%) = (samples_B - samples_A > 0.01).mean()
Advantages over frequentist:
Direct probability statement: "B is better with 87% probability"
No sample size pre-specification needed
Can stop early with full probability interpretationMAP — Maximum A Posteriori
▸ Interview Answer
Bayesian inference treats parameters as distributions, not fixed unknowns. Start with a prior belief P(θ), observe data, update to posterior P(θ|data) ∝ P(data|θ)·P(θ). Conjugate priors (Beta-Bernoulli, Gaussian-Gaussian) give closed-form posteriors. Bayesian A/B testing gives direct probability statements ("variant B is better with 87% probability") unlike frequentist testing. MAP = MLE with regularization — Gaussian prior = L2, Laplace prior = L1.
Section 17
Interview Q&A — Quick Reference
Practice answering each in under 90 seconds.
Q1
What is the difference between p-value and confidence interval?
A p-value answers: 'Is the effect statistically different from zero?' (yes/no). A confidence interval answers: 'How large is the effect and how uncertain are we?' CI is more informative — a significant p-value on its own doesn't tell you if the effect is practically meaningful. Always report both: 'The conversion rate increased by 1.2% (95% CI: 0.3%–2.1%, p=0.008)'.
Q2
How do you determine sample size for an A/B test?
Specify: significance level α (usually 0.05), desired power 1-β (usually 0.80), baseline rate p̄, and minimum detectable effect δ. Formula: n = 2(z_α + z_β)²p̄(1-p̄)/δ². The MDE is the key practical decision — detecting a 1% lift requires far fewer samples than detecting 0.1%. Smaller effects require exponentially more samples.
Q3
What is the difference between MLE and MAP?
MLE maximizes the likelihood P(data|θ) — no prior. MAP maximizes the posterior P(θ|data) ∝ P(data|θ)·P(θ) — includes a prior. MAP is MLE with regularization: a Gaussian prior gives L2 regularization, a Laplace prior gives L1. With infinite data both converge to the same answer. With limited data, MAP gives better-regularized estimates.
Q4
When would you use Bayesian A/B testing instead of frequentist?
Bayesian when: you want direct probability statements about which variant is better, you have strong prior knowledge to incorporate, you need to stop early with valid inference, or you want to quantify expected lift not just significance. Frequentist when: you need strict Type I error control for regulatory purposes, the team is more familiar with p-values, or you're running many simultaneous tests where error rate control is critical.
Q5
Explain the Central Limit Theorem and why it matters.
CLT states that the sample mean X̄ of n iid random variables with mean μ and variance σ² converges in distribution to N(μ, σ²/n) as n→∞, regardless of the original distribution. This is why hypothesis tests use normal or t distributions even for non-normal data (large n), confidence intervals are symmetric, and MLE estimates are approximately Gaussian for large samples. With n≥30 you can almost always use the normal approximation.
Section 18
Quick Reference Cheat Sheet
Distributions
- ▸Bernoulli(p): single binary. E=p, Var=p(1-p)
- ▸Binomial(N,p): count in N trials. E=Np, Var=Np(1-p)
- ▸Poisson(λ): rare events. E=λ, Var=λ (both equal λ)
- ▸Gaussian(μ,σ²): bell curve. 68-95-99.7 rule
Key Formulas
- ▸Bayes: P(A|B) = P(B|A)·P(A) / P(B)
- ▸MLE: θ* = argmax Σᵢ log P(xᵢ|θ)
- ▸MAP: θ* = argmax Σᵢ log P(xᵢ|θ) + log P(θ)
- ▸95% CI for mean: x̄ ± 1.96·σ/√n
- ▸95% CI for proportion: p̂ ± 1.96·√(p̂(1-p̂)/n)
A/B Test Checklist
- ▸Pre-register: hypothesis, metric, n, α — before peeking
- ▸One primary metric (avoid multiple testing)
- ▸Minimum 2 business cycles (novelty effect)
- ▸Same user → same variant always (hash on user_id)
- ▸Report CI + effect size, not just p-value
- ▸Check guardrail metrics (latency, errors)
Bayesian One-Liners
- ▸Prior + Likelihood → Posterior
- ▸Beta prior + Bernoulli likelihood → Beta posterior
- ▸MAP = MLE + regularization
- ▸Gaussian prior = L2 regularization
- ▸Laplace prior = L1 regularization
- ▸Bayesian CI = credible interval: P(θ in interval | data) = 95%
Hypothesis Testing
- ▸Type I (α): false positive — reject true H₀
- ▸Type II (β): false negative — miss real effect
- ▸Power = 1-β: target ≥ 0.80
- ▸p-value ≠ P(H₀ is true)
- ▸Fail to reject H₀ ≠ accept H₀
Common Mistakes
- ▸Base rate fallacy → ignoring the prior
- ▸Peeking → inflates false positive rate
- ▸p-value misinterpretation → not P(H₀ true)
- ▸CI misinterpretation → not a probability about one interval
- ▸Novelty effect → run 2+ business cycles