> study notes · amazon applied scientist ii

Probability & Statistics

The mathematical foundations everything in ML is built on — Bayes' theorem, key distributions, MLE, hypothesis testing, confidence intervals, A/B testing, and Bayesian inference.

Bayes' Theorem
Distributions
MLE
Hypothesis Testing
Confidence Intervals
A/B Testing
Bayesian Inference
MAP

> Table of Contents

01Conditional Probability 02Bayes' Theorem 03Independence 04PDF and CDF 05Key Distributions 06Concentration Inequalities 07Law of Large Numbers 08Moment Generating Functions 09Central Limit Theorem 10Random Sampling & Estimation 11Information Theory 12Maximum Likelihood (MLE)13Hypothesis Testing 14Confidence Intervals 15A/B Testing 16Bayesian Inference 17Interview Q&A 18Cheat Sheet

Section 1

Conditional Probability

A and B are two events (outcomes) of a random experiment. Conditional probability asks: given that B happened, how likely is A?

Definition

P(A \mid B) = \frac{P(A \cap B)}{P(B)}

P(A \cap B) = P(A \mid B) \cdot P(B) \quad \text{(Multiplication Law)}

The Inverse Problem — P(B|A) from P(A|B)

Given P(A|B), how do you get P(B|A)? Apply the multiplication law to both orderings of the joint probability P(A∩B):

Deriving Bayes' theorem from conditional probability

P(A \cap B) = P(A \mid B) \cdot P(B) = P(B \mid A) \cdot P(A)

P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)} \quad \text{(Bayes' theorem)}

Law of Total Probability

If events B₁, B₂, …, Bₙ partition the sample space Ω (they are mutually exclusive and exhaustive: Ω = ∪Bᵢ, Bᵢ ∩ Bⱼ = ∅ for i ≠ j), then:

Law of total probability

P(A) = \sum_i P(A \mid B_i) \cdot P(B_i)

P(A) = P(A \mid B) \cdot P(B) + P(A \mid B^c) \cdot P(B^c) \quad \text{(simplest case)}

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B \mid A)\cdot P(A) + P(B \mid \neg A)\cdot P(\neg A)}

Why this matters for Bayes' theorem

The law of total probability shows where the denominator P(B) in Bayes' theorem comes from — it's not mysterious, it's just the sum over all ways B could have happened. It acts as a normalizing constant ensuring the posterior probabilities sum to 1.

Section 2

Bayes' Theorem

Describes how to update a belief (probability) when new evidence arrives. The mathematical foundation of Bayesian inference, spam filters, medical diagnosis, and RAG fact verification.

The formula

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

P(B) = P(B \mid A)\cdot P(A) + P(B \mid \neg A)\cdot P(\neg A) \quad \text{(law of total probability)}

Worked Example — Supplier Fraud Detection

P(Fraud | Flag) = ?

Known:
  P(Fraud)        = 0.05   ← 5% of suppliers commit fraud (prior)
  P(Flag | Fraud) = 0.90   ← 90% of fraud cases get flagged
  P(Flag | Legit) = 0.10   ← 10% of legit cases also get flagged

Step 1: P(Flag) = P(Flag|Fraud)·P(Fraud) + P(Flag|Legit)·P(Legit)
              = 0.90×0.05 + 0.10×0.95  =  0.045 + 0.095  =  0.14

Step 2: P(Fraud | Flag) = 0.90 × 0.05 / 0.14  =  0.321

Even with a flag, only 32% chance of actual fraud.
Why? Fraud is rare (5%) — most flags are false positives.
This is the base rate fallacy — ignoring the prior destroys inference.

The Base Rate Fallacy

99% accurate test on a rare disease

P(Disease)           = 0.001   (1 in 1000)
P(Positive | Disease) = 0.99
P(Positive | Healthy) = 0.01

P(Disease | Positive) = 0.99×0.001 / (0.99×0.001 + 0.01×0.999)
                      = 0.00099 / 0.01098  =  0.09  =  9%

A 99% accurate test on a rare disease → positive is only 9% likely true.
The rarity of the disease dominates.

▸ Interview Answer

Bayes' theorem updates a prior belief P(A) with new evidence B to get the posterior P(A|B). The key insight: you cannot ignore the prior — rare events remain unlikely even after positive tests (base rate fallacy). In sustainability this applies to supplier fraud detection, emissions claim verification, and any Bayesian risk model.

Section 3

Independence

Two events A and B are independent if knowing one occurred gives no extra information about the other.

Definition — two equivalent forms

P(A \mid B) = P(A) \;\Leftrightarrow\; P(B \mid A) = P(B)

P(A \cap B) = P(A) \cdot P(B) \quad \text{(product rule for independent events)}

Concrete Example — Cards

A = spade, B = queen

P(A) = 13/52 = 1/4    (13 spades in 52 cards)
P(B) = 4/52  = 1/13   (4 queens in 52 cards)

Are they independent? Check:
  P(A ∩ B) = P(queen of spades) = 1/52
  P(A)·P(B) = 1/4 × 1/13 = 1/52  ✓

They match → A and B are independent.
Knowing a card is a queen doesn't change the probability it's a spade.

Independence vs Mutual Exclusivity

Independent: P(A∩B) = P(A)·P(B)

Knowing A happened tells you nothing about B. Both can happen simultaneously.

Mutually exclusive: P(A∩B) = 0

They cannot both happen. Knowing A happened tells you B did NOT happen — the most extreme dependence.

Reliability — Independence in Practice

When components fail independently, the product rule gives exact reliability calculations.

Series vs parallel systems (n=10, p=0.05 per component)

SERIES — all must succeed (chain fails if any link fails):
  P(series success) = (1-p)ⁿ = (0.95)¹⁰ ≈ 0.60
  P(series failure) = 1 - (1-p)ⁿ = 1 - 0.95¹⁰ ≈ 0.40

  Even with 95% reliable components, a chain of 10
  has only 60% chance of working.

PARALLEL — any one succeeding is enough:
  P(parallel failure) = pⁿ = (0.05)¹⁰ ≈ 10⁻¹³
  P(parallel success) = 1 - pⁿ  ≈  1.0

  Redundancy drives failure probability to near zero.

ML Connection

Independence assumptions power most of ML. Naive Bayes classifies text by assuming words are independent given the class. MLE treats training examples as independent draws. Bootstrapping in Random Forests relies on approximate independence between trees. When independence is violated (e.g. time series), standard methods break.

Section 4

PDF and CDF

Two fundamental ways to describe a probability distribution. Every distribution you encounter in ML can be described by both.

Discrete vs Continuous Random Variables

Discrete	Continuous
Takes countable values: 0, 1, 2, …	Takes any value in an interval: x ∈ ℝ
Described by PMF — Probability Mass Function	Described by PDF — Probability Density Function
P(X=k) gives exact probability of each value	P(X=x) = 0 for any single point — only intervals have probability
Example: coin flips, audit failure counts	Example: emissions values, model scores, wait times

PMF — Probability Mass Function (discrete)

PMF — Definition and properties

p(k) = P(X = k),\quad p(k) \geq 0,\quad \sum_k p(k) = 1

PDF — Probability Density Function (continuous)

PDF — Definition and properties

f(x) \geq 0,\quad \int_{-\infty}^{+\infty} f(x)\,dx = 1

P(a \leq X \leq b) = \int_a^b f(x)\,dx \quad \text{(area under the curve)}

f(x) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{x^2}{2}\right) \quad \text{(Gaussian } N(0,1)\text{)}

CDF — Cumulative Distribution Function

CDF — Definition

F(x) = P(X \leq x)

\text{Discrete:}\; F(x) = \sum_{k \leq x} p(k),\qquad \text{Continuous:}\; F(x) = \int_{-\infty}^{x} f(t)\,dt

f(x) = \frac{dF(x)}{dx},\quad P(a \leq X \leq b) = F(b) - F(a),\quad P(X > a) = 1 - F(a)

CDF in Practice

Z-score and hypothesis testing

\Phi(z) = P(Z \leq z) \quad \text{(Standard Normal CDF)}

Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}},\quad p\text{-value (two-tailed)} = 2(1 - \Phi(|Z|))

\Phi^{-1}(0.975) = 1.96 \;\text{(95\% CI)},\quad \Phi^{-1}(0.995) = 2.576 \;\text{(99\% CI)}

Why PDF > 1 is not a problem

A PDF value of 3.0 at some point x just means the distribution is very concentrated there. It's a density, not a probability. The probability of a tiny interval [x, x+dx] is f(x)·dx — and for small enough dx this is always ≤ 1. Only areas (integrals) are probabilities.

Section 5

Key Probability Distributions

Gaussian (Normal) — N(μ, σ²)

Gaussian PDF and properties

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Z = \frac{X - \mu}{\sigma} \sim N(0,1),\qquad \bar{X} \sim N\!\left(\mu,\,\frac{\sigma^2}{n}\right) \;\text{(CLT)}

Gaussian in ML

→Linear/Logistic Regression: assumes Gaussian errors (residuals)
→Gaussian Discriminant Analysis: models P(x|y) as Gaussian per class
→Bayesian inference: Gaussian prior + Gaussian likelihood = Gaussian posterior (conjugate)
→Neural network weights: initialized from N(0, σ²) by default

Bernoulli — Bernoulli(p)

Bernoulli PMF

P(X=x) = p^x(1-p)^{1-x} \quad x \in \{0,1\}

\mathbb{E}[X] = p,\quad \text{Var}(X) = p(1-p)

Binomial — Binomial(N, p)

Binomial PMF

P(X=k) = \binom{N}{k} p^k (1-p)^{N-k}

\mathbb{E}[X] = Np,\quad \text{Var}(X) = Np(1-p)

Poisson — Poisson(λ)

Poisson PMF

P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}

\mathbb{E}[X] = \lambda,\quad \text{Var}(X) = \lambda \quad \text{(both equal }\lambda\text{)}

Geometric — Geometric(p)

Geometric PMF

P(X = k) = (1-p)^{k-1} \cdot p \quad k = 1, 2, 3, \ldots

\mathbb{E}[X] = \frac{1}{p},\quad \text{Var}(X) = \frac{1-p}{p^2}

Memoryless property

P(X > m+n | X > m) = P(X > n). If fraud hasn't been caught in the first m audits, the remaining wait has the same distribution as starting fresh — each trial is independent so the past gives no information about the future.

Exponential — Exp(λ)

The continuous counterpart of the geometric — models the time until the first event in a Poisson process. Where geometric counts discrete trials, exponential measures continuous waiting time.

Exponential PDF and CDF

f(x) = \lambda e^{-\lambda x},\quad F(x) = 1 - e^{-\lambda x} \quad (x \geq 0)

\mathbb{E}[X] = \frac{1}{\lambda},\quad \text{Var}(X) = \frac{1}{\lambda^2}

Memoryless property — unique among continuous distributions

P(X > s+t | X > s) = P(X > t). Having already waited s seconds tells you nothing about how much longer you'll wait. The exponential is the only continuous distribution with this property — the continuous analogue of geometric's memorylessness.

Geometric ↔ Exponential Bridge

Geometric(p)	Exponential(λ)
Discrete — counts trials	Continuous — measures time
PMF: (1-p)^(k-1)·p	PDF: λe^(-λx)
E[X] = 1/p	E[X] = 1/λ
Var(X) = (1-p)/p²	Var(X) = 1/λ²
Memoryless: past trials irrelevant	Memoryless: past wait time irrelevant
Use: trials until first success	Use: wait time between Poisson events

Gamma — Gamma(α, β)

Generalises the exponential — models the waiting time until the α-th event in a Poisson process. Where Exponential(λ) is the wait for the 1st event, Gamma(α, λ) is the wait for the α-th event.

Gamma PDF and properties

f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \quad (x \geq 0)

\mathbb{E}[X] = \frac{\alpha}{\beta},\quad \text{Var}(X) = \frac{\alpha}{\beta^2}

\Gamma(n) = (n-1)!\;,\quad \Gamma(1/2) = \sqrt{\pi}\;,\quad \Gamma(\alpha+1) = \alpha\,\Gamma(\alpha)

Special cases

How Gamma generalises other distributions

α = 1:          Gamma(1, β) = Exponential(β)
                ← wait for 1st event = exponential

α = k/2, β = 1/2: Gamma(k/2, 1/2) = Chi-squared(k)
                ← sum of k squared standard normals

α integer:      Erlang distribution
                ← used in queueing theory (call centre wait times)

As α → ∞:       Gamma → Gaussian  (by CLT — sum of α exponentials)

Example — time until 3rd audit failure

Supplier has audit failures at rate β = 2 per year → Poisson(2)
X = time until the 3rd failure  →  X ~ Gamma(α=3, β=2)

E[X] = 3/2 = 1.5 years until the 3rd failure
Var(X) = 3/4 = 0.75,   Std = 0.87 years

Compare to Exponential(2) — wait for 1st failure:
  E[X] = 1/2 = 0.5 years
  Each additional event adds 1/β = 0.5 years on average

Conjugate prior for Poisson and Exponential

Gamma is the conjugate prior for both the Poisson rate λ and the Exponential rate λ. Prior Gamma(α, β) + n Poisson observations summing to Σxᵢ → Posterior Gamma(α + Σxᵢ, β + n). This is why Gamma appears throughout Bayesian inference.

Chi-Squared — χ²(k)

The distribution of the sum of squares of k independent standard normal random variables. It is a special case of Gamma and appears everywhere hypothesis testing touches variance.

Derivation from Normal — step by step

Start with Z ~ N(0,1) — a standard normal.

Step 1: What is the distribution of Z²?

  Z is symmetric around 0, so Z² is always non-negative.
  For z > 0, the CDF of Z² is:
    P(Z² ≤ z) = P(-√z ≤ Z ≤ √z) = 2Φ(√z) - 1

  Differentiate to get the PDF of Z²:
    f(z) = (1/√z) · φ(√z) = (1/√(2πz)) · e^(-z/2)

  This is Gamma(1/2, 1/2) — also written χ²(1).

Step 2: Sum k independent Z²s.

  Let Z₁, Z₂, ..., Zₖ ~ N(0,1) independently.
  Define: X = Z₁² + Z₂² + ... + Zₖ²

  Each Zᵢ² ~ Gamma(1/2, 1/2)
  Sum of k independent Gammas with same β:
    Gamma(α₁,β) + Gamma(α₂,β) = Gamma(α₁+α₂, β)

  Therefore: X ~ Gamma(k/2, 1/2) = χ²(k)  ✓

PDF: f(x) = x^(k/2-1) · e^(-x/2) / (2^(k/2) · Γ(k/2))   for x ≥ 0

Properties

k = degrees of freedom

E[X] = k
Var(X) = 2k

Shape:
  k=1,2:  right-skewed, peak near 0
  k=3-10: moderate right skew
  k→∞:    approaches N(k, 2k)  by CLT

Additivity: χ²(m) + χ²(n) = χ²(m+n)  ← independent sums

Special cases:
  k=1: χ²(1) = Z²  (square of a standard normal)
  k=2: χ²(2) = Exponential(1/2)  (appears in survival analysis)

Where Chi-squared appears in statistics

→Chi-squared test: H₀: no association between categorical variables. Test statistic χ² = Σ(O-E)²/E follows χ²(df) under H₀. Used for contingency tables, goodness-of-fit.
→Sample variance: (n-1)S²/σ² ~ χ²(n-1). This is why hypothesis tests on variance use the chi-squared distribution.
→Confidence interval for σ²: CI for population variance: [(n-1)S²/χ²(α/2), (n-1)S²/χ²(1-α/2)].
→Likelihood ratio tests: -2·log(likelihood ratio) → χ²(df) asymptotically. Standard test for model comparison in ML.

Example — goodness-of-fit test

Are supplier risk categories uniformly distributed?
Observed: Low=45, Medium=30, High=25  (n=100)
Expected under H₀ (uniform): 33.3, 33.3, 33.3

χ² = (45-33.3)²/33.3 + (30-33.3)²/33.3 + (25-33.3)²/33.3
   = 4.11 + 0.33 + 2.07
   = 6.51

df = categories - 1 = 2
Critical value χ²(0.05, df=2) = 5.99

6.51 > 5.99  →  reject H₀  →  distribution is not uniform

Distribution Comparison

Distribution	Models	E[X]	Var(X)
Bernoulli(p)	Single binary outcome	p	p(1-p)
Binomial(N,p)	Count of successes in N trials	Np	Np(1-p)
Geometric(p)	Trials until first success (discrete)	1/p	(1-p)/p²
Poisson(λ)	Count of rare events in interval	λ	λ
Exponential(λ)	Wait time until first event (continuous)	1/λ	1/λ²
Gamma(α,β)	Wait time until α-th event	α/β	α/β²
Chi-squared χ²(k)	Sum of k squared standard normals	k	2k
Gaussian(μ,σ²)	Continuous symmetric outcomes	μ	σ²

Section 6

Concentration Inequalities: Markov & Chebyshev

Concentration inequalities answer: how likely is a random variable to deviate far from its mean? Markov and Chebyshev form the base of a chain that leads all the way to the CLT — each step uses more information about the distribution to get a tighter bound.

Markov's Inequality

For any non-negative random variable X and any constant a > 0:

Markov's Inequality

P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}

Proof

Define the indicator 1{X ≥ a} = 1 if X ≥ a, else 0. Since X ≥ 0 and a > 0:

One-line indicator proof

a \cdot \mathbf{1}\{X \geq a\} \leq X \cdot \mathbf{1}\{X \geq a\} \leq X

\text{Take expectations: } a \cdot P(X \geq a) \leq \mathbb{E}[X] \implies P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}

Geometric intuition

Think of E[X] as the total probability mass spread across the real line. Markov asks: how much of that mass can possibly live at or beyond position a? At most E[X]/a of it. If the mean is 10 and you ask about a = 100, at most 10% of mass can be there. Deliberately loose — it knows only the mean, nothing about the shape. That is the point.

ML applications

→Generalization bounds: bounding P(empirical risk deviates from expected risk) — the starting point for PAC learning theory
→Union bound arguments: combine Markov with union bound to bound any bad event across multiple hypotheses
→Any tail bound where only the mean of a non-negative quantity (loss, error) is known

Chebyshev's Inequality

For any random variable X with finite mean μ and variance σ², and any k > 0:

Chebyshev's Inequality

P(|X - \mu| \geq k) \leq \frac{\text{Var}(X)}{k^2} = \frac{\sigma^2}{k^2}

\text{Equivalently, for } k = c\sigma: \quad P(|X - \mu| \geq c\sigma) \leq \frac{1}{c^2}

Proof — Chebyshev is Markov applied to (X − μ)²

Derivation

\text{Step 1: Apply Markov to } (X-\mu)^2 \text{ with } a = k^2:

P\!\left((X-\mu)^2 \geq k^2\right) \leq \frac{\mathbb{E}[(X-\mu)^2]}{k^2}

\text{Step 2: LHS} = P(|X-\mu| \geq k), \quad \text{Step 3: } \mathbb{E}[(X-\mu)^2] = \sigma^2

\therefore\; P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2} \quad \square

Quadratic improvement over Markov

Markov decays as 1/k (linear). Chebyshev decays as 1/k² (quadratic) — each standard deviation of distance you move out, the probability bound drops as the square. Chebyshev also applies to variables that can be negative, unlike Markov. It holds for ANY distribution with finite mean and variance — Gaussian, heavy-tailed, discrete, skewed. The cost of this generality is looseness: for N(0,1), Chebyshev gives P(|X|≥3) ≤ 1/9 ≈ 0.11 versus the true value 0.003.

ML applications

→Confidence intervals without distributional assumptions: any estimator with known variance satisfies Chebyshev bounds
→Bias-variance tradeoff: Var(estimator) controls how concentrated estimates are around the mean
→Dropout as variance reduction: Chebyshev explains why reducing prediction variance improves worst-case guarantees
→Online learning: bounding regret in terms of variance of loss functions

Markov vs Chebyshev — Head-to-Head

Property	Markov	Chebyshev
Requires	X ≥ 0 and E[X]	Finite μ and σ² (any sign)
Bound form	E[X] / a	σ² / k²
Decay rate	1/k (linear)	1/k² (quadratic)
Sign restriction	X must be non-negative	No restriction
Generalizes to	Chernoff bounds (via MGF)	LLN (applied to X̄)

Section 7

Law of Large Numbers

The formal guarantee that more data always helps. As the sample size grows, the sample mean concentrates around the true mean — this is the theoretical foundation of every supervised learning algorithm.

Statement

Let X₁, X₂, …, Xₙ be i.i.d. with mean μ and finite variance σ². Define the sample mean:

WLLN and SLLN

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i

\textbf{WLLN (Weak Law): } \forall\,\epsilon > 0,\quad P(|\bar{X}_n - \mu| \geq \epsilon) \to 0 \text{ as } n \to \infty

\textbf{SLLN (Strong Law): } P\!\left(\lim_{n\to\infty} \bar{X}_n = \mu\right) = 1

WLLN: convergence in probability. SLLN: almost sure convergence (strictly stronger). For interviews, WLLN is the one to prove.

Proof of WLLN via Chebyshev

3-step proof

\text{Step 1: } \mathbb{E}[\bar{X}_n] = \frac{1}{n}\sum \mathbb{E}[X_i] = \mu

\text{Step 2 (independence, covariance terms vanish): } \text{Var}(\bar{X}_n) = \frac{1}{n^2}\sum \text{Var}(X_i) = \frac{\sigma^2}{n}

\text{Step 3 (Chebyshev on } \bar{X}_n \text{): } P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\,\epsilon^2} \xrightarrow{n\to\infty} 0 \quad \square

The key structural insight

Averaging n i.i.d. variables does two things simultaneously: (1) preserves the mean — E[X̄ₙ] = μ always; (2) shrinks the variance at rate 1/n. This is why more data always helps — not a heuristic, a theorem. The distribution of X̄ₙ does not shift; it collapses onto a single point μ.

ML Applications

→Empirical risk minimization: E_empirical[loss] → E_true[loss] as dataset size grows. The entire foundation of supervised learning.
→Monte Carlo estimation: Averaging n samples of f(X) converges to E[f(X)]. Used everywhere in RL, variational inference, MCMC.
→Mini-batch gradient descent: The mini-batch gradient is an unbiased estimator of the full-batch gradient. LLN says it concentrates as batch size grows.
→Why validation sets work: Empirical accuracy on held-out data converges to true accuracy. LLN is the formal justification.

Section 8

Moment Generating Functions (MGF)

The MGF is the expectation E[e^{tX}], a single function whose n-th derivative at t=0 returns the n-th moment. Because it turns sums of independent variables into products, it drives both the CLT proof and exponentially tight Chernoff bounds.

Definition

MGF

M_X(t) = \mathbb{E}[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx

\text{Expanding } e^{tX}: \quad M_X(t) = 1 + t\,\mathbb{E}[X] + \frac{t^2}{2!}\mathbb{E}[X^2] + \frac{t^3}{3!}\mathbb{E}[X^3] + \cdots

M_X^{(n)}(0) = \mathbb{E}[X^n] \quad \text{(n-th derivative at 0 gives the n-th moment)}

Key Properties

→Uniqueness: If two distributions have the same MGF in a neighborhood of 0, they are identical. MGF uniquely identifies a distribution.
→
Independence → multiplication: If X and Y are independent:
$M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$
Convolution in distribution space becomes multiplication in MGF space.
→
Scaling:
$M_{aX+b}(t) = e^{bt} \cdot M_X(at)$

MGFs of Common Distributions

Distribution	MGF M_X(t)
Bernoulli(p)	1 − p + p·eᵗ
Binomial(n, p)	(1 − p + p·eᵗ)ⁿ
Poisson(λ)	exp(λ(eᵗ − 1))
Normal(μ, σ²)	exp(μt + σ²t²/2)
Exponential(λ)	λ/(λ − t), t < λ

Chernoff Bounds — Exponential Tails

Apply Markov to e^{tX} rather than X itself — since e^{tX} is non-negative, Markov applies, and optimizing over t gives exponentially tight tail bounds:

Chernoff bound derivation

P(X \geq a) = P(e^{tX} \geq e^{ta}) \leq \frac{\mathbb{E}[e^{tX}]}{e^{ta}} = \frac{M_X(t)}{e^{ta}} \quad \forall\,t > 0

P(X \geq a) \leq \min_{t > 0}\, M_X(t)\cdot e^{-ta} \quad \text{(optimize over } t \text{)}

Chernoff bounds decay exponentially in a, vs Chebyshev's polynomial 1/a². Tightness ladder: Markov (1/a) ≪ Chebyshev (1/a²) ≪ Chernoff (e^{−ca}).

ML applications

→PAC learning and boosting theory (AdaBoost proof): Chernoff bounds give exponentially tight generalization guarantees
→Log partition function in exponential family models: log M_X(t) — the cumulant generating function — has derivatives that give the mean and variance directly
→Variational inference: KL divergence between exponential family distributions derives from MGF structure

Section 9

Central Limit Theorem

One of the most powerful results in probability theory. It explains why the Gaussian distribution appears everywhere — and why statistical inference works even when you don't know the underlying distribution.

The CLT theorem

\bar{X}_n = \frac{1}{n}\sum_i X_i,\quad \mathbb{E}[X_i]=\mu,\quad \text{Var}(X_i)=\sigma^2

\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0,1) \quad \text{as } n \to \infty

\bar{X}_n \sim N\!\left(\mu,\,\frac{\sigma^2}{n}\right) \quad \text{(approximately, large }n\text{)}

Why It Matters

→Hypothesis tests (Z-test, t-test) use normal/t distributions even when the data is not normal — valid for large n by CLT
→Confidence intervals are symmetric around the mean — justified by CLT
→MLE estimates are approximately Gaussian for large n — enables asymptotic inference
→With n ≥ 30, you can almost always use the normal approximation in practice

Binomial → Gaussian

A Binomial(n, p) is the sum of n independent Bernoulli(p) random variables. By CLT, this sum converges to a Gaussian as n grows.

The De Moivre–Laplace theorem

X ~ Binomial(n, p)

E[X] = np
Var(X) = np(1-p)

CLT applies because X = X₁ + X₂ + ... + Xₙ, each Xᵢ ~ Bernoulli(p)

As n → ∞:
  X ~ N(np, np(1-p))   approximately

Standardized:
  Z = (X - np) / √(np(1-p))  →  N(0,1)

When does the approximation hold?
  Rule of thumb: np ≥ 5  AND  n(1-p) ≥ 5

Examples:
  n=10,  p=0.5:  np=5  ← borderline, still skewed
  n=30,  p=0.5:  np=15 ← good approximation
  n=100, p=0.1:  np=10 ← good (both conditions met)
  n=100, p=0.01: np=1  ← BAD — use Poisson instead

Visualising the convergence — Binomial(n, 0.5)

n = 5

discrete, lumpy

n = 20

smoother bell

n = 100

nearly Gaussian

As n ↑: discrete mass concentrates into a continuous bell curve

Poisson as a Limit of Binomial

Poisson is not just related to Binomial by analogy — it is exactly the Binomial in the limit n→∞, p→0 with the product np = λ held fixed. Here is the derivation.

Setup

X ~ Binomial(n, p)
P(X = k) = C(n,k) · pᵏ · (1-p)^(n-k)

Let p = λ/n  (so that np = λ is fixed as n → ∞)
Substitute:
P(X = k) = C(n,k) · (λ/n)ᵏ · (1 - λ/n)^(n-k)

Step 1 — expand C(n,k) · (λ/n)ᵏ

C(n,k) · (λ/n)ᵏ = [n! / (k!(n-k)!)] · λᵏ/nᵏ

             = λᵏ/k! · [n(n-1)(n-2)···(n-k+1)] / nᵏ

             = λᵏ/k! · [1 · (1-1/n) · (1-2/n) ··· (1-(k-1)/n)]

As n → ∞, each factor (1 - j/n) → 1  for fixed j, so:

             → λᵏ / k!

Step 2 — handle (1 - λ/n)^(n-k)

(1 - λ/n)^(n-k) = (1 - λ/n)ⁿ · (1 - λ/n)^(-k)

As n → ∞:
  (1 - λ/n)ⁿ  →  e^(-λ)       ← definition of e (or limit of compound interest)
  (1 - λ/n)^(-k) → 1           ← k is fixed, λ/n → 0

Together: (1 - λ/n)^(n-k)  →  e^(-λ)

Step 3 — combine

P(X = k) = C(n,k) · (λ/n)ᵏ · (1 - λ/n)^(n-k)

         →   (λᵏ / k!)  ·  e^(-λ)

         =   e^(-λ) λᵏ / k!   ← Poisson(λ) PMF  ✓

The Binomial(n, λ/n) converges to Poisson(λ) as n → ∞.

The physical interpretation

Divide a time interval into n tiny sub-intervals. In each, an event can happen with probability p = λ/n (small). Sub-intervals are independent. The total count of events is Binomial(n, λ/n). As n→∞ (continuous time, infinitesimally small intervals) with average rate λ fixed, the count converges to Poisson(λ). This is why Poisson models counts of rare independent events — server requests per second, audit violations per year, radioactive decays per minute.

Poisson → Gaussian

A Poisson(λ) can be thought of as the sum of many independent rare events. As λ grows, CLT kicks in and the distribution becomes Gaussian.

Poisson CLT limit

X ~ Poisson(λ)

E[X] = λ,   Var(X) = λ   (both equal λ)

A Poisson(λ) is the limit of Binomial(n, p) as n→∞, p→0, np=λ.
Applying CLT to that Binomial:

As λ → ∞:
  X ~ N(λ, λ)   approximately

Standardized:
  Z = (X - λ) / √λ  →  N(0, 1)

When does the approximation hold?
  Rule of thumb: λ ≥ 10

Examples:
  λ = 1:   very skewed, right tail dominant — CLT has not kicked in
  λ = 5:   still noticeably skewed
  λ = 10:  reasonable approximation
  λ = 30:  excellent approximation, nearly symmetric bell

Why Poisson is right-skewed for small λ

Poisson(λ=1): P(X=0) = 0.368, P(X=1) = 0.368, P(X=2) = 0.184, P(X=3) = 0.061 … Most of the probability mass is near 0, with a long right tail. Mean = Variance = 1. As λ grows, the distribution can spread out symmetrically on both sides — the left tail has room to develop — and the bell shape emerges.

The Three Limits — Summary

Distribution	Limit condition	Approaches	Approximation rule
Binomial(n, p)	n → ∞, p fixed	N(np, np(1-p))	np ≥ 5 and n(1-p) ≥ 5
Binomial(n, p)	n → ∞, p → 0, np = λ	Poisson(λ)	n ≥ 20, p ≤ 0.05
Poisson(λ)	λ → ∞	N(λ, λ)	λ ≥ 10

Proof via MGFs

This is the cleanest proof. It assumes the MGF exists in a neighborhood of 0.

5-step MGF proof

\textbf{Step 1:} \text{ Standardize. Define } Y_i = (X_i-\mu)/\sigma, \text{ so } \mathbb{E}[Y_i]=0,\, \text{Var}(Y_i)=1.

Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n Y_i

\textbf{Step 2:} \text{ By independence and scaling, } M_{Z_n}(t) = \left[M_Y\!\left(\tfrac{t}{\sqrt{n}}\right)\right]^n

\textbf{Step 3:} \text{ Taylor expand (since } \mathbb{E}[Y]=0,\,\mathbb{E}[Y^2]=1\text{):}

M_Y(s) = 1 + s\cdot\mathbb{E}[Y] + \tfrac{s^2}{2}\mathbb{E}[Y^2] + O(s^3) = 1 + \tfrac{s^2}{2} + O(s^3)

\textbf{Step 4:} \text{ Substitute } s = t/\sqrt{n}\text{:}\quad M_Y\!\left(\tfrac{t}{\sqrt{n}}\right) = 1 + \frac{t^2}{2n} + O(n^{-3/2})

\textbf{Step 5:} \lim_{n\to\infty} M_{Z_n}(t) = \lim_{n\to\infty}\!\left(1 + \frac{t^2}{2n}\right)^{\!n} = e^{t^2/2}

e^{t^2/2} \text{ is exactly the MGF of } N(0,1)\text{. By MGF uniqueness: } Z_n \xrightarrow{d} N(0,1)\quad\square

What the CLT Does NOT Say

→It does NOT say the distribution of individual Xᵢ is Gaussian — inputs can be anything
→It does NOT say convergence is fast — heavy-tailed distributions may need very large n
→It does NOT apply without finite variance — Cauchy distribution (no finite variance) does not satisfy CLT
→The Berry-Esseen theorem quantifies the finite-n error: sup_z |P(Zₙ ≤ z) − Φ(z)| ≤ C·E[|Y|³]/√n ≈ 0.48/√n

The Full Chain: Markov → CLT

All five results are answers to the same question: how concentrated is a random variable around its mean?

Result	Role in the Chain
Markov	Tail bound from mean alone. Requires only E[X] ≥ 0. Most primitive.
Chebyshev	Markov applied to (X−μ)². Quadratic improvement using variance.
Law of Large Numbers	Chebyshev on X̄ₙ, whose variance is σ²/n → 0. Shows X̄ₙ → μ.
MGF	Encodes all moments. Converts distributional convergence into pointwise function convergence.
CLT	MGF of standardized sum → e^{t²/2}. Fluctuations around μ are always Gaussian at rate 1/√n.

▸ Interview Answer

CLT states that the sample mean of n iid variables with mean μ and variance σ² converges to N(μ, σ²/n) regardless of the original distribution. This is why normal approximations work: Binomial(n,p) → N(np, np(1-p)) when np≥5 and n(1-p)≥5, and Poisson(λ) → N(λ,λ) when λ≥10. In both cases the discrete distribution becomes a continuous bell curve because the sum of many small independent contributions — however distributed — always converges to Gaussian. The MGF proof shows WHY: the MGF of the standardized sum converges pointwise to e^{t²/2}, the unique MGF of the standard normal.

Section 10

Random Sampling & Parameter Estimation

Random sampling is the process of selecting a subset from a population such that every sample has a known, non-zero probability of being selected. It is the mechanism that makes statistical inference possible — allowing you to draw conclusions about a population from a subset.

Why It Matters

Without random sampling, estimates are biased. The classic failure: the 1936 Literary Digest poll predicted Landon over Roosevelt by sampling phone book and car registration owners — systematically missing poor voters. The sample size was 2.4 million. Still wrong. Bias cannot be fixed by more data — only by better sampling.

Simple Random Sampling (SRS)

Every subset of size n from a population of size N is equally likely. Each individual has selection probability n/N. The sample mean is an unbiased estimator of the population mean:

SRS — selection probability and sampling distribution of X̄

P(\text{any sample of size }n) = \frac{1}{\binom{N}{n}}

\mathbb{E}[\bar{X}] = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}\!\left(1 - \frac{n}{N}\right)

The factor (1 − n/N) is the finite population correction — it vanishes when n ≪ N, recovering the standard σ²/n from the LLN.

Sample Mean — Estimating μ

Definition and unbiasedness proof

\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i

\mathbb{E}[\bar{x}] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[x_i] = \frac{1}{n}\cdot n\mu = \mu \quad\checkmark

x̄ is an unbiased estimator of μ — no systematic over- or under-estimate.

Sample Variance — Estimating σ² (Bessel's Correction)

The naive estimator dividing by n is biased because x̄ is computed from the same sample — it is closer to each xᵢ than μ is, systematically underestimating spread.

The biased estimator and why it fails

\tilde{s}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2

(x_i - \bar{x}) = (x_i - \mu) - (\bar{x} - \mu)

\sum_i(x_i-\bar{x})^2 = \sum_i(x_i-\mu)^2 - n(\bar{x}-\mu)^2

\mathbb{E}\!\left[\sum_i(x_i-\bar{x})^2\right] = n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\sigma^2

\therefore\;\mathbb{E}[\tilde{s}^2] = \frac{(n-1)\sigma^2}{n} \neq \sigma^2

Bessel's correction — unbiased estimator

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

\mathbb{E}[s^2] = \frac{1}{n-1}\cdot(n-1)\sigma^2 = \sigma^2 \quad\checkmark

Dividing by n−1 instead of n corrects for the lost degree of freedom. Once x̄ is fixed, only n−1 of the deviations (xᵢ−x̄) are free — the last is determined by Σ(xᵢ−x̄) = 0.

Geometric intuition: n data points live in ℝⁿ. Computing x̄ projects onto a 1-dimensional subspace — the constraint Σ(xᵢ−x̄) = 0 confines residuals to an (n−1)-dimensional subspace. Variance is computed in this subspace, so we divide by its true dimension n−1.

Standard Error of the Mean

The variance of x̄ as an estimator — how much it fluctuates across different samples:

SE and its sampling distribution

\text{SE}(\bar{x}) = \frac{s}{\sqrt{n}}

\frac{\bar{x}-\mu}{\text{SE}(\bar{x})} \;\sim\; t_{n-1} \;\xrightarrow{n\to\infty}\; \mathcal{N}(0,1)

For small n the t-distribution accounts for extra uncertainty from estimating σ with s — heavier tails than Gaussian, converging to Gaussian as n grows.

Summary

Quantity	Population (unknown)	Sample estimator	Biased?
Mean	$μ$	$\bar{x}=\tfrac{1}{n}\textstyle\sum x_i$	No
Variance	$σ²$	$s^2=\tfrac{1}{n-1}\textstyle\sum(x_i-\bar{x})^2$	No
Std dev	$σ$	$s=\sqrt{s^2}$	Slightly yes*
Std error	$\sigma/\sqrt{n}$	$s/\sqrt{n}$	No

*s is slightly biased for σ because E[√s²] ≠ √E[s²] by Jensen's inequality — bias is O(1/n) and negligible in practice.

▸ Interview Answer

Bias cannot be fixed with more data — only with better sampling. x̄ is unbiased for μ. s² divides by n−1 (Bessel's correction) because x̄ is estimated from the same data, consuming one degree of freedom. The SE = s/√n measures how precisely x̄ estimates μ; divided by SE, x̄ follows a t-distribution for small n and converges to standard normal as n→∞ by CLT.

Section 11

Information Theory for ML

The mathematical backbone of cross-entropy loss, KL divergence, decision tree splits, and feature selection. Understanding these concepts lets you derive why standard ML losses are the right ones, not just memorize them.

1. Information Content (Surprise)

Information content

I(x) = -\log p(x)

Three axioms uniquely determine this formula: certain events carry zero information I(1)=0; rarer events carry more; independent events are additive I(p₁·p₂) = I(p₁)+I(p₂). Only −log p satisfies all three. Log base 2 → bits. Natural log → nats.

Event	p	Surprise (bits)
Fair coin heads	0.5	1.0
Rolling a 6	0.167	2.58
1-in-1000 event	0.001	9.97
Certain event	1.0	0.0

2. Entropy — Average Surprise

Shannon entropy

H(P) = -\sum_x p(x)\log p(x) = \mathbb{E}_P[-\log p(x)]

\text{Fair coin: } H = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1.0 \text{ bit}

\text{Biased coin (99/1): } H = -(0.99\log_2 0.99 + 0.01\log_2 0.01) \approx 0.08 \text{ bits}

→H(P) ≥ 0 always
→H(P) = 0 when outcome is certain
→H(P) maximized when distribution is uniform — maximum uncertainty
→Irreducible: you cannot compress data below H(P) (Shannon's source coding theorem)

3. Cross-Entropy — Cost of Being Wrong

Cross-entropy

H(P, Q) = -\sum_x p(x)\log q(x)

\text{One-hot P collapses to: } H(P,Q) = -\log q(y_{\text{true}})

Measures average surprise when model Q encodes events from true distribution P. Equal to H(P) only when Q=P exactly. Three equivalent justifications for using it as the classification loss:

→Information theory: Minimizes bits wasted encoding reality with your model.
→Maximum likelihood: Minimizing cross-entropy = maximizing likelihood — same objective, two framings.
→Gradient: Gradient of cross-entropy w.r.t. logits after softmax is simply (ŷ − y) — clean and stable.

Cross-entropy = MLE

\mathcal{L} = \prod_i q(y_i) \;\Rightarrow\; \log\mathcal{L} = \sum_i \log q(y_i) \;\Rightarrow\; \text{NLL} = -\sum_i \log q(y_i) = H(P,Q)

4. NLL — Negative Log-Likelihood

NLL

\text{NLL} = -\log P(D|\theta) = -\sum_{i=1}^{N}\log q_\theta(y_i|x_i)

→NLL = Cross-Entropy for classification: Identical formula — MLE framing vs information theory framing.
→NLL = MSE for regression: Under Gaussian noise assumption: taking the log of the Gaussian PDF and negating yields MSE plus a constant. If residuals aren't Gaussian, MSE is the wrong loss.

NLL = MSE under Gaussian assumption

q_\theta(y_i|x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(y_i-\hat{y}_i)^2}{2\sigma^2}\right)

\text{NLL} = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i-\hat{y}_i)^2

5. KL Divergence — Distance Between Distributions

KL divergence

D_{KL}(P\|Q) = \sum_x p(x)\log\frac{p(x)}{q(x)} = H(P,Q) - H(P)

\min H(P,Q) \;\equiv\; \min D_{KL}(P\|Q) \quad \text{(H(P) is constant during training)}

Extra surprise from using Q instead of P. Not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) — not a true distance metric.

Asymmetry proof — P=[0.99,0.01], Q=[0.5,0.5]

D_{KL}(P\|Q) = 0.99\log\tfrac{0.99}{0.5} + 0.01\log\tfrac{0.01}{0.5} \approx 0.637 \text{ nats}

D_{KL}(Q\|P) = 0.5\log\tfrac{0.5}{0.99} + 0.5\log\tfrac{0.5}{0.01} \approx 1.615 \text{ nats}

→min D_KL(P‖Q): Q covers all modes of P — mean seeking, spreads mass everywhere P has mass.
→min D_KL(Q‖P): Q collapses onto one mode of P — mode seeking, ignores regions where Q is near zero. VAEs minimize this direction.

6. Conditional Entropy & Information Gain

Conditional entropy

H(Y|X) = H(X,Y) - H(X)

\text{IG}(X) = H(Y) - H(Y|X)

Decision trees: at each node pick the feature X that maximizes IG — maximum label uncertainty removed. H(Y|X)=0 when X fully determines Y; H(Y|X)=H(Y) when X tells you nothing.

7. Mutual Information

Mutual information

I(X;Y) = H(X) - H(X|Y) = H(X)+H(Y)-H(X,Y) = \sum_{x,y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}

→I(X;Y) ≥ 0 always
→I(X;Y) = 0 iff X and Y are independent — joint factorizes into product of marginals
→I(X;Y) = I(Y;X) — symmetric, unlike KL
→I(X;X) = H(X) — a variable shares all information with itself

Pearson Correlation

Measures linear relationship between X and Y:

Pearson ρ

\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i-\bar{y})^2}}

→Range: [−1, 1] — ρ=1 perfect positive linear, ρ=0 no linear relationship (nonlinear dependence can still exist)
→Assumes continuous variables; sensitive to outliers

Spearman Correlation

Pearson applied to ranks instead of raw values:

Spearman ρₛ

\rho_s = 1 - \frac{6\sum_{i=1}^{N}d_i^2}{N(N^2-1)}, \quad d_i = \text{rank}(x_i) - \text{rank}(y_i)

→Catches any monotonic relationship — if X goes up, Y consistently goes up or down, regardless of linearity
→Robust to outliers (ranks are bounded); works on ordinal data
→Example: Y = X³ — perfectly monotonic, ρₛ = 1.0, but Pearson ρ < 1 because the relationship is nonlinear

Feature selection comparison

Method	Formula	Detects	Categorical?	Outlier Robust?
Pearson	Cov(X,Y) / σₓσᵧ	Linear only	No	No
Spearman	Pearson on ranks	Monotonic	Ordinal only	Yes
Mutual Information	Σ p(x,y) log p(x,y)/p(x)p(y)	Any dependency	Yes	Yes

8. Label Smoothing — Entropy as Regularization

Label smoothing

\tilde{p}(k) = (1-\varepsilon)\cdot\mathbf{1}[k=y] + \frac{\varepsilon}{K}

\mathcal{L} = (1-\varepsilon)\cdot H(P_{\text{hard}},Q) + \varepsilon\cdot H(P_{\text{uniform}},Q)

Hard one-hot targets have H(P)=0 — cross-entropy pushes logits toward infinity, causing overconfidence. Label smoothing (ε=0.1) softens targets: [0,0,1,0] → [0.025, 0.025, 0.925, 0.025]. The second term directly penalizes overconfident predictions — regularization through entropy. Prevents logits growing unbounded, improves calibration, reduces train/inference gap.

Everything connected in one line

H(P,Q) = H(P) + D_{KL}(P\|Q)

Cross-entropy = irreducible uncertainty + cost of model being wrong. Training minimizes the second term. H(P) is the floor you can never beat.

Section 12

Maximum Likelihood Estimation (MLE)

Finds the parameter values that make the observed data most probable. The foundation of most ML training — log loss, MSE, and cross-entropy all derive from MLE.

MLE core idea

\theta^* = \arg\max_\theta P(X \mid \theta)

L(\theta) = \prod_i P(x_i \mid \theta) \quad \text{(likelihood)}

\ell(\theta) = \sum_i \log P(x_i \mid \theta) \quad \text{(log-likelihood)}

MLE for Gaussian — Deriving Mean and Variance

MLE for Gaussian

\mu_{\text{MLE}} = \frac{1}{n}\sum_i x_i = \bar{x}

\sigma^2_{\text{MLE}} = \frac{1}{n}\sum_i (x_i - \bar{x})^2 \quad \text{(biased; unbiased uses }n{-}1\text{)}

MLE for Bernoulli — Why Log Loss

MLE for Bernoulli → binary cross-entropy

P(y_i \mid x_i) = \hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i} \quad \text{(Bernoulli)}

\ell(w) = \sum_i \bigl[y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr]

\text{Loss} = -\frac{1}{n}\sum_i \bigl[y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr] = \text{Binary Cross-Entropy}

Properties of MLE

→Consistency: As n→∞, θ_MLE → θ_true. Converges to truth with more data.
→Asymptotic normality: For large n, θ_MLE ~ N(θ_true, I(θ)⁻¹) where I is Fisher information.
→Efficiency: Achieves the Cramér-Rao lower bound — lowest possible variance among unbiased estimators.
→Invariance: If θ_MLE is MLE of θ, then g(θ_MLE) is MLE of g(θ).

▸ Interview Answer

MLE finds parameters that maximize the probability of observing the training data. It connects directly to loss functions: MLE on Gaussian data gives MSE loss, MLE on Bernoulli data gives log loss. Log-likelihood is used in practice because products of probabilities underflow numerically and logs turn products into sums. MLE is consistent and asymptotically efficient but can overfit on small datasets — adding a prior gives MAP estimation.

Section 13

Hypothesis Testing

A formal framework for deciding whether observed data provides enough evidence to reject a default assumption. The foundation of A/B testing and scientific inference.

The 5-step framework

Step 1: State hypotheses
  H₀ (null):        default assumption — no effect, no difference
  H₁ (alternative): what you want to prove

Step 2: Choose significance level α
  α = P(reject H₀ | H₀ is true) = false positive rate
  Standard: α = 0.05

Step 3: Compute test statistic
  Summarize data into a single number.
  Under H₀ this statistic follows a known distribution.

Step 4: Compute p-value
  p-value = P(observing data this extreme | H₀ is true)
  Small p-value → data unlikely under H₀ → evidence against H₀

Step 5: Decision
  p-value < α  → reject H₀  (statistically significant)
  p-value ≥ α  → fail to reject H₀  (not enough evidence)

IMPORTANT: failing to reject H₀ ≠ proving H₀

Type I and Type II Errors

	Type I Error (α)	Type II Error (β)
Also called	False Positive	False Negative
Definition	Reject H₀ when H₀ is true	Fail to reject H₀ when H₁ is true
Probability	α (you choose this)	β (depends on effect size, n)
Fix	Lower α	Increase sample size
Example	Flag clean supplier as fraudulent	Miss fraudulent supplier

Power = 1 − β

P(correctly rejecting H₀ when H₁ is true). Increases with larger n, larger true effect size, lower data variance. Standard: aim for power ≥ 0.80.

Common Test Statistics

Key test statistics

Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1) \quad \text{(Z-test, known }\sigma\text{)}

t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t(n-1) \quad \text{(t-test, unknown }\sigma\text{)}

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \quad \text{(two-sample t-test)}

Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1+1/n_2)}},\quad \hat{p} = \frac{x_1+x_2}{n_1+n_2} \quad \text{(proportions)}

\chi^2 = \sum \frac{(\text{observed}-\text{expected})^2}{\text{expected}} \sim \chi^2(\text{df}) \quad \text{(chi-squared)}

p-value Misconception

p-value is NOT the probability that H₀ is true. It is P(data this extreme | H₀ is true). A p-value of 0.03 does not mean 97% chance the effect is real — it means the data would occur 3% of the time if H₀ were true.

Section 14

Confidence Intervals

A confidence interval is a random interval constructed from sample data that contains the true population parameter with a specified probability — the confidence level 1−α.

Correct vs Wrong Interpretation

✗ Wrong

"There is a 95% probability that μ is in this interval." — μ is fixed (not random). This interval either contains it or it doesn't.

✓ Correct

If you repeated the experiment infinitely many times, 100(1−α)% of the constructed intervals would contain the true μ. The interval is random, not the parameter.

Construction from CLT

For large n, by CLT the standardized sample mean is approximately standard normal. We find zᵉ⁄₂ such that the central probability equals 1−α, then rearrange to isolate μ:

Derivation — isolating μ

\frac{\bar{x}-\mu}{s/\sqrt{n}} \sim \mathcal{N}(0,1)

P\!\left(-z_{\alpha/2} \leq \frac{\bar{x}-\mu}{s/\sqrt{n}} \leq z_{\alpha/2}\right) = 1-\alpha

\boxed{\text{CI} = \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}}

Worked Example — NFL Punter

A punter historically averages 41 yards/punt with SD 8 yards. What is the probability their next 40 punts average at least 45 yards?

\bar{X}_{40} = \frac{1}{40}\sum_{i=1}^{40}X_i, \quad \mathbb{E}[\bar{X}_{40}]=41, \quad \text{SD}(\bar{X}_{40})=\frac{8}{\sqrt{40}}

P(\bar{X}_{40}\geq 45) = P\!\left(\frac{\bar{X}_{40}-41}{8/\sqrt{40}} \geq \frac{45-41}{8/\sqrt{40}}\right) = P(Z \geq 3.16)

= 1 - \Phi(3.16) \approx 7.8\times10^{-4} \quad \text{(1 in 1\,000 chance)}

The same standardization step — divide by SE = σ/√n, subtract μ — is exactly what CI construction inverts: instead of computing a probability from a fixed x̄, the CI finds all μ values consistent with the observed x̄.

Critical values for common levels

Confidence Level	α	zᵉ⁄₂
90%	0.10	1.645
95%	0.05	1.960
99%	0.01	2.576

Small Samples — t-Distribution

When n is small, σ is unknown and estimated by s — adding uncertainty. The statistic follows a t-distribution with n−1 degrees of freedom:

t-interval (small n or unknown σ)

\frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1}

\text{CI} = \bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}

tₙ₋₁ has heavier tails than N(0,1) — reflecting the extra uncertainty from estimating σ with s. As n→∞, tₙ₋₁ → N(0,1) and the two intervals converge.

Width of the CI — Three Levers

Width formula

\text{Width} = 2\,z_{\alpha/2}\frac{s}{\sqrt{n}}

→Confidence level ↑ → z_{α/2} larger → wider interval. More confidence costs precision.
→Sample size n ↑ → width shrinks at 1/√n. To halve the width, quadruple n.
→Population variance σ² ↑ → wider interval. More variable population needs more data.

Connection to Hypothesis Testing — Duality

CI and hypothesis testing are dual formulations of the same question. A two-sided test of H₀: μ = μ₀ at level α rejects iff μ₀ falls outside the 1−α CI:

Reject H₀ ↔ μ₀ outside CI

\text{Reject }H_0 \iff \mu_0 \notin \left[\bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}}\right]

\iff |z| = \frac{|\bar{x}-\mu_0|}{s/\sqrt{n}} > z_{\alpha/2} \iff p\text{-value} < \alpha

The CI is the set of all μ₀ values you would fail to reject at level α. They encode identical information — different presentations of the same inference.

Bootstrap Confidence Intervals

When CLT is unreliable — small n, heavy tails, complex statistics like AUC or F1 — use the bootstrap:

→Resample B datasets of size n with replacement from your sample.
→Compute the statistic θ̂⁽ᵇ⁾ on each resample.
→The empirical distribution of {θ̂⁽ᵇ⁾} approximates the sampling distribution.

Percentile bootstrap CI

\text{CI} = \left[\hat{\theta}^{(\alpha/2 \cdot B)},\; \hat{\theta}^{((1-\alpha/2)\cdot B)}\right]

Take the α/2 and 1−α/2 quantiles of the bootstrap distribution. Justification: by the Glivenko-Cantelli theorem, the empirical CDF F̂ₙ → F uniformly as n→∞, so the bootstrap distribution converges to the true sampling distribution.

Frequentist CI vs Bayesian Credible Interval

	Frequentist CI	Bayesian Credible Interval
What is random	The interval	The parameter
Interpretation	95% of such intervals contain μ	P(μ ∈ interval \| data) = 0.95
Requires prior	No	Yes
Correct statement	"The procedure captures μ 95% of the time"	"μ is in this interval with 95% probability"

The Bayesian credible interval is what most people intuitively think a frequentist CI means — but it requires a prior and gives a genuinely probabilistic statement about the parameter.

▸ Interview Answer

A CI is a random interval — the parameter is fixed, the interval varies across samples. 95% CI means the construction procedure captures μ 95% of the time, not that this specific interval has 95% probability. Use z-intervals for large n; t-intervals for small n or unknown σ (heavier tails, n−1 df). Width = 2z·s/√n — to halve it, quadruple n. CIs and two-sided hypothesis tests are duals: reject H₀ at α ↔ μ₀ outside 1−α CI. Bootstrap CIs work for any statistic when CLT fails.

Q&A

Section 15

A/B Testing in Production

The gold standard for measuring causal effects of changes. Connects directly to hypothesis testing and is the primary tool for evaluating ML model changes in production.

Full pipeline

Step 1: Define hypothesis
  H₀: new model has same conversion rate as old
  H₁: new model has different conversion rate

Step 2: Choose metrics
  Primary: conversion rate, CTR, revenue (one only)
  Guardrail: latency, error rate (must not degrade)

Step 3: Calculate required sample size
  n = 2·(z_α + z_β)²·p̄(1-p̄) / δ²
  where p̄ = baseline rate, δ = minimum detectable effect

  Example: p̄=0.10, δ=0.01, z_α=1.96, z_β=0.84
  n = 2 × 7.84 × 0.09 / 0.0001 ≈ 14,112 per group

Step 4: Run experiment
  Randomly assign users; same user always gets same variant
  Run for full business cycles (at least 2 weeks)

Step 5: Analyze — compute p-value, CI, check guardrails
Step 6: Ship if p < α AND guardrails ok

Common Pitfalls

→Peeking problem: Checking results before planned sample size inflates false positive rate. Checking 5 times at α=0.05 → true FPR ~23%. Fix: pre-register stopping rule, or use sequential testing.
→Multiple testing: Testing 20 metrics at α=0.05 → expect 1 false positive by chance. Fix: Bonferroni correction (α/k) or use one primary metric.
→Novelty effect: Users engage more with anything new. Treatment looks better in week 1, returns to baseline by week 3. Fix: run at least 2 full business cycles.
→Network effects: Users in control and treatment interact. Randomizing individuals violates independence. Fix: cluster randomization.
→Simpson's paradox: Aggregate results can reverse when segmented due to unequal group sizes. Fix: stratified randomization.

A/B Testing ML Models Specifically

→Shadow mode first: Run new model in parallel, log both outputs, compare offline — no user impact, catches bugs before live traffic.
→Canary deployment: Route 5% traffic to new model. Monitor latency, error rate, prediction distribution. Expand if guardrails hold.
→Holdout groups: Keep permanent 10% holdout on old model to measure long-term impact after full rollout.
→Offline ≠ online: Offline AUC/F1 on test set doesn't always predict online impact. Always run the A/B test even if offline metrics look great.

▸ Interview Answer

A/B testing requires: pre-specifying hypotheses and metrics before looking at data, calculating sample size for desired power (80%) at chosen significance (α=0.05), randomizing correctly (same user always same variant), running for full business cycles to avoid novelty effects, and reporting confidence intervals not just p-values. The peeking problem is the most common mistake — checking early inflates false positives because you're running multiple implicit tests.

Section 16

Bayesian Inference

A framework for updating beliefs using data. Unlike frequentist statistics (fixed estimates), Bayesian inference gives a full distribution over parameters — capturing uncertainty explicitly.

Frequentist vs Bayesian

Frequentist	Bayesian
Parameters are fixed, unknown constants	Parameters are random variables with distributions
Probability = long-run frequency of events	Probability = degree of belief
Point estimates + confidence intervals	Full posterior distribution
P(data \| θ) — likelihood only	P(θ \| data) ∝ P(data \| θ) · P(θ)
No prior knowledge incorporated	Prior encodes existing knowledge
95% CI: 95% of intervals contain θ	95% credible interval: P(θ in interval \| data) = 0.95

The Bayesian update

Posterior ∝ Likelihood × Prior

P(θ | X) = P(X | θ) · P(θ) / P(X)   ← P(X) is normalizing constant

Sequential update:
  Day 1: P(θ | X₁) ∝ P(X₁|θ) · P(θ)
  Day 2: P(θ | X₁,X₂) ∝ P(X₂|θ) · P(θ|X₁)

Bayesian inference is naturally online and sequential.

Conjugate Priors

A conjugate prior gives a posterior with the same functional form — enabling closed-form updates without numerical integration.

Key conjugate pairs

BETA–BERNOULLI (most important for A/B testing):
  Prior:     θ ~ Beta(α, β)
  Posterior: θ|X ~ Beta(α + successes, β + failures)

  Example: Prior Beta(2,20) → 3 frauds in 50 audits
  Posterior: Beta(5, 67),  mean = 5/72 = 6.9%

GAMMA–POISSON:
  Prior:     λ ~ Gamma(α, β)
  Posterior: λ|X ~ Gamma(α + Σxᵢ, β + n)

GAUSSIAN–GAUSSIAN:
  Prior:     μ ~ N(μ₀, σ₀²)
  Posterior: μ|X ~ N(μₙ, σₙ²)
  μₙ = weighted average of prior mean and data mean

Bayesian A/B Testing

Direct probability statement about which variant is better

After observing data:
  A: 1000 visitors, 100 conversions → θ_A ~ Beta(101, 901)
  B: 1000 visitors, 120 conversions → θ_B ~ Beta(121, 881)

Compute P(θ_B > θ_A | data) by Monte Carlo:
  samples_A = np.random.beta(101, 901, size=100000)
  samples_B = np.random.beta(121, 881, size=100000)
  P_B_better = (samples_B > samples_A).mean()
  # ≈ 0.87 → "B is better with 87% probability"

Also compute:
  Expected lift = (samples_B - samples_A).mean()
  P(lift > 1%) = (samples_B - samples_A > 0.01).mean()

Advantages over frequentist:
  Direct probability statement: "B is better with 87% probability"
  No sample size pre-specification needed
  Can stop early with full probability interpretation

MAP — Maximum A Posteriori

MAP = MLE + regularization

\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid X) = \arg\max_\theta P(X \mid \theta)\cdot P(\theta)

\text{Gaussian prior} \Rightarrow L_2 \text{ regularization (Ridge)}

\text{Laplace prior} \Rightarrow L_1 \text{ regularization (Lasso)}

▸ Interview Answer

Bayesian inference treats parameters as distributions, not fixed unknowns. Start with a prior belief P(θ), observe data, update to posterior P(θ|data) ∝ P(data|θ)·P(θ). Conjugate priors (Beta-Bernoulli, Gaussian-Gaussian) give closed-form posteriors. Bayesian A/B testing gives direct probability statements ("variant B is better with 87% probability") unlike frequentist testing. MAP = MLE with regularization — Gaussian prior = L2, Laplace prior = L1.

Section 17

Interview Q&A — Quick Reference

Practice answering each in under 90 seconds.

Q1

What is the difference between p-value and confidence interval?

A p-value answers: 'Is the effect statistically different from zero?' (yes/no). A confidence interval answers: 'How large is the effect and how uncertain are we?' CI is more informative — a significant p-value on its own doesn't tell you if the effect is practically meaningful. Always report both: 'The conversion rate increased by 1.2% (95% CI: 0.3%–2.1%, p=0.008)'.

Q2

How do you determine sample size for an A/B test?

Specify: significance level α (usually 0.05), desired power 1-β (usually 0.80), baseline rate p̄, and minimum detectable effect δ. Formula: n = 2(z_α + z_β)²p̄(1-p̄)/δ². The MDE is the key practical decision — detecting a 1% lift requires far fewer samples than detecting 0.1%. Smaller effects require exponentially more samples.

Q3

What is the difference between MLE and MAP?

MLE maximizes the likelihood P(data|θ) — no prior. MAP maximizes the posterior P(θ|data) ∝ P(data|θ)·P(θ) — includes a prior. MAP is MLE with regularization: a Gaussian prior gives L2 regularization, a Laplace prior gives L1. With infinite data both converge to the same answer. With limited data, MAP gives better-regularized estimates.

Q4

When would you use Bayesian A/B testing instead of frequentist?

Bayesian when: you want direct probability statements about which variant is better, you have strong prior knowledge to incorporate, you need to stop early with valid inference, or you want to quantify expected lift not just significance. Frequentist when: you need strict Type I error control for regulatory purposes, the team is more familiar with p-values, or you're running many simultaneous tests where error rate control is critical.

Q5

Explain the Central Limit Theorem and why it matters.

CLT states that the sample mean X̄ of n iid random variables with mean μ and variance σ² converges in distribution to N(μ, σ²/n) as n→∞, regardless of the original distribution. This is why hypothesis tests use normal or t distributions even for non-normal data (large n), confidence intervals are symmetric, and MLE estimates are approximately Gaussian for large samples. With n≥30 you can almost always use the normal approximation.

Section 18

Quick Reference Cheat Sheet

Distributions

▸Bernoulli(p): single binary. E=p, Var=p(1-p)
▸Binomial(N,p): count in N trials. E=Np, Var=Np(1-p)
▸Poisson(λ): rare events. E=λ, Var=λ (both equal λ)
▸Gaussian(μ,σ²): bell curve. 68-95-99.7 rule

Key Formulas

▸Bayes: P(A|B) = P(B|A)·P(A) / P(B)
▸MLE: θ* = argmax Σᵢ log P(xᵢ|θ)
▸MAP: θ* = argmax Σᵢ log P(xᵢ|θ) + log P(θ)
▸95% CI for mean: x̄ ± 1.96·σ/√n
▸95% CI for proportion: p̂ ± 1.96·√(p̂(1-p̂)/n)

A/B Test Checklist

▸Pre-register: hypothesis, metric, n, α — before peeking
▸One primary metric (avoid multiple testing)
▸Minimum 2 business cycles (novelty effect)
▸Same user → same variant always (hash on user_id)
▸Report CI + effect size, not just p-value
▸Check guardrail metrics (latency, errors)

Bayesian One-Liners

▸Prior + Likelihood → Posterior
▸Beta prior + Bernoulli likelihood → Beta posterior
▸MAP = MLE + regularization
▸Gaussian prior = L2 regularization
▸Laplace prior = L1 regularization
▸Bayesian CI = credible interval: P(θ in interval | data) = 95%

Hypothesis Testing

▸Type I (α): false positive — reject true H₀
▸Type II (β): false negative — miss real effect
▸Power = 1-β: target ≥ 0.80
▸p-value ≠ P(H₀ is true)
▸Fail to reject H₀ ≠ accept H₀

Common Mistakes

▸Base rate fallacy → ignoring the prior
▸Peeking → inflates false positive rate
▸p-value misinterpretation → not P(H₀ true)
▸CI misinterpretation → not a probability about one interval
▸Novelty effect → run 2+ business cycles