> study notes · amazon applied scientist ii

Classical Machine Learning

The algorithms that still dominate tabular data and production ML: decision trees, random forests, gradient boosting, SVMs, logistic regression, regularization, bias-variance, cross-validation, and feature engineering.

Linear Regression
Logistic Regression
Decision Trees
Random Forests
XGBoost
LightGBM
SVMs
L1/L2
Bias-Variance
Cross-Validation
Feature Engineering

> Table of Contents

01Linear Regression 02Logistic Regression 03Decision Trees 04Random Forests 05Gradient Boosting 06Support Vector Machines 07Regularization 08Bias-Variance Trade-off 09Cross-Validation 10Feature Engineering 11Classical ML vs Deep Learning 12Cheat Sheet

Section 1

Linear Regression

Models a continuous output as a linear combination of input features. The simplest supervised learning algorithm and the foundation of logistic regression, regularization, and most of the statistical thinking behind ML.

The model

\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b \;=\; w^\top x + b

\text{Simple (1 feature):}\quad \hat{y} = wx + b \;\to\; \text{a straight line}

\text{Multiple (}n\text{ features):}\quad \hat{y} = w^\top x + b \;\to\; \text{hyperplane in }n{+}1\text{ dims}

Loss Function — MSE vs MAE

Loss functions

\text{MSE} = \frac{1}{N} \sum_i (y_i - \hat{y}_i)^2

\text{MAE} = \frac{1}{N} \sum_i |y_i - \hat{y}_i|

What each optimizes toward

	MSE	MAE
Optimal estimator	Mean of target distribution	Median of target distribution
Sensitivity to outliers	High — squares large errors	Low — linear penalty
Gradient behavior	Smooth everywhere	Non-differentiable at zero
Optimization	Easier — smooth gradients	Harder — needs subgradients or IRLS

Gradient behavior matters for linear regression specifically:

MSE gradient (smooth)

\frac{\partial \,\text{MSE}}{\partial \hat{y}} = -\frac{2}{n}(y_i - \hat{y}_i)

MAE gradient is discontinuous at zero — requires subgradient methods or iterative reweighted least squares (IRLS). This makes closed-form solutions unavailable for MAE.

MSE — closed-form normal equation

\hat{\theta} = (X^\top X)^{-1} X^\top y

MAE has no closed-form equivalent.

Huber loss — the practical middle ground

Huber Loss

L_\delta(y, \hat{y}) = \begin{cases} \tfrac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta\,|y - \hat{y}| - \tfrac{1}{2}\delta^2 & \text{otherwise} \end{cases}

Quadratic for small errors like MSE, linear for large errors like MAE. Smooth gradients + outlier robustness. δ is a tunable threshold that defines what counts as a "small" error.

Solving — Two Approaches

Normal Equation vs Gradient Descent

\textbf{Approach 1 — Normal Equation:}\quad w^* = (X^\top X)^{-1} X^\top y

\textbf{Approach 2 — Gradient Descent:}\quad \frac{\partial \text{MSE}}{\partial w} = \frac{-2}{N} X^\top(y - Xw)

w \leftarrow w - \eta \cdot \frac{\partial \text{MSE}}{\partial w}, \qquad b \leftarrow b - \eta \cdot \frac{\partial \text{MSE}}{\partial b}

Normal Equation — Full Derivation

Where X ∈ ℝⁿˣᵈ is the design matrix, y ∈ ℝⁿ the target vector, and θ ∈ ℝᵈ the parameters.

Step 1 — Setup: minimize MSE loss

L(\theta) = \|y - X\theta\|^2 = (y - X\theta)^\top(y - X\theta)

Step 2 — Expand

L(\theta) = y^\top y - y^\top X\theta - (X\theta)^\top y + (X\theta)^\top X\theta

= y^\top y - 2\theta^\top X^\top y + \theta^\top X^\top X\theta

(Since y⊤Xθ is a scalar it equals its own transpose, so the two middle terms merge.)

Step 3 — Gradient w.r.t. θ (using ∂/∂θ(θᵀAθ) = 2Aθ for symmetric A)

\frac{\partial L}{\partial \theta} = -2X^\top y + 2X^\top X\theta

Step 4 — Set to zero → Normal Equations

-2X^\top y + 2X^\top X\theta = 0

X^\top X\theta = X^\top y

Step 5 — Solve (assuming XᵀX invertible)

\boxed{\hat{\theta} = (X^\top X)^{-1} X^\top y}

Geometric intuition

X⊤X is invertible only when X has full column rank — no multicollinearity. Geometrically, ŷ = Xθ̂ is the orthogonal projection of y onto the column space of X. The residual y − Xθ̂ is perpendicular to every column of X, which is exactly what the normal equations encode:

X^\top(y - X\hat{\theta}) = 0

Key Assumptions (LINE)

→Linearity: Relationship between X and y is linear. Check: residual plot should show no pattern.
→Independence: Observations are independent. Violated by time series data → use time series models.
→Homoscedasticity: Constant variance of residuals across all X values. Fix if violated: log-transform the target y.
→No Multicollinearity: Features should not be highly correlated. Check: Variance Inflation Factor (VIF > 10 = problem). Fix: remove one correlated feature, or use Ridge.
→Normality of residuals: Required for valid hypothesis tests on coefficients. Less important for pure prediction tasks.

Evaluation Metrics

Regression metrics

\text{MSE} = \frac{1}{N}\sum(y - \hat{y})^2

\text{RMSE} = \sqrt{\text{MSE}}

\text{MAE} = \frac{1}{N}\sum|y - \hat{y}|

R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \quad SS_{\text{res}} = \sum(y-\hat{y})^2,\quad SS_{\text{tot}} = \sum(y-\bar{y})^2

R^2_{\text{adj}} = 1 - \frac{(1-R^2)(N-1)}{N-p-1} \quad (p = \text{num features})

Linear vs Logistic Regression

Linear Regression	Logistic Regression
Predicts a continuous value	Predicts a probability (0 to 1)
Output: ŷ = wᵀx + b (unbounded)	Output: σ(wᵀx + b) ∈ (0, 1)
Loss: Mean Squared Error	Loss: Binary Cross-Entropy
Use: predicting emissions values, revenue	Use: classifying high/low risk suppliers
Assumptions: normality, homoscedasticity	Fewer distributional assumptions

▸ Interview Answer

Linear regression models continuous outputs as a weighted sum of features. The MSE loss is convex so gradient descent always finds the global minimum. For small feature sets the Normal Equation gives an exact closed-form solution. Key diagnostics: check residual plots for linearity, VIF for multicollinearity, and use Adjusted R² when comparing models. For sustainability use cases — predicting supplier emissions values, estimating carbon footprint from product attributes — linear regression with feature engineering is often a strong, interpretable baseline.

Q&A

Section 2

Logistic Regression

A linear classifier that models the probability of a binary outcome. Despite the name it is a classification algorithm. It is the foundation of many neural network concepts and the standard strong baseline.

How it works — the math

z = w^\top x + b

\sigma(z) = \frac{1}{1 + e^{-z}} \;\in\; (0,1) \;=\; P(y=1 \mid x)

L = -\bigl[y \log(\hat{y}) + (1-y)\log(1-\hat{y})\bigr] \quad \text{(Binary Cross-Entropy)}

\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x

Key Properties

→Linear decision boundary: Can only separate classes with a straight line (hyperplane in N dims). Cannot learn XOR or circular boundaries without feature engineering.
→Probabilistic output: Unlike SVMs, gives calibrated probabilities. Useful when you need P(class), not just the class label.
→Interpretable: Each weight wᵢ tells you how much feature i contributes to the log-odds. Can audit which features drive predictions.
→Fast to train: Convex loss function — gradient descent always finds the global minimum. No local optima.

Decision Boundary

The decision boundary is the set of points where ŷ = 0.5, which means σ(z) = 0.5 → z = 0 → Xw + b = 0. This is a hyperplane in feature space — a line in 2D, a plane in 3D.

Why it's called a linear classifier

\text{Decision boundary: }w^\top x + b = 0 \;\rightarrow\; \text{a hyperplane}

\sigma(w^\top x+b)=0.5 \;\Longleftrightarrow\; w^\top x+b=0

\text{The sigmoid is nonlinear, but the boundary it creates is linear.}

\text{Can separate: linearly separable classes}\quad\checkmark

\text{Cannot separate: XOR, concentric circles, spirals}\quad\times

Multiclass Extension — Softmax

Softmax regression (K > 2 classes)

P(y=k \mid x) = \frac{\exp(w_k^\top x)}{\sum_j \exp(w_j^\top x)}

L = -\sum_k y_k \cdot \log P(y=k \mid x) \quad \text{(Categorical cross-entropy)}

▸ Interview Answer

Logistic regression models P(y=1|x) by applying a sigmoid to a linear combination of features. The log loss penalizes confident wrong predictions. It is convex so always finds the global optimum. I use it as a strong baseline — if it performs well, the problem is likely linearly separable and you don't need a complex model.

Assumptions

Fewer than linear regression — no normally distributed residuals, no homoscedasticity. But it has its own set.

Binary outcome

Dependent variable must be categorical (binary for standard logistic regression). Can't predict continuous values like emissions tonnage.

Independence of observations

Each example must be independent. Violated by time series, clustered data (suppliers within regions), or repeated measures. Fix: mixed effects models or account for the grouping structure.

Linearity of log-odds

The key assumption. Linear relationship between features and the log-odds of the outcome — not the probability directly. Check by plotting log-odds against each continuous feature. Fix by adding polynomial terms or binning if non-linear.

\log\frac{P(y=1)}{P(y=0)} = w^\top x + b \quad\leftarrow\text{log-odds is linear in }x

\text{Probability: S-shaped (sigmoid) in }x \quad\text{Log-odds: linear in }x

No multicollinearity

Correlated features cause unstable coefficient estimates — the model can't determine which deserves credit. Check VIF > 10. Fix: drop one, use Ridge (L2), or PCA.

No extreme outliers

A single extreme point can heavily shift the decision boundary. Check: z-scores (|z| > 3). Fix: remove true errors, robust scaling, or winsorization.

Large enough sample size

Rule of thumb: 10–20 events (minority class samples) per predictor variable. With D features need at least 10D minority class examples. Use L2 regularization to stabilize estimates when N is small.

Assumption	Linear Reg	Logistic Reg
Binary/categorical outcome	✗	✓ required
Independence	✓	✓
Linear relationship	in y	in log-odds
Normally distributed errors	✓	✗ not needed
Homoscedasticity	✓	✗ not needed
No multicollinearity	✓	✓
No extreme outliers	✓	✓
Large N	✓	✓

▸ Interview One-Liner

Logistic regression assumes: binary outcome, independent observations, linearity in the log-odds (not the probability), no severe multicollinearity, no extreme outliers, and sufficient sample size — roughly 10–20 events per feature. It does not require normally distributed residuals or homoscedasticity — those are linear regression assumptions. The most commonly violated in practice is log-odds linearity, which you check by plotting log-odds against each continuous feature and fix by adding polynomial terms.

Why Log Loss? — The MLE Derivation

Log loss isn't an arbitrary choice. It is the unique loss that follows from modeling labels as Bernoulli random variables and applying maximum likelihood estimation.

Step 1 — Probabilistic model: both cases in one expression

p(y \mid x) = \hat{y}^{\,y}(1-\hat{y})^{1-y}

y=1:\quad \hat{y}^1(1-\hat{y})^0 = \hat{y} \quad\checkmark

y=0:\quad \hat{y}^0(1-\hat{y})^1 = 1-\hat{y} \quad\checkmark

Step 2 — Joint likelihood over N i.i.d. examples

\mathcal{L}(w) = \prod_i p(y_i \mid x_i) = \prod_i \hat{y}_i^{\,y_i}(1-\hat{y}_i)^{1-y_i}

Step 3 — Log-likelihood (log turns ∏ into Σ)

\log\mathcal{L}(w) = \sum_i \log\bigl[\hat{y}_i^{\,y_i}(1-\hat{y}_i)^{1-y_i}\bigr] \qquad \log(\textstyle\prod)=\textstyle\sum\log

= \sum_i \bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr] \qquad \log(a^b)=b\log a

Step 4 — Negate and average to get a minimization objective

L(w) = -\frac{1}{N}\sum_i\bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr] \quad\checkmark\;\text{binary cross-entropy}

Intuition — penalty grows for confident mistakes

\hat{y}=0.99\;\text{(correct, confident)}\;\rightarrow\;{-\log(0.99)}\approx 0.01\;\text{(tiny)}

\hat{y}=0.50\;\text{(correct, uncertain)}\;\rightarrow\;{-\log(0.50)}\approx 0.69\;\text{(moderate)}

\hat{y}=0.01\;\text{(wrong,}\;\text{confident)}\;\rightarrow\;{-\log(0.01)}\approx 4.60\;\text{(huge)}

\hat{y}\to 0\text{ when }y=1:\;-\log(\hat{y})\to+\infty

\hat{y}\to 1\text{ when }y=0:\;-\log(1-\hat{y})\to+\infty

The full chain

\text{Bernoulli: }p(y\mid x)=\hat{y}^y(1-\hat{y})^{1-y}

\downarrow

\mathcal{L}(w)=\prod_i\hat{y}_i^{y_i}(1-\hat{y}_i)^{1-y_i}

\downarrow

\log\mathcal{L}(w)=\sum_i[y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i)]

\downarrow

L=-\tfrac{1}{N}\sum_i[y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i)]

▸ Interview One-Liner

MLE asks: what weights make the observed training data most probable? If we model each label as a Bernoulli random variable with P(y=1|x) = σ(wᵀx), the joint likelihood of all N observations is a product of Bernoullis. Taking the log turns that product into a sum. Negating to convert maximization to minimization gives exactly binary cross-entropy. Log loss isn't an arbitrary choice — it's the unique loss that follows from assuming Bernoulli labels and applying MLE.

Why MSE Is Non-Convex for Logistic Regression

Plugging the sigmoid prediction directly into MSE seems natural but breaks convexity — this is why logistic regression uses cross-entropy instead.

MSE loss with a sigmoid prediction

\hat{y} = \sigma(\theta^\top x) = \frac{1}{1+e^{-\theta^\top x}}

L(\theta) = \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - \sigma(\theta^\top x_i)\bigr)^2

σ is a nonlinear function of θ. Composing a quadratic loss with a nonlinear function destroys convexity — the resulting surface can have multiple local minima that gradient descent gets stuck in.

Proof — the Hessian is not PSD everywhere

Convexity requires the Hessian H = ∇²L(θ) to be positive semi-definite (all eigenvalues ≥ 0) at every point. Expanding the second derivative of σ:

Second derivative of the sigmoid

\frac{\partial \sigma}{\partial \theta} = \sigma(1-\sigma)\cdot x

\frac{\partial^2 \sigma}{\partial \theta^2} = \sigma(1-\sigma)(1-2\sigma)\cdot xx^\top

The factor (1 − 2σ) flips sign depending on whether σ > 0.5 or σ < 0.5. That sign flip means the Hessian is not guaranteed PSD everywhere — convexity breaks. Geometrically: linear-regression MSE is a single bowl with one global minimum reachable from any starting point; MSE-on-sigmoid has ridges and local minima that trap gradient descent depending on where it starts.

Why Cross-Entropy Fixes It

Cross-entropy Hessian — always PSD

L(\theta) = -\frac{1}{m}\sum_{i=1}^m \bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr]

H = \frac{1}{m}X^\top S X, \quad S = \text{diag}\bigl(\hat{y}_i(1-\hat{y}_i)\bigr)

The log and sigmoid cancel each other's nonlinearity elegantly — every diagonal entry of S is positive since ŷᵢ ∈ (0,1), so XᵗSX is positive semi-definite by construction: guaranteed convex, one global minimum, gradient descent always converges.

The log-sigmoid cancellation, step by step

Work with the raw logit z = θᵗx (before the sigmoid) and one sample with label y ∈ {0,1}:

Step 1 — rewrite log ŷ and log(1−ŷ) purely in terms of z

\log\hat{y} = \log\frac{1}{1+e^{-z}} = -\log(1+e^{-z})

1-\hat{y} = \frac{e^{-z}}{1+e^{-z}} \;\Rightarrow\; \log(1-\hat{y}) = -z - \log(1+e^{-z})

The sigmoid has disappeared from both terms — only logs and exponentials of the raw linear output z remain.

Step 2 — substitute into the loss and simplify

L = -\bigl[y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr]

= y\log(1+e^{-z}) + (1-y)\bigl(z+\log(1+e^{-z})\bigr)

L = \log(1+e^{-z}) + (1-y)z

The sigmoid is completely gone — the loss is now a function of the raw logit z alone (this is the softplus form of log loss).

Step 3 — gradient: where the clean (ŷ − y) form comes from

\frac{\partial L}{\partial z} = \frac{-e^{-z}}{1+e^{-z}} + (1-y) = -(1-\sigma(z)) + (1-y) = \sigma(z) - y = \hat{y}-y

Step 4 — Hessian: proving convexity

\frac{\partial^2 L}{\partial z^2} = \frac{\partial}{\partial z}(\hat{y}-y) = \sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})

\hat{y}(1-\hat{y}) > 0 \quad \text{everywhere, since } \hat{y}\in(0,1)

The log undoes the compressive nonlinearity of the sigmoid. What remains is softplus log(1+e⁻ᶻ) — smooth and convex — plus a linear term (1−y)z, which is trivially convex. The sum of convex functions is convex.

	MSE on σ(θᵗx)	Cross-Entropy
Convex?	No	Yes
Why	Sigmoid nonlinearity makes the Hessian indefinite	Log cancels sigmoid — Hessian σ(1−σ)·xxᵗ is always PSD
Gradient	Nested derivatives of σ — messy	ŷ − y — clean and linear
Optimization	Can get stuck in local minima	Always converges to the global minimum

▸ Interview One-Liner

Plugging σ(θᵗx) into MSE composes a quadratic with a nonlinear sigmoid, and the term (1−2σ) in the second derivative flips sign — so the Hessian isn't PSD everywhere and the loss surface gets local minima. Cross-entropy avoids this because the log and the sigmoid cancel: rewritten in terms of the raw logit z, the loss reduces to softplus log(1+e⁻ᶻ) plus a linear term, both convex. The choice of cross-entropy for classification isn't arbitrary — it's the loss that restores convexity once the output passes through a sigmoid.

Q&A

Section 3

Decision Trees

A decision tree recursively partitions the feature space by asking binary questions. Each internal node is a feature threshold; each leaf is a prediction.

Splitting Criterion

Gini impurity (sklearn default)

\text{Gini}(S) = 1 - \sum_k p_k^2 \quad (p_k = \text{fraction of class } k)

\text{Split gain} = \text{Gini}(\text{parent}) - \left[\frac{|\text{left}|}{|S|}\text{Gini}(\text{left}) + \frac{|\text{right}|}{|S|}\text{Gini}(\text{right})\right]

Entropy / Information Gain (alternative)

H(S) = -\sum_k p_k \log_2(p_k)

\text{Information Gain} = H(\text{parent}) - \text{weighted } H(\text{children})

Overfitting — The Main Problem

An unconstrained tree grows until every leaf is pure — perfectly memorizing training data.

→max_depth: Limit tree depth (e.g., max_depth=5). Most important hyperparameter.
→min_samples_split: Minimum samples required to split a node.
→min_samples_leaf: Minimum samples required at a leaf.
→min_impurity_decrease: Only split if the gain exceeds this threshold.
→Post-pruning: Grow full tree, then remove branches that don't improve validation performance (cost-complexity pruning).

Strengths and Weaknesses

Strengths	Weaknesses
Interpretable — visualize every decision	High variance — small data changes → very different tree
Handles mixed feature types (numeric + categorical)	Biased toward high-cardinality features
No feature scaling needed	Cannot extrapolate beyond training data range
Fast inference — O(depth) per sample	Only axis-aligned splits — no diagonal boundaries

▸ Interview Answer

Decision trees split the feature space recursively using Gini impurity or information gain. Interpretable and handle mixed types, but suffer from high variance — small changes in data produce very different trees. This is why they are almost always used inside ensemble methods (Random Forests, Gradient Boosting) rather than alone.

Section 4

Random Forests

Random Forests fix the high variance of individual trees by training many trees on different random subsets of data and features, then averaging predictions.

Animation — How Random Forests Work

Two Sources of Randomness

Bagging + feature randomness

1. BAGGING (Bootstrap Aggregating):
   Each tree trains on a bootstrap sample — N samples drawn WITH replacement.
   ~63% of samples appear, ~37% are never seen (out-of-bag).

2. FEATURE RANDOMNESS:
   At each split, consider only a random subset of features:
   Classification: sqrt(n_features) features per split
   Regression:     n_features/3 features per split

Why this helps:
   Trees on different data → different errors
   Trees on different features → decorrelated predictions
   Averaging decorrelated errors → variance drops by ~1/N
   Bias stays the same.

Out-of-Bag (OOB) Error

The ~37% of samples not seen by each tree form a free validation set. Average OOB predictions across all trees for an unbiased generalization estimate — no separate validation set needed.

Why exactly 37%?

Each bootstrap sample draws N times with replacement from N samples. The probability a specific sample is never drawn is (1 - 1/N)ᴺ. As N grows this converges to 1/e ≈ 0.368 — the same limit that defines Euler's number. So roughly 37% of samples are out-of-bag in every bootstrap, guaranteed by probability theory rather than by design.

OOB score in sklearn

rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
print(rf.oob_score_)  # free validation accuracy estimate

Feature Importance

MDI vs permutation importance

# Mean Decrease in Impurity (MDI) — built-in but biased toward high-cardinality
importances = rf.feature_importances_

# Permutation importance — more reliable
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_val, y_val)
# Measures accuracy drop when a feature is randomly shuffled

Key Hyperparameters

Parameter	Default	Effect
n_estimators	100	Number of trees. More = better, diminishing returns after ~200.
max_depth	None	None = grow until pure. Intentional here — fully grown trees have low bias and high variance, and averaging many of them cancels the variance while preserving the low bias. Setting max_depth introduces bias that averaging cannot fix.
max_features	sqrt(n)	Features per split. Lower = more decorrelation, higher bias.
min_samples_leaf	1	Min samples at leaf. Higher = smoother, less overfit.
n_jobs	-1	Use all CPU cores for parallel training.

▸ Interview Answer

Random Forests train N trees, each on a bootstrap sample with random feature subsets. Averaging their predictions cancels individual tree errors — variance drops by ~1/N while bias stays constant. Key advantages: OOB error estimate, built-in feature importance, robust to outliers, no scaling needed. Disadvantage: less interpretable than a single tree, can still overfit noisy data.

Q&A

Section 5

Gradient Boosting — XGBoost & LightGBM

Gradient boosting builds trees sequentially, each correcting the errors of all previous trees. Unlike Random Forests (parallel, variance reduction), boosting reduces bias by iteratively fixing mistakes.

The Core Idea — Fitting Residuals

Iterative residual fitting

Iteration 0: Start with a simple prediction (mean of y)
  F₀(x) = mean(y) = 50
  Residuals = y - F₀(x)

Iteration 1: Fit a tree to the RESIDUALS
  F₁(x) = F₀(x) + η·Tree₁(x)
  η = learning rate (e.g., 0.1) — shrinks each tree's contribution

Iteration 2: Fit a tree to NEW residuals = y - F₁(x)
  F₂(x) = F₁(x) + η·Tree₂(x)

...repeat for M iterations...
Final: F_M(x) = F₀(x) + η·Σᵢ Treeᵢ(x)

Connection to gradient descent:
  Residuals = negative gradient of MSE loss
  Each tree steps in the direction of steepest descent
  η controls step size (like learning rate in neural nets)

XGBoost — Key Innovations

→Built-in regularization: Adds L1 (α) and L2 (λ) penalty on leaf weights directly into the objective. Prevents overfitting at the tree level.
→Second-order gradients: Uses both gradient (first derivative) and Hessian (second derivative) of the loss. More accurate step direction than vanilla gradient boosting.
→Sparsity awareness: Handles missing values natively by learning the best default direction for missing data at each split.
→Parallel split finding: Sorts features once and caches them. Split finding is parallelized across features (not trees — boosting is inherently sequential).

XGBoost — key parameters

model = xgb.XGBClassifier(
    n_estimators=500,         # number of trees
    learning_rate=0.05,       # η — smaller = more trees, better generalization
    max_depth=6,              # tree depth
    subsample=0.8,            # row subsampling (like bagging)
    colsample_bytree=0.8,     # feature subsampling per tree
    reg_alpha=0.1,            # L1 regularization
    reg_lambda=1.0,           # L2 regularization
    early_stopping_rounds=50, # stop if no improvement
)

LightGBM vs XGBoost

→Leaf-wise vs level-wise growth: LightGBM always splits the highest-gain leaf next; XGBoost completes each full depth level before going deeper. Leaf-wise reaches lower loss faster but can overfit more aggressively.
→Histogram-based split finding: LightGBM bins N samples into K=255 buckets, reducing split computation from O(N log N) to O(K). This is the main source of its speed advantage.
→GOSS (Gradient-based One-Side Sampling): LightGBM focuses training on high-gradient samples (the ones the model is getting wrong) and downsamples low-gradient ones — further improving speed with minimal accuracy loss.
→EFB (Exclusive Feature Bundling): LightGBM bundles sparse mutually exclusive features into single features, reducing effective feature count and speeding up split finding.
→Second-order gradients: XGBoost uses both gradient and Hessian by default for a more accurate step direction — its main accuracy advantage over LightGBM.

Critical Hyperparameters

Parameter	Range	Effect
n_estimators	100–1000	More trees = lower bias. Use early stopping to find optimal.
learning_rate	0.01–0.3	Lower η → need more trees but better generalization. Set η=0.05 with 1000 trees.
max_depth	3–8	Controls tree complexity. Boosting typically uses shallow trees (3–6).
subsample	0.6–1.0	Row sampling per tree. Adds randomness, reduces overfitting.
min_child_weight	1–10	Min sum of instance weights in a leaf. Higher = more conservative.

▸ Interview Answer

Gradient boosting builds trees sequentially, each fitting the residuals of the previous ensemble. It reduces bias through iterative error correction — the opposite of Random Forests which reduce variance through averaging. XGBoost adds second-order gradients and regularization for better convergence. LightGBM uses histogram-based splitting and leaf-wise growth for 10–100× speedup on large datasets. In practice gradient boosting almost always outperforms Random Forests on tabular data.

Q&A

Section 6

Support Vector Machines

SVMs find the hyperplane that maximally separates two classes — the decision boundary with the largest margin. Training points closest to the boundary are called support vectors.

Maximum Margin — Hard Margin SVM

Optimization objective

\text{Margin} = \frac{2}{\|w\|}

\text{minimize:}\quad \tfrac{1}{2}\|w\|^2

\text{subject to:}\quad y_i(w^\top x_i + b) \geq 1 \quad \forall i

Soft Margin — Handling Non-Separable Data

Soft margin with slack variables

\text{minimize:}\quad \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i

\text{subject to:}\quad y_i(w^\top x_i + b) \geq 1 - \xi_i,\quad \xi_i \geq 0

C = 1/\lambda \;\text{(think of }C\text{ as inverse regularization strength)}

The Kernel Trick — Non-Linear Boundaries

SVMs only need dot products between points. Replace with a kernel K(xᵢ, xⱼ) that computes dot products in a higher-dimensional space without explicitly mapping there.

Common kernels

\text{Linear:}\quad K(x,z)=x^\top z \;\rightarrow\; \text{linear boundary}

\text{Polynomial:}\quad K(x,z)=(x^\top z+c)^d \;\rightarrow\; \text{degree-}d\text{ boundary}

\text{RBF/Gaussian:}\quad K(x,z)=\exp(-\gamma\|x-z\|^2) \;\rightarrow\; \text{infinite-dimensional space (default)}

\text{High }\gamma \;\rightarrow\; \text{complex boundary, narrow influence per support vector}

\text{Low }\gamma \;\rightarrow\; \text{smooth boundary, wide influence}

SVM Hyperparameters

Parameter	Values	Effect
C	0.001, 0.01, 0.1, 1, 10, 100	Regularization. Start at C=1, tune via cross-validation.
kernel	rbf, linear, poly	rbf is default and usually best. Use linear for high-D sparse data (text).
gamma	scale, auto, 0.001–1	RBF width. scale = 1/(n_features × X.var()) is a good default.
degree	2, 3, 4	Polynomial kernel degree only.

When SVMs Excel

→High-dimensional sparse data: Text classification with TF-IDF features — linear SVM is extremely fast and effective.
→Small datasets: Good generalization on small training sets where neural networks would overfit.
→Clear margin of separation: When classes are well-separated, SVMs find the optimal boundary by definition.

▸ Interview Answer

SVMs find the maximum-margin hyperplane — the decision boundary furthest from all training points. Only the support vectors (closest points to the boundary) determine the solution. The kernel trick enables non-linear boundaries by implicitly mapping to higher dimensions. SVMs work best on small-to-medium datasets with high-dimensional features (e.g., text). They scale poorly to large datasets (O(N²) to O(N³) training time) — use gradient boosting or neural networks instead.

Q&A

Section 7

Regularization — L1, L2, Elastic Net

Regularization adds a penalty on model weights to the loss function, discouraging the model from fitting noise. It is the primary tool for reducing overfitting.

L2 Regularization — Ridge

L2 math

\text{Loss} = \text{Original Loss} + \lambda \sum_i w_i^2 = \text{Original Loss} + \lambda \|w\|^2

w_i \leftarrow w_i(1 - 2\eta\lambda) - \eta \cdot \nabla \quad \text{(weight decay every step)}

L1 Regularization — Lasso

L1 math

\text{Loss} = \text{Original Loss} + \lambda \sum_i |w_i| = \text{Original Loss} + \lambda \|w\|_1

\frac{\partial L}{\partial w_i} \mathrel{+}= \lambda \cdot \text{sign}(w_i) \quad (\text{sign}(w_i) = +1 \text{ if } w_i > 0,\; -1 \text{ if } w_i < 0)

L1 vs L2 vs Elastic Net

	L1 (Lasso)	L2 (Ridge)	Elastic Net
Penalty	λ·\|\|w\|\|₁	λ·\|\|w\|\|₂²	λ₁·\|\|w\|\|₁ + λ₂·\|\|w\|\|₂²
Effect on weights	Can push to exactly 0	Shrinks, rarely zero	Sparse + stable
Feature selection	Yes — automatic	No — keeps all features	Grouped selection
Use when	Few features are relevant	Many features contribute a little	Correlated features
sklearn	Lasso, penalty='l1'	Ridge, penalty='l2'	ElasticNet(l1_ratio=0.5)

Choosing λ via cross-validation

# Note: sklearn uses C = 1/λ, so SMALL C = STRONG regularization
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
gs = GridSearchCV(LogisticRegression(), param_grid, cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_)

▸ Interview Answer

L2 (Ridge) shrinks all weights toward zero — smooth solution, keeps all features. L1 (Lasso) can push weights to exactly zero — sparse solution, automatic feature selection. Use L1 when most features are irrelevant. Use L2 when many features contribute a little. Use Elastic Net for correlated features (L1 arbitrarily picks one from a correlated group; Elastic Net selects the group together). λ is always tuned via cross-validation.

Q&A

Section 8

Bias-Variance Trade-off

The fundamental tension in supervised learning between underfitting (high bias) and overfitting (high variance). Every modeling decision involves this trade-off.

The decomposition

\mathbb{E}[\text{test error}] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

The Proof

Setup: the true target is y = f(x) + ε where E[ε] = 0 and E[ε²] = σ². Let f̄ = E[f̂] be the expected prediction of the model.

Step 1 — Add and subtract f̄ = E[f̂]

\mathbb{E}[(y-\hat{f})^2]

= \mathbb{E}[(y-\bar{f}+\bar{f}-\hat{f})^2]

= \mathbb{E}[(y-\bar{f})^2] + 2\,\mathbb{E}[(y-\bar{f})(\bar{f}-\hat{f})] + \mathbb{E}[(\bar{f}-\hat{f})^2]

Step 2 — Cross term vanishes (E[f̄ − f̂] = 0 since f̄ = E[f̂])

= \mathbb{E}[(y-\bar{f})^2] + \mathbb{E}[(\bar{f}-\hat{f})^2]

= \mathbb{E}[(f(x)+\varepsilon-\bar{f})^2] + \mathrm{Var}(\hat{f})

Step 3 — Expand the first term

= \mathbb{E}[(f(x)-\bar{f})^2] + 2\,\mathbb{E}[\varepsilon(f(x)-\bar{f})] + \mathbb{E}[\varepsilon^2] + \mathrm{Var}(\hat{f})

Step 4 — Cross term vanishes (E[ε] = 0, ε ⊥ f(x)−f̄); collect

= (f(x)-\bar{f})^2 + \mathrm{Var}(\hat{f}) + \sigma^2

= \underbrace{(f(x)-\bar{f})^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}(\hat{f})}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}

Algorithm Positions on the Spectrum

Algorithm	Bias	Variance
Logistic Regression	High (assumes linearity)	Low (stable predictions)
Decision Tree (deep)	Low (flexible)	High (sensitive to data)
Decision Tree (shallow)	Medium	Medium
Random Forest	Low–Medium	Low (averaging reduces variance)
Gradient Boosting	Low (iterative correction)	Medium (needs tuning)
SVM (linear)	High	Low
SVM (RBF, high γ)	Low	High
Large Neural Network	Very Low	High (needs regularization)
Neural Network + Dropout	Low	Lower

Diagnosing Your Model

Four diagnostic patterns

High train error + High test error   → HIGH BIAS (underfitting)
  Fix: increase model complexity, add features, use more powerful algorithm

Low train error + High test error    → HIGH VARIANCE (overfitting)
  Fix: more training data, reduce complexity, regularization,
       ensemble methods, feature selection

Low train error + Low test error     → GOOD MODEL ✓

High train error + Low test error    → Impossible
  (would mean test is easier than training)

▸ Interview Answer

Bias is error from wrong model assumptions — the model is too simple to capture the pattern. Variance is error from sensitivity to training data — the model learned noise. High bias → underfitting. High variance → overfitting. Random Forests reduce variance by averaging many decorrelated trees. Boosting reduces bias by iteratively correcting errors. Regularization reduces variance by penalizing complexity. The goal is the model complexity that minimizes bias² + variance.

Section 9

Cross-Validation

The standard way to estimate model generalization and tune hyperparameters using only training data — without touching the test set.

K-Fold Cross-Validation

K=5 example

Split training data into K equal folds:

Fold 1: [VAL][TRN][TRN][TRN][TRN] → score₁
Fold 2: [TRN][VAL][TRN][TRN][TRN] → score₂
Fold 3: [TRN][TRN][VAL][TRN][TRN] → score₃
Fold 4: [TRN][TRN][TRN][VAL][TRN] → score₄
Fold 5: [TRN][TRN][TRN][TRN][VAL] → score₅

CV Score = mean(score₁, ..., score₅)
Std Dev  = std(score₁, ..., score₅)   ← tells you stability

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f'{scores.mean():.3f} ± {scores.std():.3f}')

K value	Bias	Variance / Cost
K=5	Slightly higher (trains on 80%)	Lower variance, 5 model fits
K=10	Lower (trains on 90%)	Higher variance, 10 model fits
K=N (LOO)	Lowest (trains on N-1)	Very high variance, N model fits — expensive

K=5 or K=10 are the standard choices. K=10 for smaller datasets.

Stratified K-Fold

For classification, always use Stratified K-Fold — ensures each fold has the same class distribution as the full dataset. Critical for imbalanced data.

Stratified + nested CV

from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score

# Standard stratified CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)

# Nested CV — unbiased estimate when tuning hyperparameters
# If you tune params using CV then report that score → optimistic bias
inner_cv = StratifiedKFold(n_splits=5)
outer_cv = StratifiedKFold(n_splits=5)
clf = GridSearchCV(estimator=model, param_grid=params, cv=inner_cv)
nested_score = cross_val_score(clf, X, y, cv=outer_cv)  # unbiased

Time Series Cross-Validation

TimeSeriesSplit — prevent future leakage

Standard K-Fold LEAKS future data into training.
For time series, always use TimeSeriesSplit:

Split 1: [TRN] [VAL]
Split 2: [TRN TRN] [VAL]
Split 3: [TRN TRN TRN] [VAL]

Training always precedes validation in time.

from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

▸ Interview Answer

Cross-validation estimates generalization by training and evaluating on K different splits of training data. Use Stratified K-Fold for classification to preserve class ratios. Use Nested CV when tuning hyperparameters to avoid optimistic bias. Use TimeSeriesSplit for temporal data to prevent leakage. K=5 or K=10 are the standard choices.

Section 10

Feature Engineering

Transforming raw data into features that better represent the underlying pattern. Often the highest-leverage activity in a classical ML project — better features beat better algorithms.

Numerical Features — Scaling

Most algorithms (SVMs, logistic regression, KNN) require scaled features. Tree-based models (Random Forests, XGBoost) do not need scaling.

Scaling options

StandardScaler:  z = (x - mean) / std
  → mean=0, std=1. Use for normally distributed features. Affected by outliers.

MinMaxScaler:    x' = (x - min) / (max - min)
  → Scales to [0,1]. Very sensitive to outliers.

RobustScaler:    x' = (x - median) / IQR
  → Uses median and IQR. Best for skewed distributions with outliers.

Log transform:   x' = log(x + 1)
  → For right-skewed distributions (income, emissions values).
  → Makes multiplicative relationships additive.

Categorical Features

Encoding strategies

ONE-HOT ENCODING:
  Color = [red, blue, green]  →  is_red, is_blue, is_green
  Use for nominal (unordered) categories with low cardinality (<20).
  Drop one category to avoid multicollinearity (drop_first=True).

ORDINAL ENCODING:
  Size = [small, medium, large]  →  [0, 1, 2]
  Use when categories have a natural order.

TARGET ENCODING (Mean Encoding):
  Replace each category with the mean target for that category.
  Very powerful for high-cardinality categories.
  DANGER: use within CV folds to prevent leakage.

HASH ENCODING:
  For very high cardinality (millions of categories).
  Hash category to a fixed number of buckets.

Feature Interactions

Manual interactions + polynomial features

# Domain-specific interaction terms
df['emissions_per_revenue'] = df['emissions'] / df['revenue']
df['high_emissions_low_audit'] = (
    (df['emissions_score'] > 70) & (df['audit_score'] < 30)
).astype(int)

# Polynomial features (for logistic regression / SVM)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # adds x₁², x₂², x₁x₂ etc.

# Note: XGBoost learns interactions automatically
# Polynomial features are mainly for linear models

Handling Missing Values

Imputation strategies

Simple imputation:
  Mean/Median: for numerical, when data is MAR (Missing at Random)
  Mode:        for categorical
  Constant:    fill with 0 or 'Unknown'

Advanced imputation:
  KNN imputation:        fill based on K nearest complete samples
  Iterative (MICE):      model each feature from all others

Missingness as a feature:
  df['emissions_is_missing'] = df['emissions'].isnull().astype(int)
  Often the fact that data is missing is itself informative.
  (e.g., supplier refusing to disclose emissions = high risk signal)

XGBoost/LightGBM handle missing values natively — no imputation needed.

▸ Interview Answer

Feature engineering is often more impactful than algorithm choice. Scale numerical features for linear models and SVMs (not needed for trees). One-hot encode low-cardinality categoricals, target-encode high-cardinality ones (within CV folds). Create domain-specific interaction features (e.g., emissions per dollar of revenue). Treat missingness as a feature when absence is itself informative.

Section 11

When Classical ML Beats Deep Learning

The 6 scenarios where you should reach for classical models over neural networks.

1

Small Datasets (< 10K samples)

Deep learning requires large amounts of data to learn useful representations. On small datasets it memorizes training examples rather than learning patterns.

< 1,000 samples:   Logistic Regression or SVM
1K - 10K samples:  Random Forest or XGBoost
10K - 100K:        XGBoost usually competitive with DL
> 100K samples:    Deep learning starts to win

Exception: Transfer learning (fine-tuned BERT) can work on small datasets.

2

Tabular / Structured Data

On tabular data, gradient boosting consistently outperforms deep learning. 2022 NeurIPS paper 'Why tree-based models still outperform deep learning on tabular data' (Grinsztajn et al.): XGBoost outperformed all DL methods on 45/45 tabular datasets. Sustainability data is almost entirely tabular — supplier scorecards, emissions inventories, regulatory filings.

3

Interpretability Required

In regulated domains (finance, healthcare, sustainability auditing) you need to explain WHY a model made a prediction.

Logistic Regression:  weights show feature contribution
Decision Tree:        visualize the exact decision path
XGBoost + SHAP:       best of both worlds — power + explanations

Example: 'Why is this supplier flagged as high-risk?'
  Decision tree: 'emissions_score > 70 AND audit_failures > 2'
  Neural network: 512 weights across 8 layers → hard to explain

4

Training Speed and Compute Constraints

XGBoost on CPU:     seconds to minutes
Random Forest:      trivially parallelizable on CPU
Neural network:     requires GPU, hours to days

Use classical ML when: rapid prototyping, limited budget,
need to retrain frequently, edge deployment without GPU.

5

Explicit Domain Features Available

When domain experts can hand-craft meaningful features, classical models use them directly. Neural networks have to rediscover these features from raw inputs.

emissions_per_dollar_revenue  ← you engineered this
yoy_emissions_change          ← you engineered this
audit_failure_rate            ← you engineered this

A logistic regression with 3 good features may outperform a neural
network on raw supplier data.

6

Causal Inference and Statistical Testing

When you need causal effects, hypothesis tests, or confidence intervals — classical statistical models are required.

Neural networks:    predict well, cannot estimate causal effects
Logistic regression: coefficients with p-values and CIs
Decision trees:      transparent enough for causal interpretation

Example: 'Does this intervention reduce supplier emissions?'
  → Causal inference (DiD, IV, RDD), not deep learning

Summary Table

Use Classical ML When…	Use Deep Learning When…
Small dataset (< 10K samples)	Large dataset (> 100K samples)
Tabular/structured data	Unstructured data (images, text, audio)
Interpretability required	Raw perceptual inputs (pixels, waveforms)
Limited compute budget	GPU/TPU available
Domain features available	Features must be learned from scratch
Causal inference needed	Pattern recognition / generation tasks
Fast iteration required	Long-running training is acceptable

▸ Interview One-Liner

My default for tabular data is always XGBoost first — it wins on tabular benchmarks, trains fast, handles missing values, and SHAP makes it explainable. I reach for deep learning when the data is unstructured (text, images), the dataset is very large (> 100K), or I need representation learning. For sustainability data — supplier tables, emissions inventories, regulatory filings — classical ML is almost always the right first choice.

Q&A

Section 12

Quick Reference Cheat Sheet

Algorithm One-Liners

▸Logistic Regression → linear classifier, sigmoid output, convex loss, use as baseline
▸Decision Tree → recursive binary splits on Gini/entropy, interpretable but high variance
▸Random Forest → N trees on bootstrap + random features, variance ↓ by ~1/N
▸XGBoost → sequential trees fitting residuals, L1/L2 built in, best for tabular
▸LightGBM → leaf-wise boosting, histogram splits, 10–100× faster on large data
▸SVM → maximum margin hyperplane, kernel trick, best for small high-D data

Regularization

▸L1 (Lasso) → sparse weights, feature selection, diamond constraint
▸L2 (Ridge) → small weights, keeps all features, sphere constraint
▸Elastic Net → L1 + L2, best for correlated features
▸λ tuning → always use cross-validation
▸sklearn: C = 1/λ → small C = strong regularization

Bias-Variance Diagnostics

▸High train + high test error → High Bias → increase complexity
▸Low train + high test error → High Variance → regularize, more data
▸Random Forest → fixes variance by averaging
▸Boosting → fixes bias by correcting errors
▸Regularization → always reduces variance, may increase bias

Feature Scaling Rules

▸SVM, Logistic Regression, KNN → ALWAYS scale (StandardScaler)
▸Decision Trees, Random Forest, XGBoost → scaling NOT needed
▸Skewed distributions → log transform first
▸Outliers → use RobustScaler not StandardScaler

When Classical Beats DL

▸Tabular data → XGBoost (always try first)
▸Small dataset < 10K → Logistic Regression or RF
▸Interpretability required → Logistic Regression + SHAP
▸Causal inference → linear models only
▸Limited compute → tree models on CPU

Cross-Validation Rules

▸Always stratify for classification
▸K=5 or K=10 are standard — K=10 for small data
▸Nested CV when tuning params to avoid optimistic bias
▸TimeSeriesSplit for temporal data — never standard K-Fold
▸OOB score in Random Forest = free K-Fold equivalent