> study notes · amazon applied scientist ii
Classical Machine Learning
The algorithms that still dominate tabular data and production ML: decision trees, random forests, gradient boosting, SVMs, logistic regression, regularization, bias-variance, cross-validation, and feature engineering.
- Linear Regression
- Logistic Regression
- Decision Trees
- Random Forests
- XGBoost
- LightGBM
- SVMs
- L1/L2
- Bias-Variance
- Cross-Validation
- Feature Engineering
> Table of Contents
Section 1
Linear Regression
Models a continuous output as a linear combination of input features. The simplest supervised learning algorithm and the foundation of logistic regression, regularization, and most of the statistical thinking behind ML.
Loss Function — MSE vs MAE
What each optimizes toward
| MSE | MAE | |
|---|---|---|
| Optimal estimator | Mean of target distribution | Median of target distribution |
| Sensitivity to outliers | High — squares large errors | Low — linear penalty |
| Gradient behavior | Smooth everywhere | Non-differentiable at zero |
| Optimization | Easier — smooth gradients | Harder — needs subgradients or IRLS |
Gradient behavior matters for linear regression specifically:
MAE gradient is discontinuous at zero — requires subgradient methods or iterative reweighted least squares (IRLS). This makes closed-form solutions unavailable for MAE.
MAE has no closed-form equivalent.
Huber loss — the practical middle ground
Quadratic for small errors like MSE, linear for large errors like MAE. Smooth gradients + outlier robustness. δ is a tunable threshold that defines what counts as a "small" error.
Solving — Two Approaches
Normal Equation — Full Derivation
Where X ∈ ℝⁿˣᵈ is the design matrix, y ∈ ℝⁿ the target vector, and θ ∈ ℝᵈ the parameters.
(Since y⊤Xθ is a scalar it equals its own transpose, so the two middle terms merge.)
Geometric intuition
X⊤X is invertible only when X has full column rank — no multicollinearity. Geometrically, ŷ = Xθ̂ is the orthogonal projection of y onto the column space of X. The residual y − Xθ̂ is perpendicular to every column of X, which is exactly what the normal equations encode:
Key Assumptions (LINE)
- →Linearity: Relationship between X and y is linear. Check: residual plot should show no pattern.
- →Independence: Observations are independent. Violated by time series data → use time series models.
- →Homoscedasticity: Constant variance of residuals across all X values. Fix if violated: log-transform the target y.
- →No Multicollinearity: Features should not be highly correlated. Check: Variance Inflation Factor (VIF > 10 = problem). Fix: remove one correlated feature, or use Ridge.
- →Normality of residuals: Required for valid hypothesis tests on coefficients. Less important for pure prediction tasks.
Evaluation Metrics
Linear vs Logistic Regression
| Linear Regression | Logistic Regression |
|---|---|
| Predicts a continuous value | Predicts a probability (0 to 1) |
| Output: ŷ = wᵀx + b (unbounded) | Output: σ(wᵀx + b) ∈ (0, 1) |
| Loss: Mean Squared Error | Loss: Binary Cross-Entropy |
| Use: predicting emissions values, revenue | Use: classifying high/low risk suppliers |
| Assumptions: normality, homoscedasticity | Fewer distributional assumptions |
▸ Interview Answer
Linear regression models continuous outputs as a weighted sum of features. The MSE loss is convex so gradient descent always finds the global minimum. For small feature sets the Normal Equation gives an exact closed-form solution. Key diagnostics: check residual plots for linearity, VIF for multicollinearity, and use Adjusted R² when comparing models. For sustainability use cases — predicting supplier emissions values, estimating carbon footprint from product attributes — linear regression with feature engineering is often a strong, interpretable baseline.
Q&A
Section 2
Logistic Regression
A linear classifier that models the probability of a binary outcome. Despite the name it is a classification algorithm. It is the foundation of many neural network concepts and the standard strong baseline.
Key Properties
- →Linear decision boundary: Can only separate classes with a straight line (hyperplane in N dims). Cannot learn XOR or circular boundaries without feature engineering.
- →Probabilistic output: Unlike SVMs, gives calibrated probabilities. Useful when you need P(class), not just the class label.
- →Interpretable: Each weight wᵢ tells you how much feature i contributes to the log-odds. Can audit which features drive predictions.
- →Fast to train: Convex loss function — gradient descent always finds the global minimum. No local optima.
Decision Boundary
The decision boundary is the set of points where ŷ = 0.5, which means σ(z) = 0.5 → z = 0 → Xw + b = 0. This is a hyperplane in feature space — a line in 2D, a plane in 3D.
Multiclass Extension — Softmax
▸ Interview Answer
Logistic regression models P(y=1|x) by applying a sigmoid to a linear combination of features. The log loss penalizes confident wrong predictions. It is convex so always finds the global optimum. I use it as a strong baseline — if it performs well, the problem is likely linearly separable and you don't need a complex model.
Assumptions
Fewer than linear regression — no normally distributed residuals, no homoscedasticity. But it has its own set.
Binary outcome
Dependent variable must be categorical (binary for standard logistic regression). Can't predict continuous values like emissions tonnage.
Independence of observations
Each example must be independent. Violated by time series, clustered data (suppliers within regions), or repeated measures. Fix: mixed effects models or account for the grouping structure.
Linearity of log-odds
The key assumption. Linear relationship between features and the log-odds of the outcome — not the probability directly. Check by plotting log-odds against each continuous feature. Fix by adding polynomial terms or binning if non-linear.
No multicollinearity
Correlated features cause unstable coefficient estimates — the model can't determine which deserves credit. Check VIF > 10. Fix: drop one, use Ridge (L2), or PCA.
No extreme outliers
A single extreme point can heavily shift the decision boundary. Check: z-scores (|z| > 3). Fix: remove true errors, robust scaling, or winsorization.
Large enough sample size
Rule of thumb: 10–20 events (minority class samples) per predictor variable. With D features need at least 10D minority class examples. Use L2 regularization to stabilize estimates when N is small.
| Assumption | Linear Reg | Logistic Reg |
|---|---|---|
| Binary/categorical outcome | ✗ | ✓ required |
| Independence | ✓ | ✓ |
| Linear relationship | in y | in log-odds |
| Normally distributed errors | ✓ | ✗ not needed |
| Homoscedasticity | ✓ | ✗ not needed |
| No multicollinearity | ✓ | ✓ |
| No extreme outliers | ✓ | ✓ |
| Large N | ✓ | ✓ |
▸ Interview One-Liner
Logistic regression assumes: binary outcome, independent observations, linearity in the log-odds (not the probability), no severe multicollinearity, no extreme outliers, and sufficient sample size — roughly 10–20 events per feature. It does not require normally distributed residuals or homoscedasticity — those are linear regression assumptions. The most commonly violated in practice is log-odds linearity, which you check by plotting log-odds against each continuous feature and fix by adding polynomial terms.
Why Log Loss? — The MLE Derivation
Log loss isn't an arbitrary choice. It is the unique loss that follows from modeling labels as Bernoulli random variables and applying maximum likelihood estimation.
▸ Interview One-Liner
MLE asks: what weights make the observed training data most probable? If we model each label as a Bernoulli random variable with P(y=1|x) = σ(wᵀx), the joint likelihood of all N observations is a product of Bernoullis. Taking the log turns that product into a sum. Negating to convert maximization to minimization gives exactly binary cross-entropy. Log loss isn't an arbitrary choice — it's the unique loss that follows from assuming Bernoulli labels and applying MLE.
Why MSE Is Non-Convex for Logistic Regression
Plugging the sigmoid prediction directly into MSE seems natural but breaks convexity — this is why logistic regression uses cross-entropy instead.
σ is a nonlinear function of θ. Composing a quadratic loss with a nonlinear function destroys convexity — the resulting surface can have multiple local minima that gradient descent gets stuck in.
Proof — the Hessian is not PSD everywhere
Convexity requires the Hessian H = ∇²L(θ) to be positive semi-definite (all eigenvalues ≥ 0) at every point. Expanding the second derivative of σ:
The factor (1 − 2σ) flips sign depending on whether σ > 0.5 or σ < 0.5. That sign flip means the Hessian is not guaranteed PSD everywhere — convexity breaks. Geometrically: linear-regression MSE is a single bowl with one global minimum reachable from any starting point; MSE-on-sigmoid has ridges and local minima that trap gradient descent depending on where it starts.
Why Cross-Entropy Fixes It
The log and sigmoid cancel each other's nonlinearity elegantly — every diagonal entry of S is positive since ŷᵢ ∈ (0,1), so XᵗSX is positive semi-definite by construction: guaranteed convex, one global minimum, gradient descent always converges.
The log-sigmoid cancellation, step by step
Work with the raw logit z = θᵗx (before the sigmoid) and one sample with label y ∈ {0,1}:
The sigmoid has disappeared from both terms — only logs and exponentials of the raw linear output z remain.
The sigmoid is completely gone — the loss is now a function of the raw logit z alone (this is the softplus form of log loss).
The log undoes the compressive nonlinearity of the sigmoid. What remains is softplus log(1+e⁻ᶻ) — smooth and convex — plus a linear term (1−y)z, which is trivially convex. The sum of convex functions is convex.
| MSE on σ(θᵗx) | Cross-Entropy | |
|---|---|---|
| Convex? | No | Yes |
| Why | Sigmoid nonlinearity makes the Hessian indefinite | Log cancels sigmoid — Hessian σ(1−σ)·xxᵗ is always PSD |
| Gradient | Nested derivatives of σ — messy | ŷ − y — clean and linear |
| Optimization | Can get stuck in local minima | Always converges to the global minimum |
▸ Interview One-Liner
Plugging σ(θᵗx) into MSE composes a quadratic with a nonlinear sigmoid, and the term (1−2σ) in the second derivative flips sign — so the Hessian isn't PSD everywhere and the loss surface gets local minima. Cross-entropy avoids this because the log and the sigmoid cancel: rewritten in terms of the raw logit z, the loss reduces to softplus log(1+e⁻ᶻ) plus a linear term, both convex. The choice of cross-entropy for classification isn't arbitrary — it's the loss that restores convexity once the output passes through a sigmoid.
Q&A
Section 3
Decision Trees
A decision tree recursively partitions the feature space by asking binary questions. Each internal node is a feature threshold; each leaf is a prediction.
Splitting Criterion
Overfitting — The Main Problem
An unconstrained tree grows until every leaf is pure — perfectly memorizing training data.
- →max_depth: Limit tree depth (e.g., max_depth=5). Most important hyperparameter.
- →min_samples_split: Minimum samples required to split a node.
- →min_samples_leaf: Minimum samples required at a leaf.
- →min_impurity_decrease: Only split if the gain exceeds this threshold.
- →Post-pruning: Grow full tree, then remove branches that don't improve validation performance (cost-complexity pruning).
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| Interpretable — visualize every decision | High variance — small data changes → very different tree |
| Handles mixed feature types (numeric + categorical) | Biased toward high-cardinality features |
| No feature scaling needed | Cannot extrapolate beyond training data range |
| Fast inference — O(depth) per sample | Only axis-aligned splits — no diagonal boundaries |
▸ Interview Answer
Decision trees split the feature space recursively using Gini impurity or information gain. Interpretable and handle mixed types, but suffer from high variance — small changes in data produce very different trees. This is why they are almost always used inside ensemble methods (Random Forests, Gradient Boosting) rather than alone.
Section 4
Random Forests
Random Forests fix the high variance of individual trees by training many trees on different random subsets of data and features, then averaging predictions.
Animation — How Random Forests Work
Two Sources of Randomness
1. BAGGING (Bootstrap Aggregating):
Each tree trains on a bootstrap sample — N samples drawn WITH replacement.
~63% of samples appear, ~37% are never seen (out-of-bag).
2. FEATURE RANDOMNESS:
At each split, consider only a random subset of features:
Classification: sqrt(n_features) features per split
Regression: n_features/3 features per split
Why this helps:
Trees on different data → different errors
Trees on different features → decorrelated predictions
Averaging decorrelated errors → variance drops by ~1/N
Bias stays the same.Out-of-Bag (OOB) Error
The ~37% of samples not seen by each tree form a free validation set. Average OOB predictions across all trees for an unbiased generalization estimate — no separate validation set needed.
Why exactly 37%?
Each bootstrap sample draws N times with replacement from N samples. The probability a specific sample is never drawn is (1 - 1/N)ᴺ. As N grows this converges to 1/e ≈ 0.368 — the same limit that defines Euler's number. So roughly 37% of samples are out-of-bag in every bootstrap, guaranteed by probability theory rather than by design.
rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
print(rf.oob_score_) # free validation accuracy estimateFeature Importance
# Mean Decrease in Impurity (MDI) — built-in but biased toward high-cardinality
importances = rf.feature_importances_
# Permutation importance — more reliable
from sklearn.inspection import permutation_importance
result = permutation_importance(rf, X_val, y_val)
# Measures accuracy drop when a feature is randomly shuffledKey Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
| n_estimators | 100 | Number of trees. More = better, diminishing returns after ~200. |
| max_depth | None | None = grow until pure. Intentional here — fully grown trees have low bias and high variance, and averaging many of them cancels the variance while preserving the low bias. Setting max_depth introduces bias that averaging cannot fix. |
| max_features | sqrt(n) | Features per split. Lower = more decorrelation, higher bias. |
| min_samples_leaf | 1 | Min samples at leaf. Higher = smoother, less overfit. |
| n_jobs | -1 | Use all CPU cores for parallel training. |
▸ Interview Answer
Random Forests train N trees, each on a bootstrap sample with random feature subsets. Averaging their predictions cancels individual tree errors — variance drops by ~1/N while bias stays constant. Key advantages: OOB error estimate, built-in feature importance, robust to outliers, no scaling needed. Disadvantage: less interpretable than a single tree, can still overfit noisy data.
Q&A
Section 5
Gradient Boosting — XGBoost & LightGBM
Gradient boosting builds trees sequentially, each correcting the errors of all previous trees. Unlike Random Forests (parallel, variance reduction), boosting reduces bias by iteratively fixing mistakes.
The Core Idea — Fitting Residuals
Iteration 0: Start with a simple prediction (mean of y)
F₀(x) = mean(y) = 50
Residuals = y - F₀(x)
Iteration 1: Fit a tree to the RESIDUALS
F₁(x) = F₀(x) + η·Tree₁(x)
η = learning rate (e.g., 0.1) — shrinks each tree's contribution
Iteration 2: Fit a tree to NEW residuals = y - F₁(x)
F₂(x) = F₁(x) + η·Tree₂(x)
...repeat for M iterations...
Final: F_M(x) = F₀(x) + η·Σᵢ Treeᵢ(x)
Connection to gradient descent:
Residuals = negative gradient of MSE loss
Each tree steps in the direction of steepest descent
η controls step size (like learning rate in neural nets)XGBoost — Key Innovations
- →Built-in regularization: Adds L1 (α) and L2 (λ) penalty on leaf weights directly into the objective. Prevents overfitting at the tree level.
- →Second-order gradients: Uses both gradient (first derivative) and Hessian (second derivative) of the loss. More accurate step direction than vanilla gradient boosting.
- →Sparsity awareness: Handles missing values natively by learning the best default direction for missing data at each split.
- →Parallel split finding: Sorts features once and caches them. Split finding is parallelized across features (not trees — boosting is inherently sequential).
model = xgb.XGBClassifier(
n_estimators=500, # number of trees
learning_rate=0.05, # η — smaller = more trees, better generalization
max_depth=6, # tree depth
subsample=0.8, # row subsampling (like bagging)
colsample_bytree=0.8, # feature subsampling per tree
reg_alpha=0.1, # L1 regularization
reg_lambda=1.0, # L2 regularization
early_stopping_rounds=50, # stop if no improvement
)LightGBM vs XGBoost
- →Leaf-wise vs level-wise growth: LightGBM always splits the highest-gain leaf next; XGBoost completes each full depth level before going deeper. Leaf-wise reaches lower loss faster but can overfit more aggressively.
- →Histogram-based split finding: LightGBM bins N samples into K=255 buckets, reducing split computation from O(N log N) to O(K). This is the main source of its speed advantage.
- →GOSS (Gradient-based One-Side Sampling): LightGBM focuses training on high-gradient samples (the ones the model is getting wrong) and downsamples low-gradient ones — further improving speed with minimal accuracy loss.
- →EFB (Exclusive Feature Bundling): LightGBM bundles sparse mutually exclusive features into single features, reducing effective feature count and speeding up split finding.
- →Second-order gradients: XGBoost uses both gradient and Hessian by default for a more accurate step direction — its main accuracy advantage over LightGBM.
Critical Hyperparameters
| Parameter | Range | Effect |
|---|---|---|
| n_estimators | 100–1000 | More trees = lower bias. Use early stopping to find optimal. |
| learning_rate | 0.01–0.3 | Lower η → need more trees but better generalization. Set η=0.05 with 1000 trees. |
| max_depth | 3–8 | Controls tree complexity. Boosting typically uses shallow trees (3–6). |
| subsample | 0.6–1.0 | Row sampling per tree. Adds randomness, reduces overfitting. |
| min_child_weight | 1–10 | Min sum of instance weights in a leaf. Higher = more conservative. |
▸ Interview Answer
Gradient boosting builds trees sequentially, each fitting the residuals of the previous ensemble. It reduces bias through iterative error correction — the opposite of Random Forests which reduce variance through averaging. XGBoost adds second-order gradients and regularization for better convergence. LightGBM uses histogram-based splitting and leaf-wise growth for 10–100× speedup on large datasets. In practice gradient boosting almost always outperforms Random Forests on tabular data.
Q&A
Section 6
Support Vector Machines
SVMs find the hyperplane that maximally separates two classes — the decision boundary with the largest margin. Training points closest to the boundary are called support vectors.
Maximum Margin — Hard Margin SVM
Soft Margin — Handling Non-Separable Data
The Kernel Trick — Non-Linear Boundaries
SVMs only need dot products between points. Replace with a kernel K(xᵢ, xⱼ) that computes dot products in a higher-dimensional space without explicitly mapping there.
SVM Hyperparameters
| Parameter | Values | Effect |
|---|---|---|
| C | 0.001, 0.01, 0.1, 1, 10, 100 | Regularization. Start at C=1, tune via cross-validation. |
| kernel | rbf, linear, poly | rbf is default and usually best. Use linear for high-D sparse data (text). |
| gamma | scale, auto, 0.001–1 | RBF width. scale = 1/(n_features × X.var()) is a good default. |
| degree | 2, 3, 4 | Polynomial kernel degree only. |
When SVMs Excel
- →High-dimensional sparse data: Text classification with TF-IDF features — linear SVM is extremely fast and effective.
- →Small datasets: Good generalization on small training sets where neural networks would overfit.
- →Clear margin of separation: When classes are well-separated, SVMs find the optimal boundary by definition.
▸ Interview Answer
SVMs find the maximum-margin hyperplane — the decision boundary furthest from all training points. Only the support vectors (closest points to the boundary) determine the solution. The kernel trick enables non-linear boundaries by implicitly mapping to higher dimensions. SVMs work best on small-to-medium datasets with high-dimensional features (e.g., text). They scale poorly to large datasets (O(N²) to O(N³) training time) — use gradient boosting or neural networks instead.
Q&A
Section 7
Regularization — L1, L2, Elastic Net
Regularization adds a penalty on model weights to the loss function, discouraging the model from fitting noise. It is the primary tool for reducing overfitting.
L2 Regularization — Ridge
L1 Regularization — Lasso
L1 vs L2 vs Elastic Net
| L1 (Lasso) | L2 (Ridge) | Elastic Net | |
|---|---|---|---|
| Penalty | λ·||w||₁ | λ·||w||₂² | λ₁·||w||₁ + λ₂·||w||₂² |
| Effect on weights | Can push to exactly 0 | Shrinks, rarely zero | Sparse + stable |
| Feature selection | Yes — automatic | No — keeps all features | Grouped selection |
| Use when | Few features are relevant | Many features contribute a little | Correlated features |
| sklearn | Lasso, penalty='l1' | Ridge, penalty='l2' | ElasticNet(l1_ratio=0.5) |
# Note: sklearn uses C = 1/λ, so SMALL C = STRONG regularization
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
gs = GridSearchCV(LogisticRegression(), param_grid, cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_)▸ Interview Answer
L2 (Ridge) shrinks all weights toward zero — smooth solution, keeps all features. L1 (Lasso) can push weights to exactly zero — sparse solution, automatic feature selection. Use L1 when most features are irrelevant. Use L2 when many features contribute a little. Use Elastic Net for correlated features (L1 arbitrarily picks one from a correlated group; Elastic Net selects the group together). λ is always tuned via cross-validation.
Q&A
Section 8
Bias-Variance Trade-off
The fundamental tension in supervised learning between underfitting (high bias) and overfitting (high variance). Every modeling decision involves this trade-off.
The Proof
Setup: the true target is y = f(x) + ε where E[ε] = 0 and E[ε²] = σ². Let f̄ = E[f̂] be the expected prediction of the model.
Algorithm Positions on the Spectrum
| Algorithm | Bias | Variance |
|---|---|---|
| Logistic Regression | High (assumes linearity) | Low (stable predictions) |
| Decision Tree (deep) | Low (flexible) | High (sensitive to data) |
| Decision Tree (shallow) | Medium | Medium |
| Random Forest | Low–Medium | Low (averaging reduces variance) |
| Gradient Boosting | Low (iterative correction) | Medium (needs tuning) |
| SVM (linear) | High | Low |
| SVM (RBF, high γ) | Low | High |
| Large Neural Network | Very Low | High (needs regularization) |
| Neural Network + Dropout | Low | Lower |
Diagnosing Your Model
High train error + High test error → HIGH BIAS (underfitting)
Fix: increase model complexity, add features, use more powerful algorithm
Low train error + High test error → HIGH VARIANCE (overfitting)
Fix: more training data, reduce complexity, regularization,
ensemble methods, feature selection
Low train error + Low test error → GOOD MODEL ✓
High train error + Low test error → Impossible
(would mean test is easier than training)▸ Interview Answer
Bias is error from wrong model assumptions — the model is too simple to capture the pattern. Variance is error from sensitivity to training data — the model learned noise. High bias → underfitting. High variance → overfitting. Random Forests reduce variance by averaging many decorrelated trees. Boosting reduces bias by iteratively correcting errors. Regularization reduces variance by penalizing complexity. The goal is the model complexity that minimizes bias² + variance.
Section 9
Cross-Validation
The standard way to estimate model generalization and tune hyperparameters using only training data — without touching the test set.
K-Fold Cross-Validation
Split training data into K equal folds:
Fold 1: [VAL][TRN][TRN][TRN][TRN] → score₁
Fold 2: [TRN][VAL][TRN][TRN][TRN] → score₂
Fold 3: [TRN][TRN][VAL][TRN][TRN] → score₃
Fold 4: [TRN][TRN][TRN][VAL][TRN] → score₄
Fold 5: [TRN][TRN][TRN][TRN][VAL] → score₅
CV Score = mean(score₁, ..., score₅)
Std Dev = std(score₁, ..., score₅) ← tells you stability
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f'{scores.mean():.3f} ± {scores.std():.3f}')| K value | Bias | Variance / Cost |
|---|---|---|
| K=5 | Slightly higher (trains on 80%) | Lower variance, 5 model fits |
| K=10 | Lower (trains on 90%) | Higher variance, 10 model fits |
| K=N (LOO) | Lowest (trains on N-1) | Very high variance, N model fits — expensive |
K=5 or K=10 are the standard choices. K=10 for smaller datasets.
Stratified K-Fold
For classification, always use Stratified K-Fold — ensures each fold has the same class distribution as the full dataset. Critical for imbalanced data.
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
# Standard stratified CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
# Nested CV — unbiased estimate when tuning hyperparameters
# If you tune params using CV then report that score → optimistic bias
inner_cv = StratifiedKFold(n_splits=5)
outer_cv = StratifiedKFold(n_splits=5)
clf = GridSearchCV(estimator=model, param_grid=params, cv=inner_cv)
nested_score = cross_val_score(clf, X, y, cv=outer_cv) # unbiasedTime Series Cross-Validation
Standard K-Fold LEAKS future data into training.
For time series, always use TimeSeriesSplit:
Split 1: [TRN] [VAL]
Split 2: [TRN TRN] [VAL]
Split 3: [TRN TRN TRN] [VAL]
Training always precedes validation in time.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)▸ Interview Answer
Cross-validation estimates generalization by training and evaluating on K different splits of training data. Use Stratified K-Fold for classification to preserve class ratios. Use Nested CV when tuning hyperparameters to avoid optimistic bias. Use TimeSeriesSplit for temporal data to prevent leakage. K=5 or K=10 are the standard choices.
Section 10
Feature Engineering
Transforming raw data into features that better represent the underlying pattern. Often the highest-leverage activity in a classical ML project — better features beat better algorithms.
Numerical Features — Scaling
Most algorithms (SVMs, logistic regression, KNN) require scaled features. Tree-based models (Random Forests, XGBoost) do not need scaling.
StandardScaler: z = (x - mean) / std
→ mean=0, std=1. Use for normally distributed features. Affected by outliers.
MinMaxScaler: x' = (x - min) / (max - min)
→ Scales to [0,1]. Very sensitive to outliers.
RobustScaler: x' = (x - median) / IQR
→ Uses median and IQR. Best for skewed distributions with outliers.
Log transform: x' = log(x + 1)
→ For right-skewed distributions (income, emissions values).
→ Makes multiplicative relationships additive.Categorical Features
ONE-HOT ENCODING:
Color = [red, blue, green] → is_red, is_blue, is_green
Use for nominal (unordered) categories with low cardinality (<20).
Drop one category to avoid multicollinearity (drop_first=True).
ORDINAL ENCODING:
Size = [small, medium, large] → [0, 1, 2]
Use when categories have a natural order.
TARGET ENCODING (Mean Encoding):
Replace each category with the mean target for that category.
Very powerful for high-cardinality categories.
DANGER: use within CV folds to prevent leakage.
HASH ENCODING:
For very high cardinality (millions of categories).
Hash category to a fixed number of buckets.Feature Interactions
# Domain-specific interaction terms
df['emissions_per_revenue'] = df['emissions'] / df['revenue']
df['high_emissions_low_audit'] = (
(df['emissions_score'] > 70) & (df['audit_score'] < 30)
).astype(int)
# Polynomial features (for logistic regression / SVM)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X) # adds x₁², x₂², x₁x₂ etc.
# Note: XGBoost learns interactions automatically
# Polynomial features are mainly for linear modelsHandling Missing Values
Simple imputation:
Mean/Median: for numerical, when data is MAR (Missing at Random)
Mode: for categorical
Constant: fill with 0 or 'Unknown'
Advanced imputation:
KNN imputation: fill based on K nearest complete samples
Iterative (MICE): model each feature from all others
Missingness as a feature:
df['emissions_is_missing'] = df['emissions'].isnull().astype(int)
Often the fact that data is missing is itself informative.
(e.g., supplier refusing to disclose emissions = high risk signal)
XGBoost/LightGBM handle missing values natively — no imputation needed.▸ Interview Answer
Feature engineering is often more impactful than algorithm choice. Scale numerical features for linear models and SVMs (not needed for trees). One-hot encode low-cardinality categoricals, target-encode high-cardinality ones (within CV folds). Create domain-specific interaction features (e.g., emissions per dollar of revenue). Treat missingness as a feature when absence is itself informative.
Section 11
When Classical ML Beats Deep Learning
The 6 scenarios where you should reach for classical models over neural networks.
Small Datasets (< 10K samples)
Deep learning requires large amounts of data to learn useful representations. On small datasets it memorizes training examples rather than learning patterns.
< 1,000 samples: Logistic Regression or SVM
1K - 10K samples: Random Forest or XGBoost
10K - 100K: XGBoost usually competitive with DL
> 100K samples: Deep learning starts to win
Exception: Transfer learning (fine-tuned BERT) can work on small datasets.Tabular / Structured Data
On tabular data, gradient boosting consistently outperforms deep learning. 2022 NeurIPS paper 'Why tree-based models still outperform deep learning on tabular data' (Grinsztajn et al.): XGBoost outperformed all DL methods on 45/45 tabular datasets. Sustainability data is almost entirely tabular — supplier scorecards, emissions inventories, regulatory filings.
Interpretability Required
In regulated domains (finance, healthcare, sustainability auditing) you need to explain WHY a model made a prediction.
Logistic Regression: weights show feature contribution
Decision Tree: visualize the exact decision path
XGBoost + SHAP: best of both worlds — power + explanations
Example: 'Why is this supplier flagged as high-risk?'
Decision tree: 'emissions_score > 70 AND audit_failures > 2'
Neural network: 512 weights across 8 layers → hard to explainTraining Speed and Compute Constraints
XGBoost on CPU: seconds to minutes
Random Forest: trivially parallelizable on CPU
Neural network: requires GPU, hours to days
Use classical ML when: rapid prototyping, limited budget,
need to retrain frequently, edge deployment without GPU.Explicit Domain Features Available
When domain experts can hand-craft meaningful features, classical models use them directly. Neural networks have to rediscover these features from raw inputs.
emissions_per_dollar_revenue ← you engineered this
yoy_emissions_change ← you engineered this
audit_failure_rate ← you engineered this
A logistic regression with 3 good features may outperform a neural
network on raw supplier data.Causal Inference and Statistical Testing
When you need causal effects, hypothesis tests, or confidence intervals — classical statistical models are required.
Neural networks: predict well, cannot estimate causal effects
Logistic regression: coefficients with p-values and CIs
Decision trees: transparent enough for causal interpretation
Example: 'Does this intervention reduce supplier emissions?'
→ Causal inference (DiD, IV, RDD), not deep learningSummary Table
| Use Classical ML When… | Use Deep Learning When… |
|---|---|
| Small dataset (< 10K samples) | Large dataset (> 100K samples) |
| Tabular/structured data | Unstructured data (images, text, audio) |
| Interpretability required | Raw perceptual inputs (pixels, waveforms) |
| Limited compute budget | GPU/TPU available |
| Domain features available | Features must be learned from scratch |
| Causal inference needed | Pattern recognition / generation tasks |
| Fast iteration required | Long-running training is acceptable |
▸ Interview One-Liner
My default for tabular data is always XGBoost first — it wins on tabular benchmarks, trains fast, handles missing values, and SHAP makes it explainable. I reach for deep learning when the data is unstructured (text, images), the dataset is very large (> 100K), or I need representation learning. For sustainability data — supplier tables, emissions inventories, regulatory filings — classical ML is almost always the right first choice.
Q&A
Section 12
Quick Reference Cheat Sheet
Algorithm One-Liners
- ▸Logistic Regression → linear classifier, sigmoid output, convex loss, use as baseline
- ▸Decision Tree → recursive binary splits on Gini/entropy, interpretable but high variance
- ▸Random Forest → N trees on bootstrap + random features, variance ↓ by ~1/N
- ▸XGBoost → sequential trees fitting residuals, L1/L2 built in, best for tabular
- ▸LightGBM → leaf-wise boosting, histogram splits, 10–100× faster on large data
- ▸SVM → maximum margin hyperplane, kernel trick, best for small high-D data
Regularization
- ▸L1 (Lasso) → sparse weights, feature selection, diamond constraint
- ▸L2 (Ridge) → small weights, keeps all features, sphere constraint
- ▸Elastic Net → L1 + L2, best for correlated features
- ▸λ tuning → always use cross-validation
- ▸sklearn: C = 1/λ → small C = strong regularization
Bias-Variance Diagnostics
- ▸High train + high test error → High Bias → increase complexity
- ▸Low train + high test error → High Variance → regularize, more data
- ▸Random Forest → fixes variance by averaging
- ▸Boosting → fixes bias by correcting errors
- ▸Regularization → always reduces variance, may increase bias
Feature Scaling Rules
- ▸SVM, Logistic Regression, KNN → ALWAYS scale (StandardScaler)
- ▸Decision Trees, Random Forest, XGBoost → scaling NOT needed
- ▸Skewed distributions → log transform first
- ▸Outliers → use RobustScaler not StandardScaler
When Classical Beats DL
- ▸Tabular data → XGBoost (always try first)
- ▸Small dataset < 10K → Logistic Regression or RF
- ▸Interpretability required → Logistic Regression + SHAP
- ▸Causal inference → linear models only
- ▸Limited compute → tree models on CPU
Cross-Validation Rules
- ▸Always stratify for classification
- ▸K=5 or K=10 are standard — K=10 for small data
- ▸Nested CV when tuning params to avoid optimistic bias
- ▸TimeSeriesSplit for temporal data — never standard K-Fold
- ▸OOB score in Random Forest = free K-Fold equivalent