The First Model We OwnL03
Interactive · Field Notes
Chapter Three · northstar_churn.csv

Ten thousand customers,
and a column called churned.

Marcus sets the brief on Monday morning. Off-the-shelf is over — can Sarah train NorthStar's own model on NorthStar's own data? She opens northstar_churn.csv: 10,000 customers, 11 features, one target. By Friday she will hold three new tools: a clean preprocessing pipeline, cross-validated training, and a threshold chosen by the business, not the math.

11
Feature columns
12%
Positive class rate
5
Cross-val folds

Toggle the buttons. Move the slider. Every figure here is live.
I.
The 60-70% rule

Most of an ML week is preprocessing.

Real CSVs are messy. Missing cells, mixed scales, free-text categories. Click the four buttons below to walk a raw NorthStar row through the preprocessing it needs before any model sees it.

northstar_churn.csv · head(5)raw — nothing applied
tieravg_spendtenure_mreviewschurned
II.
The leakage trap

Fit the scaler on the training data only.

The most common silent bug in a supervised week. Mean and standard deviation that include test rows are pretending to know the future. Two orderings, two outcomes.

✕ Wrong · Leak

Fit, then split.

1
fit StandardScaler on all 10,000 rows
2
train_test_split(X_scaled, y) · 80 / 20
3
fit model on training portion
4
evaluate on test portion — looks great!

The scaler used test-row statistics. At deployment, future test rows do not exist yet. The evaluation has quietly cheated — production accuracy will be lower than the notebook says.

✓ Right · Clean

Split, then fit.

1
train_test_split(X, y) · 80 / 20 · stratify=y
2
fit StandardScaler on X_train only
3
transform X_train and X_test with the same scaler
4
fit model on X_train · evaluate on X_test — honest

The test set is treated as the future. Better still: wrap everything in a Pipeline so this discipline is enforced automatically — including across cross-validation folds.

# The pattern every supervised model in M3 starts with
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute        import SimpleImputer

prep = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                       ("sc",  StandardScaler())]),         numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"),       categorical_cols),
])

model = Pipeline([("prep", prep), ("clf", LogisticRegression())])
model.fit(X_train, y_train)   # prep is re-fit per CV fold automatically
III.
A discipline

One split is a guess. K folds are an estimate.

A single 80/20 split changes its mind every time you change the random seed. Cross-validation averages across folds so the number on your slide is signal, not coincidence.

Single split, five seeds.

Same model, same data, different random_state. Which one do you put on the slide?

seed 0
train
test
86.4%
seed 1
train
test
89.1%
seed 2
train
test
91.3%
seed 3
train
test
88.7%
seed 4
train
test
89.5%
Range: 86.4 – 91.3% Spread: 4.9 pp Mean ± sd: 89.0 ± 1.6%

A 4.9pp swing between seeds means the test set is too small to trust a single number. One run is a coin flip dressed up as evaluation.

K-fold cross-validation.

Each row gets to be in validation exactly once. Preprocessing is re-fit per fold — never trained on its own validator.

Fold 1
val
tr
tr
tr
tr
87.8%
Fold 2
tr
val
tr
tr
tr
89.5%
Fold 3
tr
tr
val
tr
tr
88.1%
Fold 4
tr
tr
tr
val
tr
89.9%
Fold 5
tr
tr
tr
tr
val
88.4%
5-fold mean: 88.7% Std: ± 0.8 pp Report this — with CI

Tighter spread, more honest. The mean is the headline; the spread tells you how seriously to take it. Pair it with the L02 confidence interval.

IV.
A diagnostic

Accuracy lies. Move the threshold.

On NorthStar's 12% churn rate, a model that predicts "no churn" for everyone scores 88% accuracy — and is useless. The confusion matrix tells the truth. Drag the threshold and watch precision and recall trade places.

Model output · 200 held-out customersthreshold = 0.50
← flag fewer · high precisionflag more · high recall →

Each tick is one customer. Vermillion = will actually churn. Teal = will not. The vertical line is the threshold.

Predicted churn
Predicted stay
Actually churned
TP · caught
FN · missed
Actually stayed
FP · wasted call
TN · correct skip
Accuracy
Precision
Recall
F1
V.
A business decision

0.5 is rarely right. Capacity and cost decide.

The retention team can call 80 customers a week. A missed churner costs £240 of lifetime value; a wasted call costs £15 of analyst time. The right threshold is the one that fits both.

Two business inputs

20300
£40£600
£2£120

Capacity caps flag volume. Cost ratio drives where the boundary should sit — cheap mistakes get more headroom, expensive ones drive the threshold up or down.

Recommended threshold
flags / week: · capacity:
Expected weekly cost ·
VI.
A checklist

Three checks before you ship a threshold.

Skip any of these and you'll either drown the team in low-value flags or quietly burn your held-out evaluation.

1

What is the capacity?

How many flagged cases can the team actually act on per week? The threshold's job is to fit inside that capacity — not to maximise a metric.

✕ "F1 is highest at threshold 0.32." ✓ "Retention can call 80/week. Threshold 0.62 flags 78."
2

What is the cost?

A missed churner and a wasted call are rarely equally expensive. The cheaper error gets more headroom; the costly one drives the threshold.

✕ "Default 0.5 — balanced, right?" ✓ "FN = £240, FP = £15. Lower the threshold."
3

Which data chose it?

If you tuned the threshold to maximise F1 on the held-out test set, you've used that data to make a decision — the evaluation is no longer clean.

✕ "Searched 0.0–1.0 on test — best F1 wins." ✓ "Pick from capacity + cost. Then report metrics."
VII.
Check yourself

Nine questions from the field notebook.

Try each mentally first. Click to flip the card. Numbering matches lesson.md.

▸ Click to reveal subscription_tier · free / basic / premium. Label-encode or one-hot?
It's ordinal, so label encoding is defensible. But one-hot doesn't hurt and is safer — it doesn't force the model to assume free→basic equals basic→premium. For a 3-tier ordinal, either is fine.
▸ Click to reveal avg_review_polarity missing for 30% of customers. Best imputation?
Missingness is informative — non-reviewers behave differently. Add a was_missing_polarity flag and impute (median, or 0). Keep both signals.
▸ Click to reveal You StandardScaler.fit on the full dataset, then split. What's wrong?
You've leaked test-set information into training. The scaler's mean / sd include test rows — rows that, at deployment, won't exist yet. Always: split first, fit the scaler on training only, transform both.
▸ Click to reveal Why stratify=y on imbalanced data?
Without stratification, a random split could give a test set with very few (or very many) positives by luck. Stratifying guarantees train and test mirror the full class proportions, so the evaluation actually means something.
▸ Click to reveal 80/20 split · seed 0 → 87%. Seed 7 → 91%. Which number do you report?
Neither. A 4-point swing between seeds means a single split is too noisy. Run 5-fold cross-validation and report mean ± std.
▸ Click to reveal cross_val_score(pipeline, X_train, y_train) — does sklearn re-fit preprocessing per fold?
Yes — precisely why preprocessing must live inside the pipeline. Each fold becomes its own train/val split; the scaler, imputer, encoder are re-fit on the fold's training portion only. Without the pipeline, you'd leak silently.
▸ Click to reveal TP=120, FP=80, FN=30, TN=770. Compute P / R / F1 / Acc.
Precision = 120 / 200 = 0.60
Recall = 120 / 150 = 0.80
F1 = 2⋅0.60⋅0.80 / 1.40 = 0.686
Accuracy = 890 / 1000 = 0.89
▸ Click to reveal Accuracy 0.91 but recall 0.42. Diagnosis?
Class imbalance. The model is mostly predicting "no churn" — right 91% of the time, but missing more than half the actual churners. Accuracy is hiding the problem. Report precision / recall / F1; lower the threshold or train with a class-weighted loss.
▸ Click to reveal "Why don't we just maximise recall — catch every churner?"
At threshold ≈ 0.05 you'd flag almost everyone — recall ≈ 100%, but precision crashes. The retention team would burn capacity on customers who weren't churning, and lose the high-confidence signal. Threshold isn't a metric to maximise — it's a fit between capacity and cost.
VIII.
A roadmap

L03's scaffolding is reused by every later model.

Pipelines, train / validate splits, precision / recall / F1, threshold choice — the rest of the course swaps the algorithm but keeps the discipline. Hover any tile for "how L03 shows up here".

L01Intro to MLTwo weeks back. The 7-step workflow you're now living in.
L02Prob & StatsLast week. The CI you wrap around every metric here.
L03SupervisedPipelines · CV · metrics · threshold. You are here.
L03L04Sup. AdvancedSame scaffold — the classifier becomes a forest, then a boosted tree.
L03L05UnsupervisedNo labels, no precision/recall — but the no-leakage preprocessing rules are identical.
L03L06Time SeriesTrain / validate becomes train / forecast — the split respects time, the discipline does not change.
L03L07Neural NetsThe fit becomes a training loop — pipeline / metric / threshold stack is unchanged.
L03L08VisionImage classifiers report the same precision / recall / F1 — threshold becomes "operating point".
L03L09NLPprecision@k · recall@k = top-k generalisation of L03 metrics.
L03L10GenAILLM outputs vs. held-out evals — same train / test discipline. Threshold = "which variant ships".