Chapter Three · northstar_churn.csv

Ten thousand customers,
and a column called churned.

Marcus sets the brief on Monday morning. Off-the-shelf is over — can Sarah train NorthStar's own model on NorthStar's own data? She opens northstar_churn.csv: 10,000 customers, 11 features, one target. By Friday she will hold three new tools: a clean preprocessing pipeline, cross-validated training, and a threshold chosen by the business, not the math.

Feature columns

12%

Positive class rate

Cross-val folds

Sarah ChenCust. Experience Analyst · Jan 2023

Toggle the buttons. Move the slider. Every figure here is live.

The 60-70% rule

Most of an ML week is preprocessing.

Real CSVs are messy. Missing cells, mixed scales, free-text categories. Click the four buttons below to walk a raw NorthStar row through the preprocessing it needs before any model sees it.

northstar_churn.csv · head(5)raw — nothing applied

tier	avg_spend	tenure_m	reviews	churned

II.

The leakage trap

Fit the scaler on the training data only.

The most common silent bug in a supervised week. Mean and standard deviation that include test rows are pretending to know the future. Two orderings, two outcomes.

✕ Wrong · Leak

Fit, then split.

fit StandardScaler on all 10,000 rows

train_test_split(X_scaled, y) · 80 / 20

fit model on training portion

evaluate on test portion — looks great!

The scaler used test-row statistics. At deployment, future test rows do not exist yet. The evaluation has quietly cheated — production accuracy will be lower than the notebook says.

✓ Right · Clean

Split, then fit.

train_test_split(X, y) · 80 / 20 · stratify=y

fit StandardScaler on X_train only

transform X_train and X_test with the same scaler

fit model on X_train · evaluate on X_test — honest

The test set is treated as the future. Better still: wrap everything in a Pipeline so this discipline is enforced automatically — including across cross-validation folds.

# The pattern every supervised model in M3 starts with
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute        import SimpleImputer

prep = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                       ("sc",  StandardScaler())]),         numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"),       categorical_cols),
])

model = Pipeline([("prep", prep), ("clf", LogisticRegression())])
model.fit(X_train, y_train)   # prep is re-fit per CV fold automatically

III.

A discipline

One split is a guess. K folds are an estimate.

A single 80/20 split changes its mind every time you change the random seed. Cross-validation averages across folds so the number on your slide is signal, not coincidence.

Single split, five seeds.

Same model, same data, different random_state. Which one do you put on the slide?

seed 0

train

test

86.4%

seed 1

train

test

89.1%

seed 2

train

test

91.3%

seed 3

train

test

88.7%

seed 4

train

test

89.5%

Range: 86.4 – 91.3% Spread: 4.9 pp Mean ± sd: 89.0 ± 1.6%

A 4.9pp swing between seeds means the test set is too small to trust a single number. One run is a coin flip dressed up as evaluation.

K-fold cross-validation.

Each row gets to be in validation exactly once. Preprocessing is re-fit per fold — never trained on its own validator.

Fold 1

val

87.8%

Fold 2

val

89.5%

Fold 3

val

88.1%

Fold 4

val

89.9%

Fold 5

val

88.4%

5-fold mean: 88.7% Std: ± 0.8 pp Report this — with CI

Tighter spread, more honest. The mean is the headline; the spread tells you how seriously to take it. Pair it with the L02 confidence interval.

IV.

A diagnostic

Accuracy lies. Move the threshold.

On NorthStar's 12% churn rate, a model that predicts "no churn" for everyone scores 88% accuracy — and is useless. The confusion matrix tells the truth. Drag the threshold and watch precision and recall trade places.

Model output · 200 held-out customersthreshold = 0.50

← flag fewer · high precisionflag more · high recall →

Each tick is one customer. Vermillion = will actually churn. Teal = will not. The vertical line is the threshold.

Predicted churn

Predicted stay

Actually churned

—

TP · caught

—

FN · missed

Actually stayed

—

FP · wasted call

—

TN · correct skip

Accuracy

—

Precision

—

Recall

—

A business decision

0.5 is rarely right. Capacity and cost decide.

The retention team can call 80 customers a week. A missed churner costs £240 of lifetime value; a wasted call costs £15 of analyst time. The right threshold is the one that fits both.

Two business inputs

Weekly capacity · 80 calls

20300

FN cost · missed churner · £240

£40£600

FP cost · wasted call · £15

£2£120

Capacity caps flag volume. Cost ratio drives where the boundary should sit — cheap mistakes get more headroom, expensive ones drive the threshold up or down.

Recommended threshold

—

flags / week: — · capacity: —

—

Expected weekly cost · —

VI.

A checklist

Three checks before you ship a threshold.

Skip any of these and you'll either drown the team in low-value flags or quietly burn your held-out evaluation.

What is the capacity?

How many flagged cases can the team actually act on per week? The threshold's job is to fit inside that capacity — not to maximise a metric.

✕ "F1 is highest at threshold 0.32." ✓ "Retention can call 80/week. Threshold 0.62 flags 78."

What is the cost?

A missed churner and a wasted call are rarely equally expensive. The cheaper error gets more headroom; the costly one drives the threshold.

✕ "Default 0.5 — balanced, right?" ✓ "FN = £240, FP = £15. Lower the threshold."

Which data chose it?

If you tuned the threshold to maximise F1 on the held-out test set, you've used that data to make a decision — the evaluation is no longer clean.

✕ "Searched 0.0–1.0 on test — best F1 wins." ✓ "Pick from capacity + cost. Then report metrics."

VII.

Check yourself

Nine questions from the field notebook.

Try each mentally first. Click to flip the card. Numbering matches lesson.md.

▸ Click to reveal subscription_tier · free / basic / premium. Label-encode or one-hot?

It's ordinal, so label encoding is defensible. But one-hot doesn't hurt and is safer — it doesn't force the model to assume free→basic equals basic→premium. For a 3-tier ordinal, either is fine.

▸ Click to reveal avg_review_polarity missing for 30% of customers. Best imputation?

Missingness is informative — non-reviewers behave differently. Add a was_missing_polarity flag and impute (median, or 0). Keep both signals.

▸ Click to reveal You StandardScaler.fit on the full dataset, then split. What's wrong?

You've leaked test-set information into training. The scaler's mean / sd include test rows — rows that, at deployment, won't exist yet. Always: split first, fit the scaler on training only, transform both.

▸ Click to reveal Why stratify=y on imbalanced data?

Without stratification, a random split could give a test set with very few (or very many) positives by luck. Stratifying guarantees train and test mirror the full class proportions, so the evaluation actually means something.

▸ Click to reveal 80/20 split · seed 0 → 87%. Seed 7 → 91%. Which number do you report?

Neither. A 4-point swing between seeds means a single split is too noisy. Run 5-fold cross-validation and report mean ± std.

▸ Click to reveal cross_val_score(pipeline, X_train, y_train) — does sklearn re-fit preprocessing per fold?

Yes — precisely why preprocessing must live inside the pipeline. Each fold becomes its own train/val split; the scaler, imputer, encoder are re-fit on the fold's training portion only. Without the pipeline, you'd leak silently.

▸ Click to reveal TP=120, FP=80, FN=30, TN=770. Compute P / R / F1 / Acc.

Precision = 120 / 200 = 0.60
Recall = 120 / 150 = 0.80
F1 = 2⋅0.60⋅0.80 / 1.40 = 0.686
Accuracy = 890 / 1000 = 0.89

▸ Click to reveal Accuracy 0.91 but recall 0.42. Diagnosis?

Class imbalance. The model is mostly predicting "no churn" — right 91% of the time, but missing more than half the actual churners. Accuracy is hiding the problem. Report precision / recall / F1; lower the threshold or train with a class-weighted loss.

▸ Click to reveal "Why don't we just maximise recall — catch every churner?"

At threshold ≈ 0.05 you'd flag almost everyone — recall ≈ 100%, but precision crashes. The retention team would burn capacity on customers who weren't churning, and lose the high-confidence signal. Threshold isn't a metric to maximise — it's a fit between capacity and cost.

VIII.

A roadmap

L03's scaffolding is reused by every later model.

Pipelines, train / validate splits, precision / recall / F1, threshold choice — the rest of the course swaps the algorithm but keeps the discipline. Hover any tile for "how L03 shows up here".

L01Intro to MLTwo weeks back. The 7-step workflow you're now living in.

L02Prob & StatsLast week. The CI you wrap around every metric here.

L03SupervisedPipelines · CV · metrics · threshold. You are here.

L03L04Sup. AdvancedSame scaffold — the classifier becomes a forest, then a boosted tree.

L03L05UnsupervisedNo labels, no precision/recall — but the no-leakage preprocessing rules are identical.

L03L06Time SeriesTrain / validate becomes train / forecast — the split respects time, the discipline does not change.

L03L07Neural NetsThe fit becomes a training loop — pipeline / metric / threshold stack is unchanged.

L03L08VisionImage classifiers report the same precision / recall / F1 — threshold becomes "operating point".

L03L09NLPprecision@k · recall@k = top-k generalisation of L03 metrics.

L03L10GenAILLM outputs vs. held-out evals — same train / test discipline. Threshold = "which variant ships".

Ten thousand customers,and a column called churned.

Most of an ML week is preprocessing.

Fit the scaler on the training data only.

Fit, then split.

Split, then fit.

One split is a guess. K folds are an estimate.

Single split, five seeds.

K-fold cross-validation.

Accuracy lies. Move the threshold.

0.5 is rarely right. Capacity and cost decide.

Two business inputs

Three checks before you ship a threshold.

What is the capacity?

What is the cost?

Which data chose it?

Nine questions from the field notebook.

L03's scaffolding is reused by every later model.

Ten thousand customers,
and a column called churned.