L02 — How Sure Are We? · Key Concepts

II.

Extension topic · A second lens

Z-scores. The universal anomaly meter.

z = (value − μ) / σ, where μ (mu) is the population mean and σ (sigma) is the population standard deviation, normalises "is this unusual?" across any scale. The working rule: |z| ≥ 3 is an outlier worth checking.

In a normal distribution, only about 0.27% of observations sit further than 3σ from the mean. For a metric measured daily, that is roughly one day per year. Use the bands below as a guide.

|z| < 1	Within 1σ: about 68% of values live here. Entirely normal.
1 ≤ |z| < 2	Mild deviation: worth a glance, not an alert.
2 ≤ |z| < 3	Suspicious: only ~4.3% of values land here. Investigate the source.
|z| ≥ 3	Outlier: only ~0.27% of values. Worth checking immediately.
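
The formula and the bands above can be sketched in a few lines of Python. This is a minimal illustration, not a production anomaly detector: the function names, the band labels, and the daily figures are assumptions made for the example, and it presumes μ and σ are already known.

```python
def z_score(value, mu, sigma):
    """Distance of value from the mean, measured in standard deviations."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (value - mu) / sigma


def z_band(z):
    """Map a z-score to one of the four bands in the table above."""
    # Rounding is an assumption added for this sketch: it stops floating-point
    # noise from pushing a value that sits exactly on a band edge into the
    # band below it.
    a = round(abs(z), 9)
    if a < 1:
        return "normal"
    if a < 2:
        return "mild deviation"
    if a < 3:
        return "suspicious"
    return "outlier"


# Hypothetical daily figures with mean 3.2 and standard deviation 0.8.
for today in (3.5, 4.4, 5.2, 6.0):
    z = z_score(today, mu=3.2, sigma=0.8)
    print(f"value {today}: z = {z:+.2f} -> {z_band(z)}")
```

The sign of z tells you the direction of the deviation; the bands only use its magnitude.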

III.

Extension topic · A third lens

The Central Limit Theorem — the engine.

Even when the raw data is messy — skewed, bimodal, or irregular — the distribution of sample means converges to a bell curve as sample size grows. That is why statistics works on real-world data. Draw samples below and watch it happen.

Why does this matter? Once sample means follow a normal distribution, you can calculate standard errors, build confidence intervals, and run hypothesis tests — even when the raw data is far from normal. This is the theoretical foundation that makes the tools in Sections IV and V work.

The raw data · very right-skewed

NorthStar order amounts: most under £100, a long tail to £10,000+. Nothing remotely bell-shaped. We cannot apply normal-distribution formulas to this directly.

The sample means · one dot per draw

After ~30 draws a bell begins to appear. After ~100, it is unmistakable. The CLT is not metaphor — you are watching it. Once the sample means are normal, we can apply standard error and CI formulas to them.
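
The same experiment can be approximated offline. The sketch below is an assumption-laden stand-in for the demo: the log-normal "order amounts" are synthetic, not NorthStar data, and the sample size of 50 and the 2,000 draws are arbitrary choices. It compares the skewness of the raw values with the skewness of the sample means.

```python
import math
import random
import statistics


def sample_means(population, n, draws, rng):
    """Means of `draws` random samples of size n, drawn with replacement."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]


def skewness(xs):
    """Third standardised moment: about 0 for a symmetric distribution."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])


rng = random.Random(42)

# Synthetic, right-skewed stand-in for order amounts (hypothetical parameters).
orders = [rng.lognormvariate(3.7, 1.0) for _ in range(20_000)]
means = sample_means(orders, n=50, draws=2_000, rng=rng)

print("skewness of raw orders:  ", round(skewness(orders), 2))
print("skewness of sample means:", round(skewness(means), 2))

# The spread of the sample means should be close to sigma / sqrt(n).
predicted = statistics.pstdev(orders) / math.sqrt(50)
print("observed spread of means:", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):         ", round(predicted, 2))
```

The raw values stay heavily skewed while the sample means are far more symmetric, and their spread tracks σ/√n, which is the standard error used in the next section.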

IV.

A reporting rule

Every number from a sample owes you a range.

"84% accuracy" is half a fact. "84% (95% CI: 78–89%) on 200 hand-labelled reviews" is the honest version. Slide the sample size below — reducing uncertainty has a price, and that price is roughly: quadruple n to halve the width.

Why? Each additional data point gives the estimate a little more stability. With more observations, the random variation in any one sample matters less, so the range of plausible values narrows. Formally, the margin of error is proportional to 1/√n, which is why doubling n only shrinks the interval by about 30%, not 50%.

Sample size: 200 · 95% CI: 78–89% · Half-width: ± 5.5pp

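Reported accuracy is fixed at 84% for this demo; only sample size changes. The math: half-width ≈ 1.96 × √(p(1−p)/n). At n = 200 this normal approximation gives about ± 5.1pp, slightly narrower than the rounded 78–89% interval quoted above.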
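
The formula can be tried for several sample sizes without the slider. A minimal sketch, assuming the normal-approximation (Wald) interval from the formula above; the function name `half_width` and the chosen values of n are illustrative.

```python
import math


def half_width(p, n, z=1.96):
    """Normal-approximation half-width of a 95% CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


p = 0.84  # reported accuracy, held fixed as in the demo

for n in (50, 100, 200, 400, 800):
    print(f"n = {n:4d}   half-width = ±{100 * half_width(p, n):.1f}pp")

# Quadrupling n halves the width; doubling n only multiplies it by 1/sqrt(2).
print("800 vs 200:", half_width(p, 800) / half_width(p, 200))
print("400 vs 200:", half_width(p, 400) / half_width(p, 200))
```

Because n sits under a square root, each further halving of the width costs four times as many labelled examples as the last.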

The 95% is about the method, not this one interval. If you computed CIs this way 100 times, ~95 of them would contain the true value. You cannot say this specific interval has a 95% probability of containing the true value: it either contains it or it does not.
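
"About the method" can be checked by simulation. The sketch below assumes a true accuracy of 84%, which is knowable only because this is a simulation, repeats the 200-review experiment many times, and counts how often the normal-approximation interval contains that true value. The function names and the number of repeats are illustrative.

```python
import math
import random


def wald_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a proportion."""
    p_hat = successes / n
    hw = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - hw, p_hat + hw


def coverage(true_p, n, repeats, rng):
    """Share of simulated experiments whose CI contains true_p."""
    hits = 0
    for _ in range(repeats):
        # One experiment: n reviews, each classified correctly with prob. true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wald_ci(successes, n)
        if low <= true_p <= high:
            hits += 1
    return hits / repeats


rng = random.Random(7)
rate = coverage(true_p=0.84, n=200, repeats=2_000, rng=rng)
print(f"share of intervals containing the true value: {rate:.3f}")
```

Each individual interval either contains 0.84 or misses it; the figure near 95% only emerges across the whole batch, and with this simple approximation it typically lands a little below the nominal level.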

VI.

A checklist

Three checks before you cite a number.

Skip any one of these and you have confident-sounding nonsense — the most expensive kind of mistake in business.

What is the shape?

If the distribution is skewed, the mean misleads. Report the median, segment the data, or be explicit about which summary you chose — and why.

✕ "Average spend is £85." ✓ "Median spend is £42 (mean £85 is pulled up by a corporate-account long tail)."
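
A minimal sketch of how a long tail separates the two summaries. The order amounts are hypothetical, chosen so that the figures match the example above.

```python
import statistics

# Hypothetical order amounts in £: ten everyday orders plus two corporate ones.
orders = [18, 25, 30, 35, 40, 41, 43, 50, 60, 75, 203, 400]

mean_all = statistics.fmean(orders)
median_all = statistics.median(orders)
print(f"mean £{mean_all:.0f}, median £{median_all:.0f}")

# Segmenting: drop the two corporate orders and the mean falls back in line.
retail = [amount for amount in orders if amount < 200]
print(f"retail-only mean £{statistics.fmean(retail):.2f}")
```

Two orders out of twelve are enough to double the mean, while the median barely notices them.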

What is the range?

Every number from a sample needs a confidence interval. The point estimate alone is dishonest by omission.

✕ "Model accuracy is 84%." ✓ "84% (95% CI: 78–89%) on 200 hand-labelled reviews."

Could it be noise?

When you compare two groups, you owe the reader both a p-value and an effect size. Either alone misleads.

✕ "p = 0.001 — ship it." ✓ "p = 0.001, effect = +2.3pp conversion. Worth the dev cost? Yes."
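
One common way to produce both numbers for a conversion comparison is a two-proportion z-test. The sketch below is an assumption: the source does not say which test was used, and all counts are hypothetical. It returns the effect size alongside the p-value so neither is reported alone.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in rates. Returns (effect, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, p_value


# Hypothetical test 1: a moderate sample and a sizeable lift.
effect, p_value = two_proportion_test(conv_a=400, n_a=4_000, conv_b=492, n_b=4_000)
print(f"effect = {100 * effect:+.2f}pp, p = {p_value:.4f}")

# Hypothetical test 2: a huge sample and a tiny lift, still "significant".
tiny_effect, tiny_p = two_proportion_test(500_000, 5_000_000, 502_500, 5_000_000)
print(f"effect = {100 * tiny_effect:+.2f}pp, p = {tiny_p:.4f}")
```

Both tests clear p < 0.05, but only the first moves the metric by an amount that could plausibly pay for the development work.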

VII.

Check yourself

Ten questions from the field notebook.

Try each one mentally first, then read the sample answer. Numbering matches lesson.md.

"Average spend per order is £85", but the distribution is right-skewed. Right number to report?

No. A few large corporate orders are pulling the mean up. Report the median instead (resistant to outliers). Better still: segment — report individual-customer median separately from corporate-account median, with a sentence on why they differ.

Daily return rate μ=3.2%, σ=0.8%. Today is 5.6%. Anomaly?

z = (5.6 − 3.2) / 0.8 = 3.0. In a normal distribution, only ~0.27% of days fall this far out by chance — about one day per year. Worth investigating: possible product defect, batch quality issue, or data error.

Fraud-detection training data is extremely right-skewed. Why is that a problem, and what helps?

One or two huge values can dominate the loss function and distort learning. Common fixes: log transform the amount (more symmetric) or winsorise (cap at 99th percentile). Either reduces the long tail's influence without throwing data away.
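
Both fixes fit in a few lines. A minimal sketch with hypothetical transaction amounts; capping only the upper tail at the 99th percentile and using log(1 + x) are choices made for this example, not the only options.

```python
import math
import statistics


def winsorise(values, upper_pct=99):
    """Cap every value above the given percentile (upper tail only)."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    cap = cut_points[upper_pct - 1]  # cut_points[98] is the 99th percentile
    return [min(v, cap) for v in values]


def log_transform(values):
    """log(1 + x) compresses the long tail; assumes non-negative amounts."""
    return [math.log1p(v) for v in values]


# Hypothetical amounts: 99 ordinary transactions and one enormous one.
amounts = list(range(1, 100)) + [50_000]

capped = winsorise(amounts)
logged = log_transform(amounts)

print("max before:           ", max(amounts))
print("max after winsorising:", max(capped))
print("max after log1p:      ", round(max(logged), 2))
```

Neither approach drops a row: winsorising changes only the extreme values, and the log transform preserves the ordering of every transaction.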

Model A: 80% acc on 50 reviews, CI 68–92%. Model B: 80% acc on 2,000, CI 78–82%. Trust which?

Model B. Not because accuracy differs (both 80%) but because the CI is much tighter. The first spans 24pp — both "barely acceptable" and "excellent" are inside. The second is tight enough to act on. Sample size is the only difference.

"The CLT only works if the underlying data is normal." True?

No, that is backwards. The CLT says sample means become approximately normal regardless of the underlying shape, provided the variance is finite. The raw data can be skewed, bimodal, ugly. As long as n is large enough (rule of thumb n ≥ 30; heavily skewed data may need more), the distribution of sample means is approximately normal. That is what makes the CLT powerful.

"If we ran the test again, would we get 84% again?" How should Sarah answer?

Probably not exactly. Another sample of 200 would give a somewhat different figure. The CI (78–89%) is Sarah's way of stating the range of true accuracies consistent with this sample: about 95 of 100 intervals built this way would contain the true accuracy. The 84% is the best estimate from this particular sample.

Design a hypothesis test for the "easy returns" banner. State H₀, H₁, the outcome metric, and what makes the result actionable.

H₀: the banner has no effect on the 30-day return rate. H₁: the banner reduces the 30-day return rate. Control: no banner. Treatment: a randomly assigned group of users sees the banner. Outcome metric: 30-day return rate (one metric, not three at once). Actionable if: p < 0.05 and the effect size justifies the build cost. A 0.1pp drop and a 3pp drop have very different business implications.

p = 0.001. "Ship it!" What question must you ask first?

"What is the actual size of the effect?" p = 0.001 on a huge sample might be a 0.01% improvement — statistically rock-solid, practically worthless. Statistical significance is not business significance.

p = 0.12. "The null is confirmed: no effect." Correct the interpretation.

p = 0.12 means we failed to reject the null, which is not the same as proving no effect. The sample may have been too small, or the effect too small to detect at this n. Correct: "Not statistically significant at the 0.05 level; we cannot conclude the intervention does nothing, only that this test did not detect an effect and may have been underpowered."

Three numbers to interpret for a non-technical executive: (a) 60% positive (b) 84% acc, CI 78–89% (c) p < 0.01, effect = +7.5pp repurchase rate.

(a) "In this week's 10,000 reviews, 60% classified positive — treat as an estimate, not a precise count; the model occasionally misclassifies borderline cases."
(b) "Against 200 hand-labelled reviews, the model was right 84% of the time; the true accuracy is most likely between 78% and 89%."
(c) "Strong statistical evidence that the apology coupon increases 30-day repurchase rates by approximately 7.5 percentage points — results this large would occur by chance less than 1% of the time if the coupon had no effect. Whether the coupon cost is justified by the increase in returning customers is a business decision."

VIII.

A roadmap

L02 is the lens every later lesson is read through.

Whenever you report a model number, claim a difference, or detect an anomaly, for the rest of the course, you are using L02 tools. Each tile below gives the one-line "how L02 shows up here".

L01 · Intro to ML: Last week. The vocabulary; the 7-step workflow.

L02 · Prob & Stats: Distributions · CIs · A/B testing. You are here.

L03 · Supervised: Every accuracy / precision / recall is a sample statistic, so CIs are everywhere.

L04 · Sup. Advanced: Hyperparam tuning: did model A really beat B, or just a lucky split?

L05 · Unsupervised: Anomaly detection is a Z-score generalisation; cluster metrics need CIs.

L06 · Time Series: Forecast intervals = CIs projected forward.

L07 · Neural Nets: Validation curves are noisy samples, so CIs decide architecture winners.

L08 · Vision: Test-set accuracy on images is a sample proportion, so the same math applies.

L09 · NLP: precision@k, recall@k are sampled and need CIs.

L10 · GenAI: Hypothesis testing of prompts is the only rigorous way to claim an LLM pipeline improved.

"But how sure are we?"

The number lies. The shape tells the truth.