How Sure Are We? · L02
Interactive · Field Notes
Chapter Two · The Friday afternoon question

"But how sure are we?"

Priya is reading Sarah's slide deck. The model says 60% of reviews are positive — a clean number on a clean chart. She squints. But are the positive ones actually positive? How do we know we can trust that number? Sarah doesn't have an answer. This lesson is how she gets one: three lenses for honest numbers — distributions, confidence intervals, hypothesis tests.

3 · Lenses to add
95% · Confidence by default
n ≥ 30 · CLT rule of thumb

Scroll. Click. Drag. Every figure here is live.
I.
A first lens

The number lies. The shape tells the truth.

A mean is a one-line story written by the data. Sometimes the data has a different story to tell. Toggle the shape below — watch the mean (red) and the median (teal) part ways.

Quick reminders: The mean is the arithmetic average — sum of all values divided by the count. The median is the middle value when values are sorted from lowest to highest. The standard deviation measures how spread out values are around the mean — a small SD means values cluster tightly; a large SD means they are widely spread. When data is skewed, mean and median diverge — and that gap is the story.
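A minimal sketch of that gap, using Python's standard `statistics` module. The order amounts are hypothetical, chosen only so that one large value drags the mean away from the median.

```python
from statistics import mean, median, pstdev

# Hypothetical order amounts in pounds: mostly small, plus one large corporate order.
orders = [20, 25, 30, 35, 40, 45, 50, 60, 75, 900]

avg = mean(orders)       # arithmetic average: sum of values / count
mid = median(orders)     # middle value of the sorted list
spread = pstdev(orders)  # population standard deviation around the mean

print(f"mean   = £{avg:.2f}")
print(f"median = £{mid:.2f}")
print(f"st.dev = £{spread:.2f}")
print(f"gap (mean - median) = £{avg - mid:.2f}")
```

Drop the 900 from the list and the mean falls back towards the median, which is the toggle above in miniature.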
II.
Extension topic · A second lens

Z-scores. The universal anomaly meter.

The z-score z = (value − μ) / σ normalises "is this unusual?" across any scale. Here μ (mu) is the population mean and σ (sigma) is the population standard deviation. The working rule: |z| ≥ 3 is an outlier worth checking.

In a normal distribution, only about 0.27% of observations sit further than 3σ from the mean — roughly one day per year of business. Use the bands below as a guide.

|z| < 1: Within 1σ. About 68% of values live here. Entirely normal.
1 ≤ |z| < 2: Mild deviation. About 27% of values. Worth a glance, not an alert.
2 ≤ |z| < 3: Suspicious. Only ~4.3% of values land here. Investigate the source.
|z| ≥ 3: Outlier. About 0.27% of values. Worth checking immediately.
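The bands can be sketched in a few lines of Python. The function names (`z_score`, `band`, `share_between`) are illustrative, not from the lesson, and the example figures are a daily rate with μ = 3.2% and σ = 0.8% observed at 5.6%, written in basis points so the division lands exactly on the band edge.

```python
from statistics import NormalDist

def z_score(value, mu, sigma):
    """Distance of value from the mean mu, in units of sigma."""
    return (value - mu) / sigma

def band(z):
    """Map a z-score onto the four bands listed above."""
    size = abs(z)
    if size < 1:
        return "within 1 sigma"
    if size < 2:
        return "mild deviation"
    if size < 3:
        return "suspicious"
    return "outlier"

def share_between(lo, hi):
    """Share of a normal distribution with lo <= |z| < hi."""
    std = NormalDist()  # standard normal: mean 0, sigma 1
    return 2 * (std.cdf(hi) - std.cdf(lo))

# Illustrative daily return rate in basis points (3.2% mean, 0.8% sigma, 5.6% today).
# Integer basis points keep the arithmetic exact at the band edge.
z = z_score(560, 320, 80)
print(f"z = {z:.1f} -> {band(z)}")
print(f"share with 2 <= |z| < 3: {share_between(2, 3):.1%}")
print(f"share with |z| >= 3:     {2 * (1 - NormalDist().cdf(3)):.2%}")
```

The percentages only hold if the data are roughly normal. On skewed data the bands still rank how unusual a value is, but the shares will differ.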

III.
Extension topic · A third lens

The Central Limit Theorem — the engine.

Even when the raw data is messy — skewed, bimodal, or irregular — the distribution of sample means converges to a bell curve as sample size grows. That is why statistics works on real-world data. Draw samples below and watch it happen.

Why does this matter? Once sample means follow a normal distribution, you can calculate standard errors, build confidence intervals, and run hypothesis tests — even when the raw data is far from normal. This is the theoretical foundation that makes the tools in Sections IV and V work.

The raw data · very right-skewed

NorthStar order amounts: most under £100, a long tail to £10,000+. Nothing remotely bell-shaped. We cannot apply normal-distribution formulas to this directly.

The sample means · one dot per draw

After ~30 draws a bell begins to appear. After ~100, it is unmistakable. The CLT is not a metaphor; you are watching it. Once the sample means are approximately normal, we can apply standard error and CI formulas to them.
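The same experiment can be run offline. This is a sketch under stated assumptions: the skewed "order amounts" are simulated log-normal draws (a stand-in, not NorthStar data), each sample holds 30 orders, and skewness is used as a rough symmetry check where values near 0 mean roughly symmetric.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

def draw_order():
    # Hypothetical stand-in for a right-skewed order amount.
    return rng.lognormvariate(4.0, 1.0)

def skewness(xs):
    """Third standardised moment: 0 for symmetric data, positive for a right tail."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Raw data: individual orders, long right tail.
raw = [draw_order() for _ in range(5000)]

# Sample means: one value per sample of 30 orders.
sample_means = [mean(draw_order() for _ in range(30)) for _ in range(2000)]

print(f"skewness of raw orders:   {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

With data this skewed, means of 30 are visibly more symmetric than the raw orders but not yet a perfect bell. Raising the sample size from 30 pushes the skewness further towards 0.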

IV.
A reporting rule

Every number from a sample owes you a range.

"84% accuracy" is half a fact. "84% (95% CI: 78–89%) on 200 hand-labelled reviews" is the honest version. Slide the sample size below — reducing uncertainty has a price, and that price is roughly: quadruple n to halve the width.

Why? Each additional data point gives the estimate a little more stability. With more observations, the random variation in any one sample matters less, so the range of plausible values narrows. Formally, the margin of error is proportional to 1/√n, which is why doubling n shrinks the interval by only about 30%, not 50%.

Sample size n = 200 · 95% CI: 78–89% · Half-width: ± 5.5pp

Reported accuracy is fixed at 84% for this demo — only sample size changes. The math: half-width ≈ 1.96 × √(p(1−p)/n).
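The half-width formula is easy to check by hand. The sketch below uses the normal-approximation formula quoted above with accuracy fixed at 84%. The function name `half_width` is illustrative. At n = 200 this approximation gives roughly ± 5.1pp, a little narrower than the rounded figures in the demo, which may come from a different interval method.

```python
from math import sqrt

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def half_width(p, n, z=Z_95):
    """Normal-approximation half-width of a CI for a proportion p on n items."""
    return z * sqrt(p * (1 - p) / n)

# Each step quadruples n, so each half-width is half the previous one.
for n in (50, 200, 800, 3200):
    hw = half_width(0.84, n)
    print(f"n = {n:>4}: 84% ± {hw * 100:.1f}pp")

# Doubling n only divides the half-width by sqrt(2), about a 29% reduction.
ratio = half_width(0.84, 400) / half_width(0.84, 200)
print(f"doubling n keeps {ratio:.1%} of the width")
```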

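The 95% is about the method, not this one interval. If you drew 100 independent samples and computed a CI this way from each, ~95 of them would contain the true value. You cannot say this specific interval has a 95% probability of containing it: the true value is fixed, and this interval either contains it or it does not.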
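One way to see "about the method" is to simulate it. This sketch assumes a true accuracy of 84% (knowable only because we are simulating), draws many test sets of 200 reviews, builds a normal-approximation interval from each, and counts how often the interval contains the truth. All names and constants are illustrative.

```python
import random
from math import sqrt

rng = random.Random(7)  # fixed seed so the run is repeatable

TRUE_ACCURACY = 0.84  # assumed ground truth for the simulation
N = 200               # reviews per simulated test set
TRIALS = 2000         # number of repeated test sets

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% interval around an observed proportion."""
    hw = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - hw, p_hat + hw

hits = 0
for _ in range(TRIALS):
    # Each review is classified correctly with probability TRUE_ACCURACY.
    correct = sum(rng.random() < TRUE_ACCURACY for _ in range(N))
    lo, hi = wald_interval(correct / N, N)
    hits += lo <= TRUE_ACCURACY <= hi

coverage = hits / TRIALS
print(f"{coverage:.1%} of {TRIALS} intervals contained the true accuracy")
```

Each individual interval either caught 84% or missed it. The "95%" describes the long-run hit rate of the procedure, and the approximation used here tends to land close to, rather than exactly on, that figure.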

V.
A decision lens

Hypothesis testing. p-values are half the story.

A p-value answers "if there were no real effect, how often would chance alone produce a difference at least this large?" But statistical significance is not practical significance. Slide the two dimensions and read the quadrant. The coupon experiment: H₀ = no effect on 30-day repurchase rate.

p-value slider: 0.001 (rock-solid) to 0.500 (random)
Effect-size slider: 0.01 pp (tiny) to 10 pp (huge)

With a large enough sample, a 0.01 pp difference can become "significant." Always report effect size alongside p, and let the business decide whether the effect is worth shipping.
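To make that concrete, here is a sketch of a two-proportion z-test (a common choice for comparing two rates, not necessarily the test the lesson's demo uses). The counts are hypothetical: a 20% baseline repurchase rate and a 0.01 pp lift, once with 10,000 users per arm and once with a billion.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Return (effect, z, two-sided p) for rate B minus rate A."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)  # shared rate under H0: no effect
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

# Same 0.01 pp lift, two very different sample sizes (hypothetical counts).
diff_small, z_small, p_small = two_proportion_z_test(2_000, 10_000, 2_001, 10_000)
diff_large, z_large, p_large = two_proportion_z_test(
    200_000_000, 1_000_000_000, 200_100_000, 1_000_000_000
)

print(f"10k per arm: effect = {diff_small * 100:.2f} pp, p = {p_small:.3f}")
print(f"1bn per arm: effect = {diff_large * 100:.2f} pp, p = {p_large:.2g}")
```

The effect is identical in both runs; only n changed. That is why the p-value alone cannot tell you whether the lift is worth shipping.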

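Verdict quadrants (effect size × significance):
SIG · huge: significant, huge effect
NOISE · huge?: not significant, huge effect
SIG · tiny: significant, tiny effect
NULL: not significant, tiny effect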
VI.
A checklist

Three checks before you cite a number.

Skip any one of these and you have confident-sounding nonsense — the most expensive kind of mistake in business.

1. What is the shape?

If the distribution is skewed, the mean misleads. Report the median, segment the data, or be explicit about which summary you chose — and why.

✕ "Average spend is £85."
✓ "Median spend is £42 (mean £85 is pulled up by a corporate-account long tail)."

2. What is the range?

Every number from a sample needs a confidence interval. The point estimate alone is dishonest by omission.

✕ "Model accuracy is 84%."
✓ "84% (95% CI: 78–89%) on 200 hand-labelled reviews."

3. Could it be noise?

When you compare two groups, you owe the reader both a p-value and an effect size. Either alone misleads.

✕ "p = 0.001. Ship it."
✓ "p = 0.001, effect = +2.3pp conversion. Worth the dev cost? Yes."
VII.
Check yourself

Ten questions from the field notebook.

Try each one mentally first, then read the sample answer. Numbering matches lesson.md.

1. "Average spend per order is £85", but the distribution is right-skewed. Is that the right number to report?
No. A few large corporate orders are pulling the mean up. Report the median instead (resistant to outliers). Better still: segment — report individual-customer median separately from corporate-account median, with a sentence on why they differ.
2. Daily return rate μ=3.2%, σ=0.8%. Today is 5.6%. Anomaly?
z = (5.6 − 3.2) / 0.8 = 3.0. In a normal distribution, only ~0.27% of days fall this far out by chance — about one day per year. Worth investigating: possible product defect, batch quality issue, or data error.
3. Fraud-detection training data is extremely right-skewed. Why is that a problem, and what helps?
One or two huge values can dominate the loss function and distort learning. Common fixes: log transform the amount (more symmetric) or winsorise (cap at 99th percentile). Either reduces the long tail's influence without throwing data away.
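Both fixes fit in a few lines. This is a sketch on hypothetical transaction amounts. `winsorise_upper` caps only the upper tail at a chosen percentile, and `log_transform` uses `log1p` so that an amount of 0 would still be valid. Neither function name comes from the lesson.

```python
import math
from statistics import mean, quantiles

# Hypothetical transaction amounts: 500 everyday values plus three huge ones.
amounts = [12, 18, 25, 30, 42, 55, 60, 75, 90, 120] * 50 + [5_000, 12_000, 40_000]

def winsorise_upper(values, pct=99):
    """Cap every value at the pct-th percentile; nothing is dropped."""
    cap = quantiles(values, n=100)[pct - 1]  # 99 cut points, index 98 = 99th
    return [min(v, cap) for v in values]

def log_transform(values):
    """Compress the right tail with log(1 + x)."""
    return [math.log1p(v) for v in values]

capped = winsorise_upper(amounts)
logged = log_transform(amounts)

print(f"raw:        max = {max(amounts):>8.1f}, mean = {mean(amounts):.1f}")
print(f"winsorised: max = {max(capped):>8.1f}, mean = {mean(capped):.1f}")
print(f"log1p:      max = {max(logged):>8.2f}, mean = {mean(logged):.2f}")
```

Winsorising keeps the original units but flattens everything above the cap to one value. The log transform keeps the ordering of large amounts while shrinking the distance between them.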
4. Model A: 80% acc on 50 reviews, CI 68–92%. Model B: 80% acc on 2,000, CI 78–82%. Trust which?
Model B. Not because accuracy differs (both 80%) but because the CI is much tighter. The first spans 24pp — both "barely acceptable" and "excellent" are inside. The second is tight enough to act on. Sample size is the only difference.
5. "The CLT only works if the underlying data is normal." True?
No, that is backwards. The CLT says the distribution of sample means becomes approximately normal regardless of the underlying shape, provided the variance is finite. The raw data can be skewed, bimodal, ugly. As long as n is large enough (rule of thumb n ≥ 30; heavily skewed data may need more), the distribution of sample means is approximately normal. That is what makes the CLT powerful.
6. "If we ran the test again, would we get 84% again?" How should Sarah answer?
Probably not exactly. A fresh sample of 200 would give a somewhat different figure, typically within a few points of 84%. The CI (78–89%) is the range of true accuracies consistent with this sample: if Sarah repeated the whole procedure many times, about 95 of 100 such CIs would contain the true accuracy. The 84% is the best estimate from this particular sample.
7. Design a hypothesis test for the "easy returns" banner. State H₀, H₁, the outcome metric, and what makes the result actionable.
H₀: the banner has no effect on the 30-day return rate.
H₁: the banner reduces the 30-day return rate.
Control: no banner. Treatment: randomly assigned users see the banner.
Outcome metric: 30-day return rate (one metric, not three at once).
Actionable if: p < 0.05 and the effect size justifies the build cost. A 0.1pp drop and a 3pp drop have very different business implications.
8. p = 0.001. "Ship it!" What question must you ask first?
"What is the actual size of the effect?" p = 0.001 on a huge sample might be a 0.01% improvement — statistically rock-solid, practically worthless. Statistical significance is not business significance.
9. p = 0.12. "The null is confirmed: no effect." Correct the interpretation.
p = 0.12 means we failed to reject the null, which is not the same as proving no effect. The sample may have been too small, or the effect too small to detect at this n. Correct: "Not statistically significant at the 0.05 level; we cannot conclude the intervention does nothing, only that this test did not detect an effect."
10. Three numbers to interpret for a non-technical executive: (a) 60% positive (b) 84% acc, CI 78–89% (c) p < 0.01, effect = +7.5pp repurchase rate.
(a) "In this week's 10,000 reviews, 60% classified positive — treat as an estimate, not a precise count; the model occasionally misclassifies borderline cases."
(b) "Against 200 hand-labelled reviews, the model was right 84% of the time; the true accuracy is most likely between 78% and 89%."
(c) "Strong statistical evidence that the apology coupon increases 30-day repurchase rates by approximately 7.5 percentage points — results this large would occur by chance less than 1% of the time if the coupon had no effect. Whether the coupon cost is justified by the increase in returning customers is a business decision."
VIII.
A roadmap

L02 is the lens every later lesson is read through.

Whenever you report a model number, claim a difference, or detect an anomaly, for the rest of the course, you are using L02 tools. Each tile below carries the one-line "how L02 shows up here".

L01 · Intro to ML: Last week. The vocabulary; the 7-step workflow.
L02 · Prob & Stats: Distributions · CIs · A/B testing. You are here.
L03 · Supervised: Every accuracy / precision / recall is a sample statistic, so CIs everywhere.
L04 · Sup. Advanced: Hyperparam tuning: did model A really beat B, or just a lucky split?
L05 · Unsupervised: Anomaly detection is a Z-score generalisation; cluster metrics need CIs.
L06 · Time Series: Forecast intervals = CIs projected forward.
L07 · Neural Nets: Validation curves are noisy samples, so CIs decide architecture winners.
L08 · Vision: Test-set accuracy on images is a sample proportion, so the same math applies.
L09 · NLP: precision@k, recall@k are sampled and need CIs.
L10 · GenAI: Hypothesis testing of prompts is the only rigorous way to claim an LLM pipeline improved.