How well-calibrated are these projections?
When a model says "65% chance," the actual win rate should be 65%. Below is the honest version of how close each model gets: per-model status, the size of any known miscalibration, and the correction already applied. Bin charts land as each sport's resolve-worker exposes a public endpoint.
Per-model status
CBB · Win probability
Instrumented · publishing chart soon
PAV-isotonic calibrator on top of the v2 model, refit weekly. Latest CV Brier 0.22428 across 3,242 pairs. Calibrator's weak point is the 0.10–0.20 predicted-probability bin where training data is thin (n≈23).
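A minimal sketch of that calibration step, assuming scikit-learn's IsotonicRegression (which implements PAV) and synthetic stand-in data; the fold count, seed, and data are illustrative, not the production refit job.

```python
# Hedged sketch: isotonic (PAV) calibration over raw model probabilities,
# scored with cross-validated Brier. All data below is synthetic.
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
raw_probs = rng.uniform(0.05, 0.95, 3242)                     # stand-in for v2 model output
outcomes = (rng.uniform(size=3242) < raw_probs).astype(int)   # synthetic resolved results

brier_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(raw_probs):
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(raw_probs[train_idx], outcomes[train_idx])
    calibrated = calibrator.predict(raw_probs[test_idx])
    brier_scores.append(brier_score_loss(outcomes[test_idx], calibrated))

print(f"CV Brier: {np.mean(brier_scores):.5f}")
```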
Weather · Temperature bands
Instrumented · publishing chart soon
EMOS v1 is mildly overconfident above 30% predicted probability. The live trading layer caps confidence at 0.40 in response. Public charts will show predicted-vs-actual band rates per city and per band.
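The cap itself is a small transform; below is one plausible shape for it, under the assumption that clamped probability mass is redistributed to bands that still have headroom. The function name and redistribution scheme are ours for illustration, not the production trading layer.

```python
# Hedged sketch: clamp any temperature-band probability at a cap, then give
# the trimmed mass back to uncapped bands in proportion to their headroom.
import numpy as np

def cap_band_probs(band_probs, cap: float = 0.40):
    p = np.asarray(band_probs, dtype=float)
    capped = np.minimum(p, cap)
    excess = p.sum() - capped.sum()   # probability mass trimmed off
    headroom = cap - capped           # room left under the cap, per band
    if excess > 0 and headroom.sum() > 0:
        capped = capped + headroom * min(1.0, excess / headroom.sum())
    return capped

bands = np.array([0.05, 0.10, 0.55, 0.20, 0.10])   # illustrative EMOS v1 output
print(cap_band_probs(bands))                        # sums to 1, no band above 0.40
```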
NBA · Totals
Instrumented · publishing chart soon
Live totals residual GBM brings out-of-sample MAE from 18.2 to 11.2 (–39%) on a 1,220-game holdout. Vegas-prior τ-blend ships on top to anchor early-game projections.
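A sketch of what a τ-blend can look like here: the weight on the Vegas total decays as game time elapses, so pregame projections sit on the book number and late-game projections sit on the model. The exponential decay form and the τ value are illustrative assumptions, not the shipped parameters.

```python
# Hedged sketch: blend a live model total with the Vegas prior, trusting
# Vegas early and the model more as minutes elapse.
import math

def blend_total(model_total: float, vegas_total: float,
                minutes_elapsed: float, tau: float = 12.0) -> float:
    w_vegas = math.exp(-minutes_elapsed / tau)   # prior weight decays over game time
    return w_vegas * vegas_total + (1 - w_vegas) * model_total

print(blend_total(238.0, 224.5, minutes_elapsed=0.0))   # pregame: pure Vegas anchor
print(blend_total(238.0, 224.5, minutes_elapsed=24.0))  # halftime: mostly model
```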
NBA · Championship
Awaiting season data
Playoff-series model uses dual-Elo + in-series state. Two full seasons (2024, 2025) backfilled to tune the playoff_total_bias_correction. Calibration chart lands when the 2026 playoffs accumulate enough resolved series.
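For intuition on the in-series state, here is a minimal best-of-7 recursion under the simplifying assumption of a single fixed per-game win probability; the dual-Elo inputs and game-to-game adjustments the real model uses are not shown.

```python
# Hedged sketch: series win probability from a fixed per-game probability
# plus in-series state (current wins on each side).
from functools import lru_cache

@lru_cache(maxsize=None)
def series_win_prob(p_game: float, wins_a: int = 0, wins_b: int = 0, need: int = 4) -> float:
    if wins_a == need:
        return 1.0
    if wins_b == need:
        return 0.0
    return (p_game * series_win_prob(p_game, wins_a + 1, wins_b, need)
            + (1 - p_game) * series_win_prob(p_game, wins_a, wins_b + 1, need))

print(series_win_prob(0.60))        # fresh series
print(series_win_prob(0.60, 1, 2))  # down 1-2: in-series state shifts the odds
```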
CFB · Playoff
Awaiting season data
Committee-proxy seeding learned on 2014–2024 CFP history. Out-of-sample seed accuracy + bracket-walk Brier will land here when the 2026 season starts and we have live committee comparisons.
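Both metrics reduce to small computations once games resolve; a sketch on invented placeholder data (seedings, probabilities, and results below are all made up):

```python
# Hedged sketch: seed accuracy vs the committee, and a bracket-walk Brier
# that scores each bracket game's predicted win probability against its result.
predicted_seeds = [1, 2, 3, 5, 4, 6, 8, 7]   # model's seeding, illustrative
committee_seeds = [1, 2, 4, 3, 5, 6, 7, 8]   # committee's seeding, illustrative

seed_accuracy = sum(p == c for p, c in zip(predicted_seeds, committee_seeds)) / len(committee_seeds)

# (pregame win prob for team A, 1 if team A won) for each game as the bracket resolves
bracket_games = [(0.81, 1), (0.67, 1), (0.55, 0), (0.62, 1)]
bracket_brier = sum((p - y) ** 2 for p, y in bracket_games) / len(bracket_games)

print(f"seed accuracy: {seed_accuracy:.2f}, bracket-walk Brier: {bracket_brier:.4f}")
```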
NFL · Season wins
Live
Off-season ratings calibrated against DraftKings 2026 win totals: 25 of 32 teams within 1.5 wins. Median team within 0.8 wins. The full per-team table is in our internal Vegas-comparison doc; cleaning a public version now.
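The check behind those numbers is simple to reproduce; a sketch on placeholder figures (the real per-team numbers live in the internal doc):

```python
# Hedged sketch: count teams within 1.5 wins of the book line and take the
# median absolute gap. Values below are made up, not our ratings.
import statistics

model_wins = {"KC": 11.2, "BUF": 11.0, "DET": 10.1, "CAR": 5.9}   # illustrative
vegas_wins = {"KC": 11.5, "BUF": 10.5, "DET": 10.5, "CAR": 7.5}   # book win totals

gaps = [abs(model_wins[t] - vegas_wins[t]) for t in model_wins]
within = sum(g <= 1.5 for g in gaps)
print(f"{within} of {len(gaps)} within 1.5 wins; median gap {statistics.median(gaps):.1f}")
```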
What "calibrated" means here
Pick a model. Group every one of its predictions by predicted probability -- say in 10pp buckets. For each bucket, compute the actual rate at which the event happened. Plot those points. A perfectly calibrated model has every bucket sitting on the diagonal: where you predicted 60% you got 60%. We'll publish those plots per model here, with bin counts and rolling 30/60/90-day windows. The audit query runs Mondays at 10am PT in cbb_weekly_health_check_worker; parallel weather and NBA audits run nightly.
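A minimal version of that bucket audit, on synthetic data standing in for a model's resolved predictions; the bin width matches the 10pp buckets described above, and everything else is a stand-in.

```python
# Hedged sketch: reliability bins with counts, mean predicted probability,
# and actual outcome rate per bin.
import numpy as np

def calibration_bins(predicted, outcomes, n_bins=10):
    edges = np.linspace(0.0, 1.0, n_bins + 1)                     # 10pp bin edges
    idx = np.clip(np.digitize(predicted, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((edges[b], edges[b + 1], int(mask.sum()),
                         float(predicted[mask].mean()), float(outcomes[mask].mean())))
    return rows  # (bin_lo, bin_hi, n, mean_predicted, actual_rate)

rng = np.random.default_rng(1)
p = rng.uniform(size=5000)                    # synthetic predictions
y = (rng.uniform(size=5000) < p).astype(int)  # outcomes drawn to be well calibrated
for lo, hi, n, pred, actual in calibration_bins(p, y):
    print(f"[{lo:.1f}, {hi:.1f})  n={n:4d}  predicted={pred:.3f}  actual={actual:.3f}")
```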
Why we publish the misses
The two biggest model corrections we've shipped this year -- the EMOS-v1 overconfidence cap on weather and the live NBA totals GBM replacement -- both came out of the calibration process catching the model lying. The framework is the moat. The bin charts are how we prove it's real.