Week 2026-W20 — Blog edge findings
Generated: 2026-05-13 09:10 MDT · Cadence: weekly (1-2 posts) · Run type: first run; no prior claims to resolve.
Summary
Live findings: 0. The validated-cell whitelist matched 0 markets
that ALSO passed the empirical backing gate and the 4-check artifact
filter this week. CFB is offseason (kickoff 2026-08-29). MLB F5
markets aren't yet posting (forward-audit on the F5 cell is dated
2026-05-21 in project_mlb_f5_day1_2026_05_07). NBA team-totals only
has n=40 resolved — too small for a claim. Weather post-fix
predictions FAIL their own calibration check (see Finding 1 below).
Recommended post: ONE falsification piece this week (Finding 1).
It's the strongest honest story available and connects to a known
memory entry (project_kalshi_weather_calibration_audit_2026_05_12),
turning recent internal-research output into reader-facing content.
Finding 1 — Falsification — "Our weather temp model is overconfident; the market is right"
Kind: falsification / calibration retro (NOT a live pick).
Hook for the post: "We built our own ensemble weather model. Our model says Kalshi temp markets are systematically underpriced by 8 points. We've run the experiment 475 times. Here's why we're not betting our own model."
The data
Resolved weather temp predictions, post-2026-04-19 NWS-resolver fix
(all earlier predictions were corrupted by the low-vs-high bug per
project_weather_resolve_bug_2026_04_19):
| Model version | n | Wins | Realized WR | Avg market price | Avg model prob | Model − Reality |
|---|---|---|---|---|---|---|
| pre_emos | 122 | 22 | 18.0% | 19.5% | 31.5% | +13.5pp |
| emos_v1 | 174 | 44 | 25.3% | 24.6% | 31.8% | +6.5pp |
| emos_v2_skill | 118 | 27 | 22.9% | 25.6% | 32.5% | +9.6pp |
| emos_v2_obs_features | 20 | 7 | 35.0% | 30.5% | 26.1% | −8.9pp (n too small) |
| (null mv) | 41 | 13 | 31.7% | 26.3% | 34.3% | +2.6pp |
| Total (post-fix) | 475 | 113 | 23.8% | 23.9% | 31.9% | +8.1pp |
What this means
- Our model has thought weather temp markets were underpriced by 6-13 points in every production version we've run.
- Reality (475 resolved bets): the market was correct to 0.1pp. The market wasn't underpriced — our model was overconfident.
- This is the same pattern as the "cheap-YES +815% ROI" artifact (`feedback_cheap_yes_artifact_2026_05_07`): the model emits high probabilities, the price says "no it won't," and reality agrees with the price. Three independent model versions, same direction, same magnitude. That's a model problem, not a market problem.
- Wilson 95% CI on the pooled 23.8% (n=475): roughly [0.20, 0.28]. The model's claimed 31.9% sits well above the upper bound — not noise, but structural overconfidence.
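The pooled interval quoted above can be reproduced with a standard Wilson score interval. A minimal sketch — the `wilson_ci` helper is illustrative, not project code:

```python
from math import sqrt

def wilson_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial win rate."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Pooled post-fix sample: 113 wins in 475 resolved predictions.
lo, hi = wilson_ci(113, 475)
print(f"realized WR 95% CI: [{lo:.3f}, {hi:.3f}]")
# The model's pooled claimed probability (31.9%) sits above hi.
```

Unlike the normal approximation, Wilson stays well-behaved at the small win rates these markets trade at, which is why the checklist leans on it.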
Why it's a good blog post
- Counter-narrative. Most edge-detection content is "look at this alpha." This is "here's our model failing in production and what we learned." Trust-builder.
- It connects to a falsification we already documented internally (`project_kalshi_weather_calibration_audit_2026_05_12` flagged the +30pp finding as likely artifact). The post can walk through the chain: model finds edge → backtest confirms → live resolution refutes → here's the diagnosis.
- Reader-actionable framing: "Why we treat 'this market is underpriced' as a hypothesis to test, not a signal to trade."
Suggested angles to dig into
- Why does the gap shrink (13.5pp → 6.5pp → 9.6pp) across model versions but never disappear? Calibration improvements helped but didn't fix the structural bias.
- Calibration plot: pred-prob bin vs realized WR. The visual will show the model's confidence curve diverging from y=x in the 20-40% bin.
- The MAD-recentering shipped 2026-04-06 (`project_weather_trust_recenter_2026_04_06`) — did it help? Cut by `details->>'forecast_source'` to see.
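The calibration-plot angle above reduces to binning resolved predictions by model probability and comparing realized win rate per bin. A sketch under an assumed row shape of `(model_prob, won)` pairs pulled from the predictions table (the helper and toy data are ours):

```python
def calibration_bins(preds, n_bins=10):
    """Bucket (model_prob, won) pairs into equal-width probability bins
    and return (bin_midpoint, n, realized_wr) rows for a reliability plot."""
    rows = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [won for p, won in preds if lo <= p < hi]
        if bucket:
            rows.append(((lo + hi) / 2, len(bucket), sum(bucket) / len(bucket)))
    return rows

# Toy data shaped like the finding: model claims ~32%, reality delivers ~22%.
toy = [(0.32, 1)] * 22 + [(0.32, 0)] * 78
for mid, n, wr in calibration_bins(toy):
    print(f"bin {mid:.2f}: n={n}, realized={wr:.2f}")
```

Plotting bin midpoint against realized WR and overlaying y=x makes the 20-40% divergence visible at a glance.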
Sources
- `predictions` table query, executed 2026-05-13 09:08 MDT
- `project_kalshi_weather_calibration_audit_2026_05_12.md`
- `feedback_cheap_yes_artifact_2026_05_07.md`
- `project_weather_resolve_bug_2026_04_19.md` (the 2026-04-19 cutoff)
- `project_weather_trust_recenter_2026_04_06.md`
Artifact checklist
- ✓ Exogenous resolution. Predictions resolved against NWS observations via the post-2026-04-19 fixed resolver. `feedback_terminal_price_proxy_never` not violated.
- ✓ No `close_at` proxy used — this analysis is on resolved predictions, not Kalshi orderbook timing.
- ✓ Post-fix data only. `predicted_at >= 2026-04-19`. The 289 predictions corrupted by the NWS low-vs-high bug are excluded.
- ✓ Sample size. n=475 pooled, n=174 on the largest single model version. Wilson CI is tight.
Slate notes (what was rejected and why)
| Cell | Live opps (24h) | Resolved n | Verdict |
|---|---|---|---|
| CFB home dog edge≥5 pickem-7 | 0 | n/a | Offseason — kickoff 2026-08-29 |
| CFB edge≥10 pickem-7/14-21 | 0 | n/a | Offseason |
| Weather EMOS post-fix | 7 today | 475 | FAIL empirical — model overconfident +8.1pp; turned into Finding 1 |
| NBA totals overreaction | needs live in-game | n/a | Pre-scan data can't surface this; needs /live watcher signal |
| NBA team-totals Vegas-divergence gate | 22 NBA opps total | 40 | FAIL n: n=40 < 150, Wilson CI [0.42, 0.71] spans BE 0.519 |
| CBB CWS futures | 0 | n/a | CWS markets not posted yet (selection late May) |
| CBB ATS road-fav -1.5 (paper) | DK CBB spread present | 0 resolved | Resolver hasn't run on cbb_dk_spread — data-pipeline issue (see below) |
| MLB KXMLBF5 winner | 0 | 0 | F5 markets not in opportunities slate; forward-audit dated 2026-05-21 |
Data-pipeline issue surfaced this run
`cbb_dk_spread` and `cbb_dk_totals` have 295 predictions but 0 resolved. The resolver isn't processing DK CBB predictions. This blocks the ATS road-favorite cell from ever producing a hit-rate claim until fixed. Not a blog topic — internal-fix ticket. Likely lives in `polyedge/workers/resolve_worker.py` — the working tree shows it as currently modified (`git status` flagged it). Worth checking that the in-progress changes don't drop DK CBB resolution.
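A cheap guard against this class of silent-skip bug is a coverage check that flags any cell with predictions but zero resolutions. A sketch under an assumed `(cell, resolved)` row shape — not the actual `resolve_worker.py` interface:

```python
from collections import defaultdict

def resolution_coverage(rows):
    """Count total vs resolved predictions per cell and return the cells
    the resolver is silently skipping (total > 0, resolved == 0)."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [total, resolved]
    for cell, resolved in rows:
        counts[cell][0] += 1
        counts[cell][1] += int(resolved)
    return {cell: (n, r) for cell, (n, r) in counts.items() if r == 0}

# Hypothetical rows shaped like this week's finding.
rows = [("cbb_dk_spread", False)] * 295 + [("weather_emos", True)] * 475
print(resolution_coverage(rows))  # {'cbb_dk_spread': (295, 0)}
```

Run weekly alongside the slate scan, this would have surfaced the DK CBB gap 295 predictions earlier.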
What didn't make it (and why)
- NBA spread edges from the live slate (`KXNBASPREAD-26MAY13CLEDET-*` cluster — 5 of the top 22 NBA opps were the same CLE@DET game with different point lines). These are model-vs-market spread snapshots with no whitelist cell behind them. The user's memory has banned proposing edges purely from `edge > X` filters without a validated pattern. Skipped.
- MLB total Unders (3 of top 5 MLB edges). MLB totals had a catastrophic falsification audit (`project_kalshi_mlb_totals_overhaul_2026_04_04` — no significant results at 70 games). Not on whitelist. Skipped.
- CBB DK spread edges +31% on Mercer/GT — CBB regular-season DK isn't on the whitelist per `project_cbb_cws_pivot_2026_05_12` (pivot to CWS futures). Skipped despite headline size.