Week 2026-W20 — Blog edge findings
Generated: 2026-05-13 09:10 MDT · Cadence: weekly (1-2 posts) · Run type: first run; no prior claims to resolve.
Summary
Live findings: 0. The validated-cell whitelist matched 0 markets
that ALSO passed the empirical backing gate and the 4-check artifact
filter this week. CFB is offseason (kickoff 2026-08-29). MLB F5
markets aren't yet posting (forward-audit on the F5 cell is dated
2026-05-21 in project_mlb_f5_day1_2026_05_07). NBA team-totals only
has n=40 resolved — too small for a claim. Weather post-fix
predictions FAIL their own calibration check (see Finding 1 below).
Recommended post: ONE falsification piece this week (Finding 1).
It's the strongest honest story available and connects to a known
memory entry (project_kalshi_weather_calibration_audit_2026_05_12),
turning recent internal-research output into reader-facing content.
Finding 1 — Falsification — "Our weather temp model is overconfident; the market is right"
Kind: falsification / calibration retro (NOT a live pick).
Hook for the post: "We built our own ensemble weather model. Our model says Kalshi temp markets are systematically underpriced by 8 points. We've run the experiment 475 times. Here's why we're not betting our own model."
The data
Resolved weather temp predictions, post-2026-04-19 NWS-resolver fix
(all earlier predictions were corrupted by the low-vs-high bug per
project_weather_resolve_bug_2026_04_19):
| Model version | n | Wins | Realized WR | Avg market price | Avg model prob | Model − Reality |
|---|---|---|---|---|---|---|
| pre_emos | 122 | 22 | 18.0% | 19.5% | 31.5% | +13.5pp |
| emos_v1 | 174 | 44 | 25.3% | 24.6% | 31.8% | +6.5pp |
| emos_v2_skill | 118 | 27 | 22.9% | 25.6% | 32.5% | +9.6pp |
| emos_v2_obs_features | 20 | 7 | 35.0% | 30.5% | 26.1% | −8.9pp (n too small) |
| (null mv) | 41 | 13 | 31.7% | 26.3% | 34.3% | +2.6pp |
| Total (post-fix) | 475 | 113 | 23.8% | 23.9% | 31.9% | +8.1pp |
What this means
- Our model has thought weather temp markets were underpriced by 6-13 points in every production version we've run.
- Reality (475 resolved bets): the market was correct to 0.1pp. The market wasn't underpriced — our model was overconfident.
- This is the same pattern as the "cheap-YES +815% ROI" artifact (`feedback_cheap_yes_artifact_2026_05_07`): the model emits high probabilities, the price says "no it won't," and reality agrees with the price. Three independent model versions, same direction, same magnitude. That's a model problem, not a market problem.
- Wilson 95% CI on the pooled 23.8% (n=475): roughly [0.20, 0.28]. The model's claimed 31.9% sits well above the upper bound — not noise, but structural overconfidence.
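The pooled interval quoted above can be reproduced with a standard Wilson score interval. A minimal sketch — the `wilson_ci` helper is illustrative, not project code:

```python
from math import sqrt

def wilson_ci(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial win rate."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Pooled post-fix sample: 113 wins in 475 resolved predictions.
lo, hi = wilson_ci(113, 475)
print(f"realized WR 95% CI: [{lo:.3f}, {hi:.3f}]")
# The model's pooled claimed probability (31.9%) sits above hi.
```

Unlike the normal approximation, Wilson stays well-behaved at the small win rates these markets trade at, which is why the checklist leans on it.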
Why it's a good blog post
- Counter-narrative. Most edge-detection content is "look at this alpha." This is "here's our model failing in production and what we learned." Trust-builder.
- It connects to a falsification we already documented internally (`project_kalshi_weather_calibration_audit_2026_05_12` flagged the +30pp finding as likely artifact). The post can walk through the chain: model finds edge → backtest confirms → live resolution refutes → here's the diagnosis.
- Reader-actionable framing: "Why we treat 'this market is underpriced' as a hypothesis to test, not a signal to trade."
Suggested angles to dig into
- Why does the gap shrink (13.5pp → 6.5pp → 9.6pp) across model versions but never disappear? Calibration improvements helped but didn't fix the structural bias.
- Calibration plot: pred-prob bin vs realized WR. The visual will show the model's confidence curve diverging from y=x in the 20-40% bin.
- The MAD-recentering shipped 2026-04-06 (`project_weather_trust_recenter_2026_04_06`) — did it help? Cut by `details->>'forecast_source'` to see.
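The calibration-plot angle above reduces to binning resolved predictions by model probability and comparing realized win rate per bin. A sketch under an assumed row shape of `(model_prob, won)` pairs pulled from the predictions table (the helper and toy data are ours):

```python
def calibration_bins(preds, n_bins=10):
    """Bucket (model_prob, won) pairs into equal-width probability bins
    and return (bin_midpoint, n, realized_wr) rows for a reliability plot."""
    rows = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [won for p, won in preds if lo <= p < hi]
        if bucket:
            rows.append(((lo + hi) / 2, len(bucket), sum(bucket) / len(bucket)))
    return rows

# Toy data shaped like the finding: model claims ~32%, reality delivers ~22%.
toy = [(0.32, 1)] * 22 + [(0.32, 0)] * 78
for mid, n, wr in calibration_bins(toy):
    print(f"bin {mid:.2f}: n={n}, realized={wr:.2f}")
```

Plotting bin midpoint against realized WR and overlaying y=x makes the 20-40% divergence visible at a glance.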
Sources
- `predictions` table query, executed 2026-05-13 09:08 MDT
- `project_kalshi_weather_calibration_audit_2026_05_12.md`
- `feedback_cheap_yes_artifact_2026_05_07.md`
- `project_weather_resolve_bug_2026_04_19.md` (the 2026-04-19 cutoff)
- `project_weather_trust_recenter_2026_04_06.md`
Artifact checklist
- ✓ Exogenous resolution. Predictions resolved against NWS observations via the post-2026-04-19 fixed resolver. `feedback_terminal_price_proxy_never` not violated.
- ✓ No `close_at` proxy used — this analysis is on resolved predictions, not Kalshi orderbook timing.
- ✓ Post-fix data only. `predicted_at >= 2026-04-19`. The 289 predictions corrupted by the NWS low-vs-high bug are excluded.
- ✓ Sample size. n=475 pooled, n=174 on the largest single model version. Wilson CI is tight.
Slate notes (what was rejected and why)
| Cell | Live opps (24h) | Resolved n | Verdict |
|---|---|---|---|
| CFB home dog edge≥5 pickem-7 | 0 | n/a | Offseason — kickoff 2026-08-29 |
| CFB edge≥10 pickem-7/14-21 | 0 | n/a | Offseason |
| Weather EMOS post-fix | 7 today | 475 | FAIL empirical — model overconfident +8.1pp; turned into Finding 1 |
| NBA totals overreaction | needs live in-game | n/a | Pre-scan data can't surface this; needs /live watcher signal |
| NBA team-totals Vegas-divergence gate | 22 NBA opps total | 40 | FAIL n: n=40 < 150, Wilson CI [0.42, 0.71] spans BE 0.519 |
| CBB CWS futures | 0 | n/a | CWS markets not posted yet (selection late May) |
| CBB ATS road-fav -1.5 (paper) | DK CBB spread present | 0 resolved | Resolver hasn't run on cbb_dk_spread — data-pipeline issue (see below) |
| MLB KXMLBF5 winner | 0 | 0 | F5 markets not in opportunities slate; forward-audit dated 2026-05-21 |
Data-pipeline issue surfaced this run
`cbb_dk_spread` and `cbb_dk_totals` have 295 predictions but 0 resolved. The resolver isn't processing DK CBB predictions. This blocks the ATS road-favorite cell from ever producing a hit-rate claim until fixed. Not a blog topic — internal-fix ticket. Likely lives in `polyedge/workers/resolve_worker.py` — the working tree shows it as currently modified (`git status` flagged it). Worth checking that the in-progress changes don't drop DK CBB resolution.
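A cheap guard against this class of silent-skip bug is a coverage check that flags any cell with predictions but zero resolutions. A sketch under an assumed `(cell, resolved)` row shape — not the actual `resolve_worker.py` interface:

```python
from collections import defaultdict

def resolution_coverage(rows):
    """Count total vs resolved predictions per cell and return the cells
    the resolver is silently skipping (total > 0, resolved == 0)."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [total, resolved]
    for cell, resolved in rows:
        counts[cell][0] += 1
        counts[cell][1] += int(resolved)
    return {cell: (n, r) for cell, (n, r) in counts.items() if r == 0}

# Hypothetical rows shaped like this week's finding.
rows = [("cbb_dk_spread", False)] * 295 + [("weather_emos", True)] * 475
print(resolution_coverage(rows))  # {'cbb_dk_spread': (295, 0)}
```

Run weekly alongside the slate scan, this would have surfaced the DK CBB gap 295 predictions earlier.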
What didn't make it (and why)
- NBA spread edges from the live slate (`KXNBASPREAD-26MAY13CLEDET-*` cluster — 5 of the top 22 NBA opps were the same CLE@DET game with different point lines). These are model-vs-market spread snapshots with no whitelist cell behind them. The user's memory has banned proposing edges purely from `edge > X` filters without a validated pattern. Skipped.
- MLB total Unders (3 of top 5 MLB edges). MLB totals had a catastrophic falsification audit (`project_kalshi_mlb_totals_overhaul_2026_04_04` — no significant results at 70 games). Not on whitelist. Skipped.
- CBB DK spread edges +31% on Mercer/GT — CBB regular-season DK isn't on the whitelist per `project_cbb_cws_pivot_2026_05_12` (pivot to CWS futures). Skipped despite headline size.