01 · Probabilistic forecasting — UW–Madison research
World Cup 2026 Bayesian Forecaster
A PyTorch policy-gradient network with a Bayesian neural network head (Monte Carlo weight sampling), an XGBoost weather sub-model, temperature-scaled calibration, and a Monte Carlo tournament layer — served via FastAPI in the real system. Brier score 0.24 → 0.176.
Everything here is reproducible: the model is fixed, so the seeded views give the same result on every click. "▶ Play the predicted World Cup" animates the model's representative tournament (groups, bracket, champion) from a fixed random seed, so it always ends the same way. "📊 The model's forecast" is the formal prediction: 5,000 seeded Monte Carlo futures aggregated into calibrated title probabilities, because a probabilistic model's answer is a distribution, not one name. "⚡ Run 5,000" re-runs everything with fresh dice; individual simulations differ, but the aggregated numbers should land close to the official ◇ odds again.
world champion probability · ▮ live simulation vs ◇ official 5,000-run distribution
Group results sample directly from the calibrated ensemble's real per-match probabilities. Knockout matchups use team strengths fitted to my official Monte Carlo run (◇); bracket pairings are approximated.
A broadcast-style match center for all 72 group fixtures — except the win probabilities are the exact outputs of the three model variants: the standard network (SNN), the Bayesian head (BNN), and the calibrated ensemble that blends them with XGBoost weather features.
A Bayesian neural network doesn't have fixed weights — it has distributions over weights. Each forward pass samples a slightly different network. Hundreds of passes produce a spread of predictions: that spread is the model telling you how much to trust it.
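A minimal sketch of how that spread can be summarised, assuming each Monte Carlo pass returns a 3-way probability vector. The sampler here is a toy stand-in (Gaussian noise on fixed logits), not the real Bayesian head, and every name is hypothetical. It computes predictive entropy (total uncertainty), mutual information (epistemic uncertainty) and the standard deviation of the win probability.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def sample_predictions(mean_logits, weight_noise, n_samples, rng):
    """Toy stand-in for a Bayesian head: each pass perturbs the logits."""
    return [
        softmax([mu + rng.gauss(0.0, weight_noise) for mu in mean_logits])
        for _ in range(n_samples)
    ]

def decompose_uncertainty(samples):
    n = len(samples)
    k = len(samples[0])
    mean_p = [sum(s[i] for s in samples) / n for i in range(k)]
    total = entropy(mean_p)                           # predictive entropy
    expected = sum(entropy(s) for s in samples) / n   # mean per-sample entropy
    epistemic = total - expected                      # mutual information
    win_std = math.sqrt(sum((s[0] - mean_p[0]) ** 2 for s in samples) / n)
    return {"mean": mean_p, "total": total, "epistemic": epistemic, "win_std": win_std}

rng = random.Random(0)
logits = [1.0, 0.2, -0.5]  # made-up home / draw / away logits
confident = decompose_uncertainty(sample_predictions(logits, 0.05, 300, rng))
unsure = decompose_uncertainty(sample_predictions(logits, 1.0, 300, rng))
print(confident["epistemic"], unsure["epistemic"])
```

With wider weight noise the passes disagree more, so both the mutual information and the win-probability standard deviation grow.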
Calibration means: when the model says 70%, it should be right ~70% of the time. Temperature scaling plus the BNN head moved the Brier score from 0.24 to 0.176 and recall to 84%, validated with 5-fold cross-validation and McNemar tests, with diagnostics logged to PostgreSQL.
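As an illustration of temperature scaling, here is a minimal sketch: divide the logits by a single scalar T before the softmax and choose T to minimise negative log-likelihood on held-out data. The validation logits and labels are made up, and the grid search is an assumption chosen for readability; a real pipeline would more likely use a gradient-based optimiser.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, lo=0.05, hi=10.0, steps=400):
    """Grid search over T on a log scale (hypothetical helper)."""
    best_t, best_loss = 1.0, nll(logit_rows, labels, 1.0)
    for i in range(steps + 1):
        t = lo * (hi / lo) ** (i / steps)
        loss = nll(logit_rows, labels, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t

# Made-up overconfident validation set: sharp logits, only half the picks correct.
val_logits = [
    [4.0, 0.0, -1.0], [3.5, 0.5, 0.0], [0.0, 3.0, -0.5],
    [-1.0, 0.0, 4.0], [3.0, 0.0, 0.5], [0.5, 3.5, 0.0],
]
val_labels = [0, 1, 1, 2, 2, 0]

T = fit_temperature(val_logits, val_labels)
print(T, nll(val_logits, val_labels, 1.0), nll(val_logits, val_labels, T))
```

A fitted T above 1 softens overconfident probabilities without changing which outcome is ranked first.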
The predictions on this page were locked in before the tournament. From June 11, a GitHub Action pulls official results every 6 hours and this scoreboard grades every group-stage prediction automatically — accuracy, live Brier score, no excuses.
Brier score = squared error summed over the three outcome probabilities, averaged over matches (lower is better; 0.667 = uniform guessing, my backtest = 0.176). Knockout fixtures aren't in the locked group-stage prediction set, so they're tracked as results only. Results source: fixturedownload.com, synced by .github/workflows/update-results.yml.
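For reference, a short sketch of that score as described here: squared errors are summed across the three outcomes for each match, then averaged over matches. The function name and sample data are hypothetical. Under this convention a uniform 1/3, 1/3, 1/3 forecast scores 2/3, which is the 0.667 quoted above.

```python
def brier_score(probs, outcomes):
    """Multiclass Brier score: per-match sum of squared errors, averaged over matches."""
    total = 0.0
    for p, outcome in zip(probs, outcomes):
        total += sum(
            (p_k - (1.0 if k == outcome else 0.0)) ** 2 for k, p_k in enumerate(p)
        )
    return total / len(probs)

# Outcome index: 0 = home win, 1 = draw, 2 = away win (made-up results).
results = [0, 1, 2, 0]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(results)
baseline = brier_score(uniform, results)
print(round(baseline, 3))
```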
the football is the demo — the same model ports directly to these domains
📦 Demand forecasting
Slow-moving SKUs are low-count events, exactly like goals. Calibrated outcome probabilities + the Monte Carlo layer = inventory scenario planning with honest service-level estimates instead of point forecasts.
🏦 Credit risk
Default / cure / prepay is the same 3-way classification as win / draw / loss. Banks are required to prove calibration (Basel) — reliability diagrams, Brier decomposition and temperature scaling are the exact validation toolkit.
🛡 Insurance pricing
Claim frequency is count modeling under exogenous conditions. The XGBoost weather sub-model is literally a catastrophe-covariate block; calibrated probabilities are the difference between profitable and mispriced premiums.
🏭 Predictive maintenance
P(failure within 30 days) with epistemic uncertainty tells you not just what will break but when the model doesn't know — inspect exactly those assets. I shipped this pattern in production at Powertrac and Tata Motors.
🩺 Clinical risk scores
Readmission and complication models are only useful if 70% means 70% — miscalibrated scores erode clinician trust. The BNN spread flags the patients where the model should defer to humans.
⚡ Energy & finance
The time-decay feature weighting is EWMA from volatility modeling; Monte Carlo propagation of match uncertainty to title odds is the same machinery as load forecasting bands and portfolio scenario analysis.
📘 Model deep dive — architecture, features, training, calibration, serving
The problem
Football outcomes are low-count, high-variance events: a 3-way classification (home / draw / away) where even the best team loses often. The interesting challenge isn't predicting a winner — it's producing an honest probability, one you could bet on, plan inventory around, or price risk with. That's why the whole system is built around calibration rather than raw accuracy.
Features
Each fixture is described by engineered team-pair features: rolling form and attack/defense rates with time-decay weighting (the same idea as EWMA in volatility modeling), ranking deltas, rest and travel context, and host advantage. On top of that sit two live signals: an NLP pipeline that pulls RSS news, scores sentiment per team and per key player, and folds it into the features — and a weather block (temperature, humidity, precipitation, wind, cloud cover, pressure) for the actual stadium and kickoff window, which you can toggle in the Match center to see its real effect per fixture.
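The time-decay weighting can be sketched as an exponentially weighted mean in which a match `half_life` games old counts half as much as the latest one. The function name, the half-life of 3 matches and the sample goal counts are assumptions for illustration, not values from the real feature pipeline.

```python
def ewma_form(values, half_life):
    """Exponentially weighted mean; values are ordered oldest to newest."""
    decay = 0.5 ** (1.0 / half_life)
    weighted_sum = 0.0
    weight_total = 0.0
    for age, value in enumerate(reversed(values)):
        weight = decay ** age  # age 0 is the most recent match
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

goals_scored = [0, 1, 1, 3, 2]  # made-up last five matches, oldest first
recent_attack = ewma_form(goals_scored, half_life=3)
plain_mean = sum(goals_scored) / len(goals_scored)
print(recent_attack, plain_mean)
```

Because the recent matches in this sample were higher scoring, the decayed rate sits above the plain average.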
Architecture — three models, one ensemble
SNN: a PyTorch neural network over the engineered features, trained with a policy-gradient objective and emitting a 3-way softmax. It's the fast, deterministic baseline: one forward pass, one answer.
BNN: the same backbone extended with a Bayesian output head. Instead of fixed weights, the head learns weight distributions; at inference I draw hundreds of Monte Carlo weight samples, each producing a slightly different network and prediction. The spread of those predictions yields three uncertainty measures: total uncertainty (predictive entropy), epistemic uncertainty (mutual information: what the model doesn't know because it hasn't seen enough data), and the win-probability standard deviation you can watch being sampled in the "Inside the model" tab.
XGBoost weather sub-model: a gradient-boosted tree model that learns how weather shifts outcomes, feeding the ensemble.
Calibrated ensemble: blends SNN and BNN with the weather signal and applies temperature scaling so that "70%" means 70%.
Poisson scoreline layer: sits on top and inverts the calibrated win/draw/loss probabilities into goal rates (λhome, λaway), then reports the most likely scoreline consistent with the pick. Score predictions are a pure function of the same probabilities, not a separate guess.
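One way the Poisson inversion could work is sketched below. It assumes independent Poisson goal counts, a score grid truncated at 10 goals, and a brute-force grid search over the two rates; none of these choices are confirmed as the real layer's method, and all names are hypothetical.

```python
import math

MAX_GOALS = 10  # truncation point for the score grid (assumption)

def poisson_pmf(lam):
    """Poisson probabilities for 0..MAX_GOALS goals at rate lam."""
    return [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(MAX_GOALS + 1)]

def outcome_probs(pmf_home, pmf_away):
    """(home win, draw, away win) under independent Poisson goal counts."""
    home = draw = away = 0.0
    for h, ph in enumerate(pmf_home):
        for a, pa in enumerate(pmf_away):
            if h > a:
                home += ph * pa
            elif h == a:
                draw += ph * pa
            else:
                away += ph * pa
    total = home + draw + away  # renormalise after truncation
    return home / total, draw / total, away / total

def fit_goal_rates(target, step=0.05, max_rate=4.0):
    """Grid search for the rates whose outcome probabilities best match target."""
    rates = [step * i for i in range(1, int(round(max_rate / step)) + 1)]
    pmfs = [poisson_pmf(r) for r in rates]
    best = (float("inf"), None, None)
    for lam_h, pmf_h in zip(rates, pmfs):
        for lam_a, pmf_a in zip(rates, pmfs):
            fitted = outcome_probs(pmf_h, pmf_a)
            err = sum((p - t) ** 2 for p, t in zip(fitted, target))
            if err < best[0]:
                best = (err, lam_h, lam_a)
    return best[1], best[2]

def most_likely_scoreline(lam_home, lam_away, pick):
    """Highest-probability score among those matching pick (0 home, 1 draw, 2 away)."""
    pmf_h, pmf_a = poisson_pmf(lam_home), poisson_pmf(lam_away)
    best_score, best_p = None, -1.0
    for h, ph in enumerate(pmf_h):
        for a, pa in enumerate(pmf_a):
            outcome = 0 if h > a else (1 if h == a else 2)
            if outcome == pick and ph * pa > best_p:
                best_score, best_p = (h, a), ph * pa
    return best_score

target = (0.55, 0.25, 0.20)  # made-up calibrated probabilities
lam_home, lam_away = fit_goal_rates(target)
score = most_likely_scoreline(lam_home, lam_away, pick=target.index(max(target)))
print(lam_home, lam_away, score)
```

Restricting the search to scorelines that match the pick is what keeps the reported score consistent with the predicted outcome.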
Training & validation
Validated with 5-fold cross-validation and historical backtests; every variant comparison (e.g. weather on vs off) goes through McNemar tests in an A/B framework, so a feature only survives if it helps beyond noise. Calibration diagnostics — reliability diagrams, expected calibration error, Brier decomposition — plus experiment metadata are logged to PostgreSQL. Results: Brier score 0.24 → 0.176 (−26.7% vs baseline), recall 84%, scenario variance −17%.
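A McNemar comparison of two variants on the same fixtures could look like the sketch below. It uses the exact two-sided binomial form, which is an assumption (the chi-squared approximation is another common choice), and the correctness flags are made up.

```python
import math

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    Only discordant pairs matter: fixtures one variant got right and the other wrong.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)

# Made-up example: variant A is right on 10 fixtures that variant B misses.
weather_on = [True] * 15
weather_off = [False] * 10 + [True] * 5
only_on, only_off, p_value = mcnemar_exact(weather_on, weather_off)
print(only_on, only_off, p_value)
```

A small p-value means the two variants disagree in a lopsided way that is unlikely to be noise.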
The Monte Carlo tournament layer
Match probabilities become title odds by simulation: sample all 72 group fixtures from the ensemble's distributions, resolve standings (including the 8 best third-placed teams in the 48-team format), play the bracket to the final, repeat 5,000 times, and count champions. Uncertainty propagates from single matches all the way to the trophy — exactly what the simulator tab does in your browser with the same numbers.
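The same idea at the scale of one group is sketched below: sample each fixture from its 3-way probabilities, award points, pick the group winner, and repeat with a seeded generator. The teams, the probabilities, the seed and the random tiebreak are all assumptions; the real layer also resolves full standings and plays the bracket.

```python
import random
from collections import Counter

# Hypothetical (home, away) -> (p_home, p_draw, p_away); not real model output.
FIXTURES = {
    ("A", "B"): (0.50, 0.27, 0.23),
    ("C", "D"): (0.40, 0.30, 0.30),
    ("A", "C"): (0.55, 0.25, 0.20),
    ("B", "D"): (0.45, 0.28, 0.27),
    ("A", "D"): (0.60, 0.23, 0.17),
    ("B", "C"): (0.38, 0.30, 0.32),
}
TEAMS = ["A", "B", "C", "D"]

def simulate_group(rng):
    """Play every fixture once and return the group winner."""
    points = {team: 0 for team in TEAMS}
    for (home, away), probs in FIXTURES.items():
        result = rng.choices(("home", "draw", "away"), weights=probs)[0]
        if result == "home":
            points[home] += 3
        elif result == "away":
            points[away] += 3
        else:
            points[home] += 1
            points[away] += 1
    # Ties on points are broken at random here (a simplifying assumption).
    return max(TEAMS, key=lambda team: (points[team], rng.random()))

def group_winner_odds(n_runs=5000, seed=2026):
    """Share of seeded simulations in which each team wins the group."""
    rng = random.Random(seed)
    wins = Counter(simulate_group(rng) for _ in range(n_runs))
    return {team: wins[team] / n_runs for team in TEAMS}

odds = group_winner_odds()
print(odds)
```

Fixing the seed makes the aggregated odds identical on every run, while a different seed gives slightly different numbers around the same distribution.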
Serving & feedback
The real system runs behind FastAPI: REST endpoints for predictions, team data and result uploads, a post-match feedback loop that updates the model as results land, and the news pipeline refreshing sentiment features continuously. It's deployable software, not a notebook.
Why it generalizes
Calibrated low-count forecasting is the same problem as demand planning for slow-moving inventory, credit-risk PDs, and insurance claim frequency. The football is the demo; the method is the product.