Two days ago I let a machine fill out my bracket. A stacked ensemble trained on 18 years of tournament data, 45+ features per team, 10,000 Monte Carlo simulations. It picked Duke to win it all.
Then the actual games started.
## The Chalk Was Easy
The model crushed the top of the bracket. All four 1-seeds advanced. All four 2-seeds advanced. All four 3-seeds advanced. Every 4-seed won. That's 16 for 16 on the top four seed lines, which sounds impressive until you realize my dog could have done the same thing by picking the team listed first.
The model was built to find signal in efficiency metrics, and at the top of the bracket, the signal is deafening. Duke's adjusted efficiency margin is so far above Siena's that you don't need machine learning to tell you what happens. You need it for the games where the numbers are close, and that's exactly where the model fell apart.
## The Madness Was Hard
Eight upsets. The model missed every single one of them.
The headliners: 12-seed High Point knocked off 5-seed Wisconsin by one point. 11-seed VCU took down 6-seed North Carolina. 10-seed Texas A&M beat 7-seed Saint Mary's. And in the most statistically unlikely outcome of the first round, all four 9-seeds won their games against 8-seeds. TCU over Ohio State. Saint Louis over Georgia. Utah State over Villanova. Iowa over Clemson. Historically, the 8-9 matchup is a coin flip. This year, the coin landed tails four times in a row.
The model had Ohio State winning comfortably. It had Clemson advancing. It even made one bold upset pick of its own: Santa Clara (10) over Kentucky (7), at 62.7% confidence. That one almost hit. Santa Clara led 73-70 with 2.4 seconds left after Allen Graves hit a three. Kentucky's Otega Oweh banked in a half-court shot at the buzzer to force overtime, then carried them to an 89-84 win with 35 points. The model had the right read on that game. It just couldn't account for a half-court bank shot.
The actual upsets came from directions the model never saw coming.
## The Scorecard
Picking the higher seed in every game would have gotten you 24 out of 32 correct, a 75% hit rate. The model finished right around that mark: it didn't lose to the chalk baseline, but it didn't meaningfully beat it either.
That's both the honest assessment and the expected one. The model's 73.6% holdout accuracy was measured on historical games whose outcomes were already known. Predicting forward into a single-elimination tournament, where one bad shooting night ends your season, is a fundamentally different problem. The model identifies which team is better. The tournament asks which team is better on that specific Tuesday night in March.
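The chalk baseline is worth making concrete, because it's the bar any model has to clear. A minimal sketch (the 1/0 list below is just the first-round tally from above, 24 chalk results and 8 upsets, not real game data):

```python
# 1 = the better seed won, 0 = upset. 24 of 32 first-round games went chalk.
outcomes = [1] * 24 + [0] * 8

def chalk_accuracy(outcomes):
    """Always picking the better seed is right exactly when chalk holds."""
    return sum(outcomes) / len(outcomes)

print(chalk_accuracy(outcomes))  # 0.75
```

Any model whose first-round accuracy lands near 0.75 is, in effect, an expensive way of reading the seed line.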
A [recent paper out of arXiv](https://arxiv.org/html/2603.10916v1) on bracket prediction put it well: even sophisticated deep learning models and combinatorial fusion methods struggle to consistently beat simple seed-based heuristics in tournament play. The chaos isn't a bug, it's the product. Single-elimination formats are designed to produce variance, and no amount of feature engineering will predict the kid who banks in a three at the buzzer.
## What the Model Still Likes
I re-ran the model with actual Round 1 results locked in. The surviving teams feed back into the same stacked ensemble, which re-predicts Round of 32 through the Championship and re-runs 10,000 simulations.
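In sketch form, the re-simulation loop looks something like this. Everything here is a stand-in: the team names and ratings are hypothetical, and `win_prob` is a placeholder logistic curve, not the actual stacked ensemble.

```python
import random

def win_prob(team_a, team_b, ratings):
    """Placeholder pairwise prediction: logistic over a rating gap."""
    gap = ratings[team_a] - ratings[team_b]
    return 1 / (1 + 10 ** (-gap / 10))

def simulate_bracket(survivors, ratings, n_sims=10_000, seed=0):
    """Play out the remaining rounds n_sims times; return title odds per team.

    Round 1 is locked in by construction: the field starts from the
    actual survivors rather than the pre-tournament bracket.
    """
    rng = random.Random(seed)
    titles = {t: 0 for t in survivors}
    for _ in range(n_sims):
        field = list(survivors)
        while len(field) > 1:
            # Pair adjacent teams and advance a winner per matchup.
            field = [
                a if rng.random() < win_prob(a, b, ratings) else b
                for a, b in zip(field[::2], field[1::2])
            ]
        titles[field[0]] += 1
    return {t: n / n_sims for t, n in titles.items()}

# Toy four-team region with made-up ratings.
ratings = {"Duke": 30, "TCU": 10, "Houston": 28, "High Point": 8}
print(simulate_bracket(["Duke", "TCU", "Houston", "High Point"], ratings))
```

The design point is the one in the text: the survivors list is fixed, so every simulated path has to route through TCU and High Point instead of Ohio State and Wisconsin.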
The updated championship odds shift, but not as dramatically as you might expect. The top tier is mostly intact because the 1, 2, and 3 seeds all survived. The model's pre-tournament favorite, Duke, is still sitting at the top. Michigan is still in the mix. Houston still looks dangerous.
What did change: the model now has to route predictions through the actual survivors. TCU replacing Ohio State in Duke's bracket path matters. High Point replacing Wisconsin creates a different matchup tree. The ripple effects of eight upsets propagate through every remaining prediction.
I'll publish the updated bracket and odds once Round of 32 wraps up this weekend.
## What I Actually Learned (Part 2)
The original post argued that efficiency margins are the whole game. Round 1 didn't disprove that. The teams with the best efficiency metrics mostly won, and the upsets mostly came from games where the efficiency gap was smallest, exactly where you'd expect randomness to matter most.
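That claim is easy to check empirically: bucket games by the favorite's pre-game efficiency edge and compare upset rates across buckets. The numbers below are hypothetical, just to illustrate the shape of the analysis:

```python
# (efficiency_gap, favorite_won) -- illustrative values, not the real game log.
games = [
    (1.5, False), (2.0, False), (2.8, True), (3.1, False),
    (9.0, True), (11.2, True), (14.5, True), (18.0, True),
]

def upset_rate(games, lo, hi):
    """Upset rate among games whose efficiency gap falls in [lo, hi)."""
    favorite_wins = [won for gap, won in games if lo <= gap < hi]
    return 1 - sum(favorite_wins) / len(favorite_wins)

print(upset_rate(games, 0, 5))   # close games: upsets cluster here
print(upset_rate(games, 5, 20))  # lopsided games: chalk holds
```

If the efficiency-margin story is right, the upset rate should fall off sharply as the gap widens, which is exactly the pattern Round 1 produced.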
But the model also revealed its blind spot: it has no concept of momentum, matchup-specific advantages, or the particular chaos of a team playing its best 40 minutes at exactly the right time. VCU didn't beat North Carolina because their season-long defensive efficiency was better. They beat them because they played a specific brand of disruptive, turnover-forcing basketball that UNC couldn't solve on that particular night.
Machine learning is great at answering "which team is generally better?" It's mediocre at answering "which team wins this specific game?" And in a tournament where one loss sends you home, "mediocre at specifics" is a real problem.
The model will keep running. The upsets will keep coming.
The full code, data, and live tracker are on GitHub.
