
Published October 26, 2025 by Claude with cresencio

When Models Go to War: Chicago @ Baltimore and the Art of Prediction Disagreement

Disclaimer: This analysis is for educational and entertainment purposes only. This is not betting advice, financial advice, or a recommendation to wager on sporting events. Always gamble responsibly.

The Most Divisive Game of 2025

On Sunday, October 26th, the Chicago Bears (4-2) travel to Baltimore to face the Ravens (1-5). On paper, this looks straightforward: a winning team versus a struggling one. But when you run the numbers through five different prediction models, something extraordinary happens.

The models don't just disagree—they go to war.
  • ELO Rating System: Ravens 70.5%
  • Logistic Regression: Bears 83.5%
  • XGBoost: Bears 51.0%
  • Bayesian Model: Ravens 71.9%
  • Ensemble Average: Bears 58.5%

Expressed on a common scale, that is a 55.4-point spread between the highest and lowest predictions: 83.5% Bears at one extreme, and 71.9% Ravens (i.e., 28.1% Bears) at the other. It is the widest disagreement of the entire 2025 season. For context, most NFL games see prediction spreads under 20%. When models agree within 10%, you have consensus. When they diverge by 50% or more, you have chaos.
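To see where the 55.4% figure comes from, put every model on the same scale (probability that the Bears win) before measuring disagreement. A minimal sketch using the probabilities quoted above:

```python
# Each model's prediction, converted to P(Bears win).
# A "Ravens 70.5%" call becomes 1 - 0.705 = 0.295 for the Bears.
predictions = {
    "elo": 1 - 0.705,        # Ravens 70.5%
    "logistic": 0.835,       # Bears 83.5%
    "xgboost": 0.510,        # Bears 51.0%
    "bayesian": 1 - 0.719,   # Ravens 71.9%
    "ensemble": 0.585,       # Bears 58.5%
}

spread = max(predictions.values()) - min(predictions.values())
print(f"Spread: {spread:.1%}")  # 83.5% - 28.1% = 55.4%
```
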

But this chaos isn’t random. It’s a window into how different modeling philosophies interpret the same reality in fundamentally different ways. And by understanding why these models disagree, we can learn something profound about both football and prediction science.

A Tale of Two Modeling Philosophies

The Traditionalists: ELO and Bayesian

ELO (70.5% Ravens) and Bayesian (71.9% Ravens) represent the conservative, historically-grounded approach to prediction.

ELO Rating Systems work like chess rankings: teams earn or lose rating points based on game outcomes, with bigger upsets causing larger point swings. The system is elegant, transparent, and ruthlessly mathematical. A team’s rating is its accumulated proof of quality over time.

ELO systems typically achieve 65-70% prediction accuracy in NFL forecasting—solid, but not spectacular. They excel at identifying long-term quality but struggle with rapid change.
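The core ELO machinery fits in a few lines. This is a generic sketch using the standard chess-style expected-score formula; the ratings and K-factor below are hypothetical, chosen only so the output lands near the 70.5% quoted above, not the actual constants behind these predictions:

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Probability that team A beats team B under the ELO model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating: float, expected: float, actual: float, k: float = 20) -> float:
    """Move a rating toward the observed result (1=win, 0=loss); upsets move it further."""
    return rating + k * (actual - expected)

# Hypothetical ratings: Baltimore's accumulated strength plus a home-field bonus.
bears = 1500
ravens_at_home = 1650  # illustrative number only
p_ravens = elo_expected(ravens_at_home, bears)
print(f"P(Ravens win): {p_ravens:.1%}")  # ~70.3%
```

A 150-point rating gap translates to roughly a 70% win probability, which is why a slow-moving rating plus home field can still favor a 1-5 team.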

Bayesian Hierarchical Models take a different but complementary approach. Rather than a single rating, Bayesian models estimate probability distributions over team strength, explicitly quantifying uncertainty. They use partial pooling to share information across teams while respecting individual identity.

Bayesian models use all current-season data efficiently but are computationally heavy. They provide interpretable parameters and calibrated probabilities, meaning when a Bayesian model says 72%, it genuinely means a 72% chance, not just a relative confidence score.
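Partial pooling can be illustrated with a simple shrinkage estimate: a team's observed performance is pulled toward a prior, with the pull weakening as the team plays more games. This is a toy normal-normal sketch with made-up prior values, not the production model:

```python
def shrunk_estimate(team_mean: float, n_games: int,
                    prior_mean: float, prior_strength: float = 8.0) -> float:
    """Normal-normal partial pooling: blend observed data with a prior.

    prior_strength behaves like a pseudo-game count; with 6 real games
    against a prior worth 8 games, the prior still dominates slightly.
    """
    weight = n_games / (n_games + prior_strength)
    return weight * team_mean + (1 - weight) * prior_mean

# Baltimore's observed -8.3 point differential after 6 games, shrunk toward
# a hypothetical preseason belief of +4.0 (strong team entering 2025).
estimate = shrunk_estimate(team_mean=-8.3, n_games=6, prior_mean=4.0)
print(f"Shrunk strength estimate: {estimate:+.1f} points/game")  # about -1.3
```

Despite a 1-5 record and a -8.3 differential, the shrunk estimate is only slightly below average, which is exactly why a prior-driven model can still like Baltimore at home.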

Why do they favor Baltimore?

  1. Pre-season priors: Both models begin the season with beliefs about team quality based on historical performance. Baltimore entered 2025 with strong ratings from prior years.

  2. Home field advantage: ELO assigns roughly +65 rating points (≈3 points on the spread) for playing at home. Bayesian models similarly encode home advantage.

  3. Slow updating: These models are designed to avoid overreacting to small samples. Six games isn’t enough to completely override their prior beliefs.

  4. Rating momentum: Baltimore’s 1-5 record has hurt their rating, but not enough to overcome their starting point plus home field.

The Reactionaries: Logistic Regression

Logistic Regression (83.5% Bears) takes the opposite philosophical stance.

Logistic models are the workhorses of modern prediction. They take a set of features (recent performance, offensive efficiency, defensive metrics, etc.) and learn statistical weights for each feature to maximize classification accuracy. They're fast to implement and widely adopted in real-world sports analytics, with baseline accuracy around 65-70%.

But here’s the key: logistic regression is only as good as its features and their recency weighting.

Our logistic model is clearly optimized to weight recent games heavily. It sees:

  • Chicago’s 3-0 streak in their last three games
  • Baltimore’s 0-3 free fall over the same period
  • Baltimore scoring just 13 total points in their last two games (6.5 PPG)
  • Baltimore allowing 32.3 points per game on the season

To a feature-based model that trusts recent data, this isn’t close. Chicago is hot, Baltimore is collapsing. The model doesn’t care about Baltimore’s history or reputation—it cares about what happened in Weeks 5, 6, and 7.

This is the classic tension in prediction: Do you trust the long view or the recent view? History or current form?
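The recency-weighting mechanism can be sketched with exponential decay on per-game point margins feeding a single-feature logistic function. The margins, decay rate, and coefficient below are all hypothetical illustrations, not the article's actual model:

```python
import math

def recency_weighted_margin(margins: list[float], decay: float = 0.7) -> float:
    """Average point margin with newer games weighted more heavily."""
    weights = [decay ** i for i in range(len(margins))]  # index 0 = most recent
    return sum(w * m for w, m in zip(weights, margins)) / sum(weights)

def logistic(x: float) -> float:
    return 1 / (1 + math.exp(-x))

# Hypothetical per-game margins, most recent first.
bears_form = recency_weighted_margin([3, 7, 4, -10, -6, 2])
ravens_form = recency_weighted_margin([-14, -34, -3, 5, -2, -2])

# One-feature logistic model: the form gap, scaled by an assumed coefficient.
p_bears = logistic(0.12 * (bears_form - ravens_form))
print(f"P(Bears win): {p_bears:.1%}")
```

With heavy decay, Baltimore's recent blowout losses dominate its form score, and the model comes out strongly bullish on Chicago, mirroring the 83.5% call.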

The Hedger: XGBoost

XGBoost (51.0% Bears) represents the machine learning middle ground.

Gradient Boosted Trees are among the most accurate methods for football prediction, with AUC scores around 0.87 for college football. They capture nonlinear interactions among team and game features and are flexible enough to identify patterns that linear models miss.

But XGBoost in this game is essentially saying: “I have no idea.”

A 51% prediction is a model throwing up its hands. It’s receiving conflicting signals:

  • Traditional indicators (home field, season-long ratings) favor Baltimore
  • Recent performance trends favor Chicago
  • The features don’t resolve cleanly into a strong directional signal

This is actually intellectually honest behavior. When a model is genuinely uncertain, it should say so rather than forcing confidence. XGBoost sees both sides of the argument and can’t pick a clear winner.

Tree-based models are robust to irrelevant or missing data but require careful hyperparameter tuning. An uncertain prediction might mean the model is properly calibrated—or it might mean it’s not tuned correctly for this specific edge case.

What the Data Actually Says

Let’s step back from modeling philosophy and look at what’s actually happening on the field.

Chicago Bears: The Lucky Winners

Record: 4-2
Points Scored: 25.3 per game
Points Allowed: 25.8 per game
Point Differential: -0.5 per game

Yes, you read that correctly. Chicago has a winning record despite being outscored on the season. They’ve accumulated a total point differential of -3 points across six games.

This is the classic profile of a team playing above their fundamentals. Chicago is 2-1 in one-score games—winning 66.7% of games decided by seven points or fewer. They’re clutch in close moments, but the underlying numbers suggest this isn’t sustainable.

Pythagorean Expectation (a formula that estimates expected wins from points scored and allowed) says Chicago should have roughly 2.9-3.1 wins, not 4. They're outperforming their expectation by approximately 1.1 wins.
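The Pythagorean calculation is easy to reproduce. The sketch below uses an exponent of 2.37, a common choice for the NFL (other sources use values up to ~2.68; both land near the same answer here):

```python
def pythagorean_wins(points_for: float, points_against: float,
                     games: int, exponent: float = 2.37) -> float:
    """Expected wins from scoring totals (Pythagorean expectation)."""
    pf, pa = points_for ** exponent, points_against ** exponent
    return pf / (pf + pa) * games

# Chicago: 25.3 scored, 25.8 allowed, over 6 games.
expected = pythagorean_wins(25.3 * 6, 25.8 * 6, games=6)
print(f"Expected wins: {expected:.1f}")  # ~2.9, versus an actual 4
```
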

Advanced feature engineering accounts for opponent-adjusted stats and for whether a team's results reflect efficiency or luck. Chicago's profile screams regression candidate.

Baltimore Ravens: The Catastrophic Collapse

Record: 1-5
Points Scored: 24.0 per game
Points Allowed: 32.3 per game
Point Differential: -8.3 per game

Baltimore’s season is a disaster. But the concerning part isn’t just the record—it’s the trajectory.

Last Three Games:

  • Week 5 vs Houston: 10-44 (lost by 34 at home)
  • Week 6 vs LA: 3-17 (lost by 14 at home)
  • Week 7: Bye

In their last two home games, Baltimore scored 13 total points. That’s not a slump—that’s a systematic offensive failure. This is a team that cannot execute basic offensive football right now.

Their defense isn’t helping either. Baltimore has allowed 37+ points in four of six games. When you combine an anemic offense with a porous defense, you get a -8.3 point differential per game, among the worst in the NFL.

Home Field Advantage? Baltimore is 1-3 at home this season. Home field is typically worth roughly +65 ELO points, or about 3 points on the spread. But that assumes home field actually helps. For Baltimore in 2025, playing at home seems to make no difference.

The Ensemble’s Verdict

So what happens when you average all five models together?

Ensemble Prediction: Bears 58.5%

Ensemble methods generally deliver best-in-class predictive performance; some published results claim stacked ensembles reaching roughly 82% betting win-rates versus about 70% for single models, though such figures should be treated cautiously. The logic is simple: different models capture different aspects of reality, and averaging reduces individual model bias.

In this case, the ensemble prediction is a diplomatic compromise:

  • Three models favor Chicago (Logistic 83.5%, XGBoost 51%, Ensemble 58.5%)
  • Two models favor Baltimore (ELO 70.5%, Bayesian 71.9%)

The ensemble is saying: “Chicago is more likely to win, but it’s close enough that Baltimore could absolutely pull this off.”
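Mechanically, an ensemble is just a (possibly weighted) average of the base models' probabilities on a common scale. Note that an unweighted mean of the four base models here gives about 48% Bears, so landing at 58.5% implies the actual ensemble weights the recency-sensitive models more heavily. The weights below are hypothetical, chosen only to illustrate the mechanism:

```python
# Base-model probabilities that the Bears win (Ravens calls flipped).
base = {"elo": 0.295, "logistic": 0.835, "xgboost": 0.510, "bayesian": 0.281}

# Unweighted mean: every model gets an equal vote.
mean = sum(base.values()) / len(base)
print(f"Unweighted: {mean:.1%}")  # 48.0% Bears

# Hypothetical weights favoring the recency-sensitive models.
weights = {"elo": 0.15, "logistic": 0.45, "xgboost": 0.25, "bayesian": 0.15}
weighted = sum(weights[m] * p for m, p in base.items())
print(f"Weighted:   {weighted:.1%}")  # ~59% Bears
```
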

Predicted Score: Chicago 22, Baltimore 24

Wait—the ensemble predicts Chicago to win but Baltimore to score more points? This apparent contradiction reveals the uncertainty baked into the prediction. The models are so divided that even the direction of the outcome is unclear. This is the hallmark of a true toss-up game.

What Does Model Disagreement Mean for Prediction?

This game is a perfect teaching moment about the nature of prediction itself.

Model disagreement isn’t a bug—it’s information.

When models agree strongly (like the Week 8 Colts over Titans at 92%), we have high confidence. Multiple independent approaches converging on the same answer is powerful evidence.

When models disagree (like this 55.4% spread), we have one of two situations:

  1. Genuine uncertainty: The game is truly close, and different reasonable approaches can’t distinguish between outcomes.
  2. Model error: One or more models are systematically wrong about something fundamental.

This concept is called calibration. A well-calibrated model doesn’t just predict accurately; its stated confidence matches its actual hit rate. If a model says 70% and it’s right 70% of the time, it’s calibrated. If it says 70% but it’s right only 50% of the time, it’s overconfident.
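Calibration can be checked empirically by bucketing predictions and comparing each bucket's average stated confidence to its actual hit rate. A minimal sketch on toy data (the games and outcomes below are invented for illustration):

```python
from collections import defaultdict

def calibration_table(predictions: list[float], outcomes: list[int], n_bins: int = 5):
    """Group predictions into bins; compare average confidence to hit rate per bin."""
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    rows = []
    for b in sorted(bins):
        pairs = bins[b]
        avg_conf = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(y for _, y in pairs) / len(pairs)
        rows.append((avg_conf, hit_rate, len(pairs)))
    return rows

# Toy data: ~70% calls that hit ~80%, and ~30% calls that hit ~20%.
preds = [0.70, 0.72, 0.68, 0.71, 0.69, 0.30, 0.31, 0.29, 0.32, 0.28]
wins  = [1,    1,    1,    0,    1,    0,    0,    1,    0,    0]
for conf, rate, n in calibration_table(preds, wins):
    print(f"avg confidence {conf:.2f} -> hit rate {rate:.2f} over {n} games")
```

With ten games the gaps here are pure small-sample noise; in practice you need many games per bin before a confidence/hit-rate gap means miscalibration rather than variance.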

ELO and Bayesian models are generally well-calibrated because they’re built on decades of statistical theory. But they’re slow to adapt to rapid change.

Logistic models can be extremely accurate when recent trends matter. But they can also overfit to noise and mistake variance for signal.

XGBoost can capture complex interactions. But its “no idea” response here suggests either genuine uncertainty or a model that’s not tuned for this edge case.

The ensemble tries to balance these perspectives. It’s saying: “Recent form matters more than history this time, but not by as much as the logistic model thinks.”

Why Each Model Sees It Differently

Let’s break down what’s happening with each modeling approach for this specific matchup:

1. ELO’s weakness is showing: ELO cannot capture nonlinear interactions and its accuracy plateaus with sudden shifts. Baltimore’s collapse is nonlinear—it’s not a gradual decline but a sudden offensive implosion. ELO’s slow updating mechanism can’t react fast enough.

2. Logistic regression’s strength: Logistic models are fast to implement and have minimal overfitting when designed well. If recent form is properly weighted, they excel at detecting trend breaks. Chicago’s hot streak and Baltimore’s cold streak are exactly the kind of features logistic regression was built to find.

3. XGBoost’s honest uncertainty: Tree-based ensembles handle nonlinear relationships but require careful hyperparameter tuning. The 51% prediction isn’t failure—it’s the model saying the features genuinely don’t resolve. That’s more honest than forcing confidence.

4. Bayesian’s philosophical choice: Bayesian models naturally quantify uncertainty and use partial pooling to balance team-level and league-level information. The 72% Baltimore prediction reflects strong priors that haven’t fully updated. This is by design—Bayesian methods are conservative to avoid overreacting to small samples.

5. Ensemble’s compromise: Ensembles reduce individual model bias and maximize predictive performance. The 58.5% Bears prediction is exactly what stacking should produce when base models disagree—a weighted average that hedges the extreme positions.

So What Should We Expect?

If I had to synthesize all of this into a prediction framework:

The Case for Chicago (58.5% probability):

  • Riding a genuine 3-0 hot streak with improving offensive execution
  • Baltimore’s offense is catastrophically broken (3 points, 10 points in last two home games)
  • Baltimore’s defense is porous (32.3 PPG allowed)
  • Baltimore’s home field advantage is effectively neutralized (1-3 at home)
  • Recent form matters more than season-long ratings in mid-season

The Case for Baltimore (41.5% probability):

  • Chicago’s record is built on luck (negative point differential despite 4-2)
  • Pythagorean expectation says Chicago should regress
  • Home teams still win more often than not in the NFL
  • Baltimore could have a bounce-back game after the bye week
  • ELO and Bayesian models have strong historical track records

The Expected Outcome: A close, ugly game. Both teams have defensive issues. Chicago has been winning on clutch execution in close games. Baltimore’s offense is broken, but they could show up after rest.

Predicted Score Range: 20-24 for the winner, 17-21 for the loser. Low-scoring, decided by 3-7 points, genuinely uncertain outcome.

The Educational Takeaway

This game exemplifies why football prediction is both a science and an art.

The science: Multiple statistical approaches, rigorous feature engineering, empirically validated models, ensemble techniques.

The art: Knowing when to trust history versus recent form, understanding model assumptions and limitations, recognizing genuine uncertainty versus model error.

There is no single best model. Gradient boosting achieves higher raw accuracy, but logistic regression is more interpretable. Bayesian models quantify uncertainty better, but ELO is faster to compute. Ensembles combine strengths but at the cost of complexity.

The best approach depends on:

  • What you’re trying to predict (win probability, point spread, total points)
  • How much data you have (more data favors complex models)
  • How quickly you need updates (real-time favors simple models)
  • Whether you need interpretability (regulation, trust, education favor transparent models)

For this specific game, the models are teaching us that:

  1. Recent performance and historical quality are genuinely in conflict
  2. Dramatic offensive collapses (Baltimore) are harder for rating systems to capture quickly
  3. Lucky winning streaks (Chicago) can fool simpler models
  4. Model disagreement itself is valuable information about uncertainty

The Final Predictions

After synthesizing all five models, the research literature, and the underlying data:

Primary Prediction: Chicago Bears win, 22-20
Confidence Level: Low-Moderate (58.5%)
Predicted Total: 42 points
Key Factor: Baltimore’s offensive collapse is real and severe

Why Chicago: The logistic model is likely correct that recent form outweighs historical ratings in this specific case. Baltimore scoring 3 and 10 points in consecutive home games isn’t noise—it’s a systemic problem. Chicago’s 3-0 streak, while built on luck, still represents functional offensive execution that Baltimore currently lacks.

Why It’s Close: Chicago’s fundamentals are weak (negative point differential), and Baltimore could easily rebound after a bye week. The ELO and Bayesian models aren’t wrong to be skeptical—they’re capturing real uncertainty about whether recent trends persist or reverse.

The Recommendation (Educational Only): If forced to pick, take Chicago. But recognize this is essentially a coin flip with a slight edge. The 58.5% prediction means Chicago wins this game about 6 out of 10 times in expectation—which also means Baltimore wins 4 out of 10 times.

This is the art of prediction: not claiming certainty where none exists, but making informed probabilistic judgments based on the best available evidence and modeling approaches.


Want More?

If you’re interested in the raw data, modeling code, or want to discuss prediction methodologies, reach out to me on X @Cresencio. I’m always happy to share data and talk football analytics.


Disclaimer: This analysis is for educational and entertainment purposes only. All predictions are based on statistical models and historical data. Past model performance does not guarantee future results. This is not betting advice, financial advice, or a recommendation to wager on sporting events. Please gamble responsibly and within your means. The author and publisher assume no responsibility for decisions made based on this analysis.

Written by Claude with cresencio
