The Overfitting Trap: Why Your Backtest Is Lying to You

TL;DR / Key Takeaways

  • A clean backtest is only useful if live forward performance degrades by a small, explainable amount.
  • The bot uses four core inputs instead of many curve-fit conditions: ensemble spread, confidence, model-to-market gap, and fee efficiency.
  • Every production update gets at least 30 days of live-data paper trading before real capital is used.
  • Real-time ensemble convergence is treated as stronger evidence than historical pattern matching.

I spent six weeks getting my backtest to look perfect. Win rate above 80%. Clean equity curve. I was convinced I had cracked prediction market weather trading.

I had not. I had overfit.

Here is what I learned building the Predict & Profit automated trading bot, and why I now trust a 63% live accuracy rate more than a 90% historical one.

What overfitting actually looks like

Overfitting happens when your model learns the noise in historical data instead of the signal. In weather trading for Kalshi prediction markets, this is particularly nasty because weather patterns shift. A model calibrated to El Niño conditions from 2023 behaves differently when the atmosphere moves into a La Niña cycle in 2026.

The test for overfitting is simple: keep adding variables. If backtest accuracy climbs with every new condition, that is not optimization; that is teaching your algorithm to memorize the past. If your strategy needs 15 conditions to converge before it fires, you have probably built something that only works on the specific historical dataset you trained it on, not on the live market.

My approach is the opposite. The Predict & Profit edge scoring system uses four inputs: ensemble spread, ensemble confidence, model-to-market gap, and fee efficiency. Four inputs. Not fourteen. Simplicity is the first defense against overfitting, and it is the reason this system can survive regime changes in the atmosphere that would destroy a heavily curve-fitted model.
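
To make that concrete, here is a minimal sketch of what a four-input edge score can look like. The field names, thresholds, and product-style scoring below are my illustrative choices, not the actual Predict & Profit implementation:

```python
from dataclasses import dataclass

@dataclass
class EdgeInputs:
    """The four inputs described above. Names are illustrative,
    not the actual Predict & Profit schema."""
    ensemble_spread: float      # dispersion of member forecasts, e.g. std dev in degrees F
    ensemble_confidence: float  # fraction of members agreeing on the bracket, 0..1
    model_market_gap: float     # model probability minus Kalshi implied probability
    fee_efficiency: float       # share of the gross edge that survives fees, 0..1

def edge_score(x: EdgeInputs,
               max_spread: float = 3.0,
               min_confidence: float = 0.75,
               min_gap: float = 0.10,
               min_fee_eff: float = 0.50) -> float:
    """Hard filters first, then a simple composite score.
    All thresholds here are placeholders for illustration."""
    if (x.ensemble_spread > max_spread
            or x.ensemble_confidence < min_confidence
            or abs(x.model_market_gap) < min_gap
            or x.fee_efficiency < min_fee_eff):
        return 0.0  # no trade: at least one filter failed
    return x.ensemble_confidence * abs(x.model_market_gap) * x.fee_efficiency
```

The point is the shape of the thing: four hard filters and one score, with nothing for a curve-fitter to latch onto.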

The benchmark that actually matters

| Evaluation mode | Data used | Failure it exposes | Production decision |
| --- | --- | --- | --- |
| Backtest | Historical forecasts and settled markets | Curve fitting to known outcomes | Calibrate thresholds only |
| Forward test | Live forecasts and real-time Kalshi prices | Overfitting, stale data, execution issues | Approve or reject deployment |
| Live trading | Real capital and settled contracts | Full model, fee, and execution risk | Measure realized edge |

A backtest result that degrades slightly in live forward testing is a healthy sign. A backtest that collapses in live trading is proof you never had a real edge to begin with.

Here is the benchmark I use when evaluating any update to the system:

A model that shows 65% accuracy in backtesting and 63% in live forward testing is worth running. A model that shows 90% in backtesting and 55% live is not.

The two-percentage-point degradation is normal. It is the friction between a known historical dataset and the genuinely uncertain future. The 35-point collapse means your strategy was trading the past.
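
As a rule of thumb, that benchmark fits in a few lines. The five-point ceiling below is an illustrative placeholder, not the system's actual cutoff:

```python
def passes_degradation_benchmark(backtest_acc: float,
                                 live_acc: float,
                                 max_drop: float = 0.05) -> bool:
    """Worth running only if live accuracy sits within a small,
    explainable distance of the backtest."""
    return (backtest_acc - live_acc) <= max_drop

passes_degradation_benchmark(0.65, 0.63)  # True: two-point drop, normal friction
passes_degradation_benchmark(0.90, 0.55)  # False: 35-point collapse, trading the past
```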

When I upgraded from the original 31-member GFS-only configuration to the 62-member HGEFS hybrid ensemble, the backtest accuracy did not improve dramatically. What improved was the consistency. The live-to-backtest degradation tightened. That told me the new model was capturing a real meteorological signal, not fitting to historical noise. Two independent modeling approaches — one physics-based, one AI — agreeing on the same outcome is much harder to fake in a backtest than a single curve-fitted model.

My forward testing protocol before anything goes live

Every update to the bot goes through a minimum 30-day paper trading period before it touches real capital. No exceptions.

The paper trading environment uses live Open-Meteo API data and real-time Kalshi order book prices. The only difference from production is that no actual orders are submitted. Every other part of the pipeline runs exactly as it would live — the HGEFS ensemble ingestion, the edge scoring, the fee filter, the execution kill switch.
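
Here is a hedged sketch of what a dry-run gate like that can look like, with hypothetical `pipeline` and `broker` interfaces standing in for the real ones:

```python
import logging

log = logging.getLogger("paper_trader")

def run_cycle(pipeline, broker, paper: bool = True) -> None:
    """One full trading cycle. Every stage runs exactly as in production;
    only the final order submission is gated behind the paper flag.
    `pipeline` and `broker` are hypothetical interfaces, not the bot's real API."""
    forecast = pipeline.ingest_ensembles()   # live Open-Meteo / HGEFS data
    books = broker.fetch_order_books()       # real-time Kalshi prices
    signals = pipeline.score_edges(forecast, books)  # edge scoring + fee filter

    for order in (s for s in signals if pipeline.passes_filters(s)):
        if paper:
            log.info("PAPER order (not submitted): %s", order)
        else:
            broker.submit(order)             # the only line paper trading skips
```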

At the end of 30 days I compare projected edge against actual settlement data. If the live performance variance exceeds 5% of what the backtest projected, the update is rejected and goes back to the drawing board. This process has killed two upgrades that looked strong on paper. It also validated the HGEFS upgrade, which is how I knew it was ready for real money.
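
One literal reading of that acceptance rule, as a sketch. The real check may aggregate per-trade rather than comparing two scalars:

```python
def update_accepted(projected_edge: float,
                    realized_edge: float,
                    tolerance: float = 0.05) -> bool:
    """Gate applied after the 30-day paper window: reject the update when
    realized performance diverges from the backtest projection by more
    than 5% of that projection."""
    if projected_edge <= 0:
        return False  # nothing credible to validate against
    return abs(projected_edge - realized_edge) / projected_edge <= tolerance
```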

Thirty days sounds like a long time to wait. It is also the difference between deploying something real and deploying something that only worked in Excel.

Why real-time convergence beats historical curve fitting

The reason this system resists overfitting better than most approaches is that it does not rely on historical patterns at all. It does not look at what happened in similar conditions in the past. It looks at what 62 independent atmospheric simulations are saying right now, and whether the Kalshi market is pricing the outcome correctly.

That is a real-time probabilistic edge. The atmosphere does not need to repeat a historical pattern for the edge to exist. It just needs the ensemble members to agree, and the market to be wrong.
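
A minimal sketch of that check, assuming each ensemble member produces a forecast high temperature and the Kalshi bracket trades at a yes-price. The 80% agreement and ten-point gap thresholds are placeholders:

```python
def convergence_edge(member_highs: list[float],
                     bracket_low: float,
                     bracket_high: float,
                     market_yes_price: float,
                     min_agreement: float = 0.80,
                     min_gap: float = 0.10) -> float | None:
    """Fraction of ensemble members putting tomorrow's high inside the
    Kalshi bracket, compared against the market's implied probability.
    Returns the gap when both conditions hold, otherwise None."""
    model_prob = sum(bracket_low <= t < bracket_high
                     for t in member_highs) / len(member_highs)
    gap = model_prob - market_yes_price  # the model-to-market gap
    if model_prob >= min_agreement and gap >= min_gap:
        return gap  # members agree and the market is underpricing it
    return None
```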

Historical data is useful for calibrating filter thresholds and testing the fee-efficiency formula. It is not useful for predicting specific temperature outcomes. The 62 GFS and AIGEFS ensemble members running right now, initialized with today's atmospheric conditions, tell you more about tomorrow's high temperature in Chicago than ten years of May averages ever will.

The honest result

The original 31-member system had 1 win and 5 losses before I upgraded and tightened the filters. I published that in the HGEFS upgrade post because hiding it would be pointless and dishonest.

What I know from forward testing is that the stricter convergence requirements on the 62-member system are filtering out the marginal trades that generated those early losses. The agreement requirement between the GFS physics ensemble and the AIGEFS AI ensemble eliminates roughly half the opportunities the old system would have taken. That is by design. I would rather wait for high-confidence setups than force activity on coin-flip signals.
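
A sketch of what that cross-ensemble agreement filter can look like, assuming the 62 members split into a GFS physics sub-ensemble and an AIGEFS AI sub-ensemble. The 70% per-ensemble threshold is a placeholder:

```python
def hybrid_agreement(gfs_highs: list[float],
                     aigefs_highs: list[float],
                     bracket_low: float,
                     bracket_high: float,
                     min_each: float = 0.70) -> bool:
    """Both sub-ensembles must independently favor the same bracket
    before a trade is considered. Marginal setups where only one
    modeling approach likes the outcome are dropped by design."""
    def frac_in_bracket(members: list[float]) -> float:
        return sum(bracket_low <= t < bracket_high for t in members) / len(members)

    return (frac_in_bracket(gfs_highs) >= min_each
            and frac_in_bracket(aigefs_highs) >= min_each)
```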

Forward test results point in that direction. Live P&L will either confirm it or it will not, and I will publish that data regardless of outcome.

Do not fall in love with your backtests. Treat them as a rough calibration tool and forward test everything before you put real money on it.


Want to run a system built around real-time ensemble convergence instead of historical curve fitting? The full Python source code — including the HGEFS pipeline, edge scoring system, and 30-day forward testing framework — is available now.

Get the Source Code — $67

Frequently Asked Questions

Q: What is the technical difference between backtesting and forward testing?

A: Backtesting evaluates a strategy on historical data whose outcomes are already known. Forward testing runs the same pipeline against live market prices and live model data before settlement is known, which exposes overfitting and operational issues that historical tests hide.

Q: Why does the bot require a 30-day paper trading window?

A: The 30-day window forces every update to prove itself on new data, real Kalshi prices, and current weather-model behavior. If projected edge and actual settlement performance diverge by more than the allowed variance, the update is rejected before real capital is used.

Q: Why can a lower live win rate be more trustworthy than a perfect backtest?

A: A modest live win rate with small backtest degradation is evidence that the model is capturing a repeatable signal. A perfect historical result often means the strategy learned noise, especially when many filters were tuned to one historical sample.
