TL;DR / Key Takeaways

ECMWF IFS consistently outperforms GFS at the 5-7 day forecast range, with published RMSE scores showing roughly 10-15% lower 500 hPa geopotential height error at day 5.
The gap matters for Kalshi temperature markets because most contracts settle 3-7 days out, exactly where the model skill divergence is largest.
Neither model is dominant at all ranges or all locations, which is why using both in a weighted ensemble is better than picking one and ignoring the other.
The Predict & Profit Weather Bot v2.1 uses a 4-source grand ensemble of up to 164 members, with ECMWF IFS carrying the highest weight at 0.30 specifically because of its verified skill advantage.

This is a question I get asked regularly: if ECMWF is better, why not just use ECMWF and ignore GFS entirely? It is a reasonable question. The answer requires looking at what "better" actually means in the context of a specific trading application, not just general model benchmarking.

Let me walk through the actual verification data, then explain the ensemble logic.

The Benchmark Standard: 500 hPa Anomaly Correlation

The meteorological community measures forecast skill primarily using two metrics: anomaly correlation coefficient (ACC) at 500 hPa geopotential height, and root mean square error (RMSE) for 2-meter temperature and 500 hPa height fields.

The 500 hPa level is used because it represents mid-tropospheric flow, which drives surface weather patterns. A forecast that gets the mid-level circulation wrong will almost certainly get surface temperatures wrong too.
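
As a minimal sketch of how the two metrics are defined, the snippet below computes an uncentered ACC and an RMSE over a handful of grid points. The function names and sample values are illustrative assumptions. Operational verification also applies latitude weighting and a long-term climatology, both omitted here.

```python
import math

def anomaly_correlation(forecast, observed, climatology):
    """Uncentered ACC: correlation of forecast and observed anomalies
    relative to climatology. No area weighting (simplifying assumption)."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    o_anom = [o - c for o, c in zip(observed, climatology)]
    numerator = sum(f * o for f, o in zip(f_anom, o_anom))
    denominator = math.sqrt(sum(f * f for f in f_anom) * sum(o * o for o in o_anom))
    return numerator / denominator

def rmse(forecast, observed):
    """Root mean square error between forecast and observed values."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(forecast))

# Made-up 500 hPa heights in metres at four grid points
climatology = [5500.0, 5520.0, 5560.0, 5600.0]
observed = [5460.0, 5530.0, 5600.0, 5590.0]
forecast = [5470.0, 5525.0, 5590.0, 5600.0]
print(round(anomaly_correlation(forecast, observed, climatology), 3))
print(round(rmse(forecast, observed), 2))
```

An ACC of 1.0 means the forecast anomaly pattern matches the observed one exactly. RMSE is in the units of the field, so it is the more direct measure for temperature.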

ECMWF publishes its own verification scores publicly through its Forecast Verification pages. NOAA publishes GFS verification through the Environmental Modeling Center. The comparison has been consistent for over a decade.

At day 5 (120 hours), ECMWF IFS 500 hPa ACC over the Northern Hemisphere typically runs around 0.93-0.95. GFS runs around 0.88-0.92 over the same period. That gap looks small in absolute terms. In forecast skill terms, it represents roughly a 6-12 hour lead time advantage: ECMWF's day 5 forecast is about as accurate as a GFS forecast at day 4.5 to 4.75.

ECMWF's own published verification reports (available at ecmwf.int/en/forecasts/charts/monitoring/) show this consistently across multiple years and verification periods. The 2023 and 2024 annual reports both document the IFS skill advantage in the medium range (days 4-7).

For surface temperature specifically, which is what Kalshi temperature markets settle on, ECMWF's 2-meter temperature RMSE at day 5 is typically 2.0-2.5°C over North America. GFS runs higher, around 2.5-3.2°C over the same period and region. Those are rough ranges pulled from published EMC verification graphics and ECMWF scorecard data. The exact numbers shift with season and domain.

Where GFS Holds Its Own

GFS is not simply inferior across the board. There are regimes and lead times where the gap narrows or reverses.

At short range (days 1-2), both models perform well and the skill difference is small. The gap opens up in the medium range (days 4-7), which is exactly where Kalshi temperature contracts tend to live.

GFS has a denser observation assimilation cycle in some respects, particularly for the continental US, and it has caught up significantly since it moved to the FV3 dynamical core (GFS v15, 2019) and the GFS v16 upgrade that followed in 2021. The National Centers for Environmental Prediction have published year-over-year verification showing steady GFS improvement. It is a much better model than it was five years ago.

GFS also has an availability advantage. It runs four times per day (00Z, 06Z, 12Z, 18Z) with full ensemble output publicly available through Open-Meteo and NOAA's own servers with no authentication required. The data pipeline is well-documented and stable.

And practically: GFS ensemble data is free, well-structured, and arrives quickly. The 31-member GEFS (GFS ensemble system) gives you a probability distribution, not just a single deterministic run. That matters more than which model has the lower RMSE in isolation.

The ECMWF Ensemble: 51 Members and Why That Matters

ECMWF's ensemble forecast system (IFS ENS) runs 51 members: one control run plus 50 perturbed members. The initial perturbations are built around singular vectors, which are mathematically designed to pick out the directions in which initial-condition errors grow fastest, combined with perturbations from an ensemble of data assimilations (EDA).

This is not a trivial point. The way you perturb the initial conditions determines whether your ensemble spread is meaningful or just random noise. ECMWF's singular vector approach is theoretically grounded and has been validated against verification data over decades.

The result is that ECMWF ensemble spread tends to be better calibrated than GFS ensemble spread. "Calibrated" means the ensemble spread actually predicts the range of outcomes that materialize in reality. A well-calibrated ensemble tells you something real about forecast uncertainty. A poorly calibrated one gives you spread that is either too tight (overconfident) or too wide (uninformative).
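
One common way to check calibration is a rank histogram: for each past forecast, count how many members fall below the verifying observation. The sketch below is illustrative only. The function name and toy data are assumptions, and ties between members and observations are ignored.

```python
from collections import Counter

def rank_histogram(ensembles, observations):
    """Count, for each case, how many members are below the observation.
    Ranks run from 0 to n_members. A roughly flat histogram suggests good
    calibration; a U shape suggests spread that is too tight."""
    n_members = len(ensembles[0])
    counts = Counter(
        sum(1 for m in members if m < obs)
        for members, obs in zip(ensembles, observations)
    )
    return [counts.get(rank, 0) for rank in range(n_members + 1)]

# Toy data: three 4-member forecasts (deg F) and the verifying observations
ensembles = [
    [80.0, 82.0, 83.0, 85.0],
    [70.0, 71.0, 72.0, 73.0],
    [90.0, 91.0, 93.0, 94.0],
]
observations = [86.0, 69.5, 92.0]
print(rank_histogram(ensembles, observations))
```

With real data you would want hundreds of cases before reading anything into the shape. Three cases only show the mechanics.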

Published ECMWF verification documents consistently show IFS ENS ranked as the best global ensemble system by most WMO verification metrics. The THORPEX Interactive Grand Global Ensemble (TIGGE) database, which collects ensemble data from multiple operational centers, has been used for intercomparison studies. Multiple published papers using TIGGE data (e.g., Park et al. 2008, Hagedorn et al. 2012) document ECMWF's ensemble superiority in both RMSE and reliability diagrams.

For Kalshi trading, calibration matters directly. If I am using ensemble spread to estimate my probability distribution for where temperature will settle, a better-calibrated ensemble gives me a more accurate edge estimate. An overconfident ensemble (too tight spread) will lead to trades sized as if the edge is better than it is.
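
The simplest way to turn ensemble members into a contract probability is to count the members on the winning side of the threshold. This is a sketch, not the bot's code: the function name is hypothetical, and it assumes the contract resolves "yes" when the daily high is at or above the threshold, which may not match every contract's rules.

```python
def exceedance_probability(member_highs, threshold):
    """Fraction of ensemble members whose forecast daily high
    meets or exceeds the threshold."""
    if not member_highs:
        raise ValueError("need at least one ensemble member")
    hits = sum(1 for high in member_highs if high >= threshold)
    return hits / len(member_highs)

# Hypothetical 10-member forecast of the daily high (deg F)
members = [83.1, 84.0, 84.6, 85.2, 85.9, 86.3, 86.8, 87.4, 88.0, 89.5]
print(exceedance_probability(members, 85.0))  # 0.7
```

If the spread is too tight, members pile up on one side of the threshold and this fraction drifts toward 0 or 1, which is the overconfidence problem described above.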

The GFS Ensemble: 31 Members, Different Perturbation Approach

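GEFS (Global Ensemble Forecast System) runs 31 members: one control plus 30 perturbed. Its initial perturbations historically came from the ensemble transform with rescaling (ETR) method, and more recent versions draw them from an ensemble Kalman filter (EnKF) data assimilation cycle. Either way, it is a different approach from ECMWF's singular vectors.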

GFS ensemble spread historically ran too tight, which meant overconfident probability estimates. The move to the FV3 dynamical core in GEFS v12 has addressed this partially, but independent verification studies still tend to rank ECMWF ENS above GEFS in ensemble reliability.

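31 members versus 51 members also means the tails of the distribution are less well-sampled in GEFS: one member stands for about 3.2% of the probability mass in a 31-member ensemble, versus about 2% in a 51-member one. For temperature markets, extreme outcomes (hot or cold outliers) matter: Kalshi contracts often cluster near the forecast median, so the outermost contracts are priced off tail probabilities. If the ensemble is under-sampling the tails, you are missing the contracts where edge might exist.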
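
To put rough numbers on the member-count difference, the sketch below compares the probability resolution and the binomial sampling error for the two ensemble sizes. It treats members as independent draws, which real ensemble members are not, so read the output as an optimistic bound. The function name is hypothetical.

```python
import math

def probability_standard_error(p, n_members):
    """Binomial standard error of a probability estimated from n_members,
    assuming members behave like independent draws (a simplification)."""
    return math.sqrt(p * (1.0 - p) / n_members)

for n in (31, 51):
    resolution = 1.0 / n  # smallest nonzero probability the ensemble can express
    se = probability_standard_error(0.05, n)
    print(f"{n} members: resolution={resolution:.3f}, SE at p=0.05: {se:.3f}")
```

For a true 5% tail event, the sampling error is close to the size of the probability itself at either member count, which is one more argument for pooling several ensembles.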

Why the Bot Uses Both

Here is the actual logic from the Weather Bot v2.1 ensemble weighting:

ENSEMBLE_SOURCES = {
    "GFS": {
        "provider": "open_meteo",
        "members": 31,
        "weight": 0.25,
    },
    "AIGEFS": {
        "provider": "aws_s3",
        "members": 31,
        "weight": 0.25,
    },
    "ECMWF_IFS": {
        "provider": "open_meteo",
        "members": 51,
        "weight": 0.30,
    },
    "ECMWF_AIFS_ENS": {
        "provider": "open_meteo",
        "members": 51,
        "weight": 0.20,
    },
}

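def compute_grand_ensemble_probability(source_probs: dict, weights: dict) -> float:
    """
    Weighted average of per-source probabilities.
    source_probs: {"GFS": 0.72, "AIGEFS": 0.68, "ECMWF_IFS": 0.81, "ECMWF_AIFS_ENS": 0.75}
    weights: flat {source: weight} mapping built from ENSEMBLE_SOURCES above,
        e.g. {name: cfg["weight"] for name, cfg in ENSEMBLE_SOURCES.items()}
    Weights are renormalized over the sources present in source_probs.
    """
    # Only sources that actually reported a probability contribute
    total_weight = sum(weights[k] for k in source_probs)
    if total_weight <= 0:
        raise ValueError("no weighted sources available")
    weighted_sum = sum(source_probs[k] * weights[k] for k in source_probs)
    return weighted_sum / total_weight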

ECMWF IFS gets 0.30, the highest single weight. GFS and AIGEFS each get 0.25. ECMWF AIFS-ENS gets 0.20 because it is newer and has less published verification history, even though early results are promising.
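
A short usage sketch, run on the example probabilities from the docstring above. It assumes the function receives a flat name-to-weight mapping, so the weights are pulled out of the nested config first. The trimmed config and the `weights` variable are illustrative.

```python
# Trimmed copy of the config above, keeping only the weights
ENSEMBLE_SOURCES = {
    "GFS": {"weight": 0.25},
    "AIGEFS": {"weight": 0.25},
    "ECMWF_IFS": {"weight": 0.30},
    "ECMWF_AIFS_ENS": {"weight": 0.20},
}

def compute_grand_ensemble_probability(source_probs, weights):
    """Weighted average, renormalized over the sources present."""
    total_weight = sum(weights[k] for k in source_probs)
    weighted_sum = sum(source_probs[k] * weights[k] for k in source_probs)
    return weighted_sum / total_weight

# Flatten {"GFS": {"weight": 0.25}, ...} into {"GFS": 0.25, ...}
weights = {name: cfg["weight"] for name, cfg in ENSEMBLE_SOURCES.items()}
source_probs = {"GFS": 0.72, "AIGEFS": 0.68, "ECMWF_IFS": 0.81, "ECMWF_AIFS_ENS": 0.75}
grand = compute_grand_ensemble_probability(source_probs, weights)
print(round(grand, 3))  # 0.743
```

The straight average of the same four numbers is 0.740, so the weighting moves the estimate only slightly here. It matters more when the ECMWF sources and the GFS sources sit further apart.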

The reason to include GFS at all, despite ECMWF's skill advantage, comes down to three things.

First, multi-model ensembles consistently outperform any single model ensemble in published intercomparison studies. This has been shown repeatedly in TIGGE-based research. Even adding a lower-skill model to a grand ensemble reduces error compared to using only the best model, as long as the lower-skill model has partially independent errors. GFS and ECMWF use different dynamical cores, different data assimilation schemes, and different physics parameterizations. Their errors are not perfectly correlated, which means combining them reduces ensemble mean error.
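
The "partially independent errors" condition can be made concrete with the variance of a weighted sum. The sketch below assumes two unbiased forecasts, takes the midpoints of the day 5 RMSE ranges quoted earlier as error standard deviations, and uses an illustrative 0.7/0.3 split. None of these numbers are the bot's, and the error correlation `rho` is a free parameter here.

```python
import math

def combined_rmse(sigma_a, sigma_b, weight_a, rho):
    """RMSE of weight_a * A + (1 - weight_a) * B for two unbiased forecasts
    with error standard deviations sigma_a, sigma_b and error correlation rho."""
    weight_b = 1.0 - weight_a
    variance = (
        (weight_a * sigma_a) ** 2
        + (weight_b * sigma_b) ** 2
        + 2.0 * weight_a * weight_b * rho * sigma_a * sigma_b
    )
    return math.sqrt(variance)

# Illustrative day-5 errors in deg C (midpoints of the ranges quoted earlier)
better, worse = 2.25, 2.85
for rho in (0.6, 0.9, 1.0):
    print(rho, round(combined_rmse(better, worse, 0.7, rho), 2))
```

With these inputs the blend beats the better model alone (2.25) at a correlation of 0.6 but not at 0.9. The benefit depends on how independent the errors really are, which is why differences in dynamical core, assimilation, and physics matter.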

Second, the agreement filter. The bot requires at least 3 of 4 ensemble sources to agree on direction before placing a trade. If ECMWF says "above threshold" and GFS says "below threshold," that disagreement is itself information. It means the situation is genuinely uncertain and the edge estimate is unreliable. Including GFS in the check catches cases where ECMWF might be confidently wrong.
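
A minimal sketch of such a filter follows. How the bot defines "direction" is not spelled out here, so this version assumes a source votes "above threshold" when its probability is over 0.5 and "below" otherwise. The function name and the `min_agree` parameter are hypothetical.

```python
def agreement_check(source_probs, min_agree=3):
    """Return (votes_for_majority_side, total_sources, passes).
    Assumption: a source votes 'above' when its probability is over 0.5."""
    above = sum(1 for p in source_probs.values() if p > 0.5)
    below = len(source_probs) - above
    majority = max(above, below)
    return majority, len(source_probs), majority >= min_agree

# All four sources lean the same way
print(agreement_check({"GFS": 0.71, "AIGEFS": 0.69, "ECMWF_IFS": 0.83, "ECMWF_AIFS_ENS": 0.78}))
# GFS-based and ECMWF-based sources split two against two
print(agreement_check({"GFS": 0.42, "AIGEFS": 0.45, "ECMWF_IFS": 0.66, "ECMWF_AIFS_ENS": 0.61}))
```

The first call passes at 4/4. The second returns 2/4 and fails, even though a weighted average of those four probabilities would still land above 0.5.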

Third, operational reliability. If Open-Meteo's ECMWF feed has an issue (it happens, infrequently but it happens), the bot degrades gracefully rather than going blind. Having multiple data sources is a systems architecture decision as much as a statistical one.
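
One way graceful degradation could look in code: drop any source that returned nothing and renormalize the remaining weights. This is a sketch under assumptions. The function name, the use of `None` to mark an outage, and the `min_sources` floor are all hypothetical.

```python
WEIGHTS = {"GFS": 0.25, "AIGEFS": 0.25, "ECMWF_IFS": 0.30, "ECMWF_AIFS_ENS": 0.20}

def grand_probability_with_outages(source_probs, weights, min_sources=3):
    """Weighted mean over sources that returned data (value is not None).
    Returns None when too few sources are available to form an estimate."""
    available = {k: p for k, p in source_probs.items() if p is not None}
    if len(available) < min_sources:
        return None
    total_weight = sum(weights[k] for k in available)
    return sum(p * weights[k] for k, p in available.items()) / total_weight

# ECMWF IFS feed unavailable for this cycle
probs = {"GFS": 0.72, "AIGEFS": 0.68, "ECMWF_IFS": None, "ECMWF_AIFS_ENS": 0.75}
print(round(grand_probability_with_outages(probs, WEIGHTS), 3))  # 0.714
```

With all four sources reporting, the same inputs plus an ECMWF IFS value of 0.81 give 0.743, so losing the highest-weight source shifts the estimate but does not stop the pipeline.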

The ENSEMBLE_COMPARE Log Line

Every trade candidate the bot evaluates produces a log line that shows per-source probabilities before the weighted combination is computed:

ENSEMBLE_COMPARE | market=HIGHNY-2026-06-20-T85 | GFS=0.71 | AIGEFS=0.69 |
ECMWF_IFS=0.83 | ECMWF_AIFS_ENS=0.78 | grand_ensemble=0.755 |
market_price=0.62 | edge=0.135 | agreement=4/4 | TRADE_FIRED

When you see a line like this with 4/4 agreement and ECMWF showing notably higher probability than GFS, that is a case where the ECMWF skill advantage is probably manifesting as real edge. The models with better medium-range skill are pulling the grand ensemble probability higher than the market price has adjusted to.
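
If you want to audit these lines in bulk, a small parser is enough. The sketch below assumes the fields are separated by `|` and written as `key=value`, as in the example, and that the wrapped log entry has been joined onto one line. The function name and the sample line's numbers are hypothetical.

```python
def parse_ensemble_compare(line):
    """Split a '|'-separated ENSEMBLE_COMPARE line into a dict.
    key=value fields become entries; bare tokens are collected under 'flags'."""
    record = {"flags": []}
    for token in (t.strip() for t in line.split("|")):
        if not token:
            continue
        if "=" in token:
            key, value = token.split("=", 1)
            try:
                record[key] = float(value)
            except ValueError:
                record[key] = value  # e.g. the market ticker or "4/4"
        else:
            record["flags"].append(token)
    return record

# Hypothetical line in the same format, joined onto one line
line = (
    "ENSEMBLE_COMPARE | market=HIGHNY-2026-06-20-T85 | GFS=0.60 | AIGEFS=0.58 | "
    "ECMWF_IFS=0.70 | ECMWF_AIFS_ENS=0.66 | grand_ensemble=0.637 | "
    "market_price=0.55 | edge=0.087 | agreement=4/4 | TRADE_FIRED"
)
rec = parse_ensemble_compare(line)
recomputed_edge = rec["grand_ensemble"] - rec["market_price"]
print(rec["flags"], round(recomputed_edge, 3))
```

Recomputing the edge from the parsed fields is a cheap sanity check that the logged numbers are internally consistent.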

When the sources split so that fewer than 3 of 4 agree, for example both GFS-based sources pointing one way and both ECMWF-based sources pointing the other, the trade does not fire. Not because one is definitively right, but because the uncertainty is too high to have confidence in the edge estimate.

What the Published Verification Data Does Not Tell You

Model verification scores are computed over large geographic domains and long averaging periods. The ECMWF advantage at 500 hPa over the Northern Hemisphere is real and consistent. But Kalshi temperature markets settle at specific ASOS stations in specific cities.

Local model performance varies. GFS sometimes performs better than ECMWF for specific cities during specific seasons due to local topographic effects, coastal dynamics, or other regional factors that global skill scores smooth over. Neither ECMWF nor NOAA publishes city-level skill breakdowns by season, at least not publicly and not for the specific stations Kalshi uses.

This is one reason the ensemble approach is more defensible than picking a winner. If you had 10 years of verification data for ECMWF IFS vs GFS specifically at LaGuardia Airport in June, you might be able to optimize weights differently. That data is not publicly available at that resolution.
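
For illustration only, here is what a station-level weighting could look like if that history existed. The sketch uses inverse mean-squared-error weights, a common simple scheme that ignores error correlation between sources. The function name and the error values are made up, and this is not how the bot sets its weights.

```python
def inverse_mse_weights(errors_by_source):
    """Weights proportional to 1 / MSE of each source's past errors,
    normalized to sum to 1. Ignores error correlation between sources."""
    inverse = {
        name: len(errs) / sum(e * e for e in errs)
        for name, errs in errors_by_source.items()
    }
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

# Made-up forecast-minus-observed errors (deg F) at one station
history = {
    "GFS": [2.0, -3.0, 1.0, -2.0],
    "ECMWF_IFS": [1.0, -2.0, 1.0, -1.0],
}
station_weights = inverse_mse_weights(history)
print({name: round(w, 2) for name, w in station_weights.items()})
```

With only a few seasons of data per station and month, weights fitted this way would be noisy, which is another reason to fall back on the published large-domain evidence.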

The honest position is: use the published evidence to set weights, use the agreement filter to catch cases where the evidence is contradictory, and let the math run.

The Practical Edge for Kalshi Temperature Markets

Most Kalshi temperature contracts settle somewhere in the 3-7 day forecast range. That is exactly where ECMWF's skill advantage over GFS is most pronounced, and where having 51 well-calibrated ensemble members matters most for probability estimation.

If you are making temperature probability estimates using a single deterministic GFS run, you are leaving a lot of information on the table. You get no uncertainty estimate, no model disagreement signal, and no way to know whether the forecast is in a high-confidence or low-confidence meteorological regime.

The Weather Bot uses 164 total ensemble members across four systems. ECMWF IFS carries the heaviest single weight because the verification literature supports it. GFS stays in the mix because multi-model ensembles beat single-model ensembles and because model errors are partially independent.

Divergence between the models is a warning signal. Convergence is the edge.

The published data gives you the weights. The agreement filter enforces discipline when the models disagree. That combination is what separates systematic trading from guessing at weather forecasts.

ECMWF vs GFS: Which Weather Model Is Actually More Accurate for Kalshi Trading?