
ECMWF AIFS: The AI Forecast Model That's Quietly Beating the Physics Models

TL;DR / Key Takeaways

  • ECMWF's AIFS-ENS is a pure machine learning model trained on ERA5 reanalysis data, with no atmospheric physics equations baked in. It learned dynamics purely from data.
  • On headline skill scores like RMSE and anomaly correlation, AIFS matches or beats the deterministic IFS at medium range, particularly for 500 hPa geopotential and 850 hPa temperature.
  • AIFS and GraphCast (the backbone of NOAA's AIGEFS) both use graph-based neural network architectures, but they differ in training pipeline, ensemble generation strategy, and operational deployment path.
  • The Predict & Profit Weather Bot runs all four systems together: GFS, AIGEFS, ECMWF IFS, and ECMWF AIFS-ENS. When AI models and physics models agree, that convergence is the edge.

There is a pattern in forecasting history that repeats itself. A new approach shows up. The establishment ignores it. The scores get published. The establishment gets uncomfortable. Then, quietly, the new approach gets folded into operations.

ECMWF's AIFS is at that stage right now.

The European Centre for Medium-Range Weather Forecasts has been the gold standard in numerical weather prediction for decades. Their physics-based Integrated Forecasting System has consistently outperformed every other operational model, including GFS. When ECMWF says their own AI model is now competitive with their own physics model, that is not a casual claim.


What AIFS Actually Is

AIFS stands for Artificial Intelligence Forecasting System. ECMWF began publishing research on it in 2023 and moved it into limited operational status through 2024 and into 2025.

The architecture is a graph neural network trained on ERA5, ECMWF's flagship atmospheric reanalysis dataset. ERA5 covers 1940 to present at 0.25-degree global resolution with 137 vertical levels. That is a massive training corpus. The model learned to predict atmospheric state transitions purely by seeing how the atmosphere actually behaved across decades of reanalysis.

No equations of motion. No parameterization schemes for convection or boundary layer physics. No numerical integration of fluid dynamics. Just: here is what the atmosphere looked like at time T, predict what it looks like at T+6h. Repeat to T+240h.
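The "predict T+6h, then repeat" loop is an autoregressive rollout. Here is a minimal sketch of that loop, assuming a single-state input and a placeholder step function. The names rollout and toy_step are hypothetical and stand in for the trained network, which is not reproduced here.

```python
from typing import Callable, List, Tuple

State = List[float]  # stand-in for a full gridded atmospheric state


def rollout(
    step_fn: Callable[[State], State],
    initial_state: State,
    step_hours: int = 6,
    horizon_hours: int = 240,
) -> List[Tuple[int, State]]:
    """Apply a one-step model repeatedly, feeding each output back in as input."""
    if horizon_hours % step_hours != 0:
        raise ValueError("horizon_hours must be a multiple of step_hours")
    trajectory = []
    state = initial_state
    for lead in range(step_hours, horizon_hours + 1, step_hours):
        state = step_fn(state)  # the forecast for T+lead becomes the next input
        trajectory.append((lead, state))
    return trajectory


def toy_step(state: State) -> State:
    # Hypothetical placeholder for the trained network: damps each value by 1%.
    return [0.99 * x for x in state]


forecast = rollout(toy_step, [280.0, 285.0])
print(len(forecast), forecast[0][0], forecast[-1][0])  # prints: 40 6 240
```

Forty chained steps get you from T+6h to T+240h. Because every step consumes the previous step's output, any error the model makes early on is carried into every later lead time.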

The ECMWF technical memo that describes the architecture in detail is Memo 887, "AIFS: A data-driven model for medium-range weather forecasting," published in 2024. It is publicly available on the ECMWF website. I am not going to pretend I found it accidentally. I went looking for it specifically because I was weighting this model in my bot and I wanted to understand what I was actually trusting.


The Skill Score Numbers

ECMWF published a direct head-to-head comparison between AIFS and their deterministic IFS in their 2024 evaluation work. The headline results surprised a lot of people.

At day 3 through day 7 forecast range, AIFS matches IFS on 500 hPa geopotential height for most of the globe. At day 7 and beyond, AIFS degrades more gracefully than the deterministic IFS in some regions, particularly in the Southern Hemisphere where observations are sparse and the physics model's error accumulation becomes a bigger problem.

For 2-meter temperature, AIFS is competitive through day 5. After that the deterministic IFS still holds a slight edge in the Northern Hemisphere mid-latitudes.

For precipitation, the AI models including AIFS are still weaker than the physics ensemble. Precipitation is a subgrid process. You are asking the model to predict something that depends on convective dynamics at scales the training data does not fully resolve. The physics ensemble handles this better right now.

ECMWF's own blog post from September 2024 titled "AIFS evaluation: performance highlights" summarizes it this way: AIFS has "comparable or better skill than the deterministic IFS for many variables at medium range." That is ECMWF saying this about their own competing system. They have no incentive to overstate it.

Building on earlier efforts such as FourCastNet, the Pangu-Weather paper in Nature and the GraphCast paper in Science, both from 2023, established that AI models could match physics models at medium range. AIFS is ECMWF's in-house answer to that, trained on their own ERA5 reanalysis with their own infrastructure.
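For readers who have not worked with the headline scores in this section, here is a rough sketch of RMSE and anomaly correlation over a handful of grid points. It is a simplification: operational verification weights grid points by latitude, which is omitted here, and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def rmse(forecast: Sequence[float], verification: Sequence[float]) -> float:
    """Root-mean-square error between forecast and verifying values."""
    n = len(forecast)
    return math.sqrt(sum((f - v) ** 2 for f, v in zip(forecast, verification)) / n)


def anomaly_correlation(
    forecast: Sequence[float],
    verification: Sequence[float],
    climatology: Sequence[float],
) -> float:
    """Uncentered correlation of departures from climatology (no area weighting)."""
    fa = [f - c for f, c in zip(forecast, climatology)]
    va = [v - c for v, c in zip(verification, climatology)]
    numerator = sum(a * b for a, b in zip(fa, va))
    denominator = math.sqrt(sum(a * a for a in fa) * sum(b * b for b in va))
    return numerator / denominator


# Made-up 500 hPa geopotential heights in metres at four grid points.
clim = [5500.0, 5600.0, 5700.0, 5800.0]
verif = [5520.0, 5580.0, 5730.0, 5790.0]
fcst = [5515.0, 5590.0, 5720.0, 5795.0]

print(round(rmse(fcst, verif), 2))
print(round(anomaly_correlation(fcst, verif, clim), 3))
```

Lower RMSE is better. Anomaly correlation rewards getting the pattern of departures from climatology right, which is why the two scores are usually reported together.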


How AIFS Differs from GraphCast / AIGEFS

Both AIFS and GraphCast use graph neural network architectures. Both are trained on ERA5. Both produce deterministic forecasts at 0.25-degree global resolution. The surface similarity ends there.

GraphCast was developed by Google DeepMind and published in Science in December 2023. NOAA adopted it as the backbone for AIGEFS (Project EAGLE), which went operational in December 2025. That is the model my Weather Bot pulls from the NOAA AWS S3 bucket.

AIFS was developed entirely internally at ECMWF. They did not simply reuse DeepMind's published architecture. They built their own graph transformer with attention mechanisms tuned to their specific training pipeline. The technical differences are documented in ECMWF Memo 887, but the practical upshot is: two different teams, two different training pipelines, similar architecture family, trained on the same ERA5 data.

The key operational difference is ensemble generation. GraphCast in its original paper produces a single deterministic forecast. NOAA's AIGEFS runs 31 members by perturbing initial conditions, similar to how traditional ensemble systems work.

ECMWF's AIFS-ENS, the ensemble variant, uses a different approach. Rather than just perturbing initial conditions, ECMWF explored stochastic perturbation of the model itself, adding noise to the latent space during the forecast run. This is meant to better represent model uncertainty, not just initial condition uncertainty. Their 2024 paper on AIFS-ENS, titled "AIFS-CRPS: Ensemble Forecasting using a Model Trained with a Loss Function based on the Continuous Ranked Probability Score," goes into detail. The CRPS loss function they use for training is specifically designed to produce well-calibrated probabilistic forecasts, not just accurate point estimates.

This matters. An ensemble that is well-calibrated in its spread is more useful than one that is overconfident or underconfident. For my purposes, an overconfident model that shows narrow probability distributions drives bad position sizing.
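To see why a CRPS-style score punishes overconfidence, here is a sketch of the standard empirical CRPS estimator for one ensemble and one observation. This is the textbook form, not ECMWF's exact training loss, and the temperatures are invented for the example.

```python
from typing import Sequence


def ensemble_crps(members: Sequence[float], observation: float) -> float:
    """Empirical CRPS: mean |x_i - y| minus half the mean |x_i - x_j|. Lower is better."""
    m = len(members)
    if m == 0:
        raise ValueError("need at least one ensemble member")
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    return accuracy - 0.5 * spread


observed_high = 30.0
narrow = [25.0, 25.5, 26.0]  # tightly clustered members that miss the outcome
wide = [24.0, 27.0, 30.0]    # broader ensemble that covers the outcome

print(round(ensemble_crps(narrow, observed_high), 3))  # 4.278
print(round(ensemble_crps(wide, observed_high), 3))    # 1.667
```

The tightly clustered ensemble scores much worse because it was confident and wrong. A model trained against this kind of score is pushed toward honest spread, not just a good ensemble mean.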


Why Running Both AI Models Matters

The question I had to answer when building the v2.1 Weather Bot was not "which model is best." It was "what combination of models gives me the most reliable probability estimate."

The answer, based on skill score literature and my own backtesting, is that diversity in the ensemble matters more than picking a single winner.

Physics models and AI models fail in different ways. The deterministic IFS accumulates numerical error through time integration. AIFS can produce smooth but physically inconsistent fields, particularly for extreme events that were rare in the training data. GFS has known biases in certain teleconnection patterns. AIGEFS/GraphCast has been operational for less than a year and its failure modes are still being characterized.

When all four systems agree on a temperature outcome, the probability estimate is more trustworthy than any single model could produce. That is the whole premise of ensemble forecasting, and it is why the ECMWF IFS ensemble with 51 members outperforms the deterministic IFS on probabilistic skill metrics.

The Predict & Profit Weather Bot runs GFS (31 members, weight 0.25), AIGEFS (31 members, weight 0.25), ECMWF IFS (51 members, weight 0.30), and ECMWF AIFS-ENS (51 members, weight 0.20). Total: up to 164 members from four independent systems.

ECMWF IFS gets the highest weight because it has the longest verified operational track record. AIFS-ENS gets a slightly lower weight than AIGEFS despite contributing more members (51 versus 31): AIGEFS's operational track record, while short, is being actively evaluated by NOAA. Both AI ensemble weights will likely change as more verification data accumulates.
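A sketch of how four per-source probabilities could be blended with these weights. The dictionary keys and the blend function are illustrative names, and renormalising the weights when a source is missing is an assumption of this sketch, not a description of the bot's internals.

```python
from typing import Dict, Optional

# Weights as stated above; the key names are illustrative.
WEIGHTS: Dict[str, float] = {"gfs": 0.25, "aigefs": 0.25, "ifs": 0.30, "aifs": 0.20}


def blend(
    source_probs: Dict[str, Optional[float]],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Weighted average over the sources that actually reported a probability."""
    available = {
        source: prob
        for source, prob in source_probs.items()
        if prob is not None and source in weights
    }
    total_weight = sum(weights[source] for source in available)
    if total_weight == 0:
        raise ValueError("no usable sources")
    # Assumption: if a source is missing, the remaining weights are renormalised.
    return sum(weights[s] * p for s, p in available.items()) / total_weight


# Hypothetical probabilities, not taken from a real run.
all_four = blend({"gfs": 0.60, "aigefs": 0.60, "ifs": 0.70, "aifs": 0.50})
one_missing = blend({"gfs": 0.60, "aigefs": 0.60, "ifs": 0.70, "aifs": None})
print(round(all_four, 4), round(one_missing, 4))
```

With all four sources present the weights sum to 1.0, so the blend is a plain weighted mean. The "up to 164 members" wording suggests a source can be absent on a given run, which is why the sketch handles that case explicitly.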

The agreement filter in the bot (--min-agreement 3) requires at least 3 of the 4 systems to point the same direction before a trade is considered. This directly operationalizes the concept of model consensus. The bot does not care which model is theoretically best. It cares whether they agree.


A Concrete Example: What the Log Looks Like

When the bot runs a candidate trade, the log output includes an ENSEMBLE_COMPARE line that shows per-source probability breakdown. Something like this:

2026-06-14 06:31:17 INFO [weather_bot] Evaluating: KORD-2026-06-15-HIGH-ABOVE-85
2026-06-14 06:31:17 INFO [weather_bot] ENSEMBLE_COMPARE ticker=KORD-2026-06-15-HIGH-ABOVE-85
  gfs_prob=0.71 (weight=0.25, members=31)
  aigefs_prob=0.68 (weight=0.25, members=31)
  ifs_prob=0.74 (weight=0.30, members=51)
  aifs_prob=0.72 (weight=0.20, members=51)
  grand_ensemble_prob=0.7163
  market_price=0.61
  edge=+0.1063
  agreement=4/4 [PASS]
  min_agreement=3 [PASS]
2026-06-14 06:31:17 INFO [weather_bot] dollar_edge_check: abs(0.1063) * 10 contracts = $1.063 >= $0.50 [PASS]
2026-06-14 06:31:17 INFO [weather_bot] TRADE_FIRED ticker=KORD-2026-06-15-HIGH-ABOVE-85 yes_prob=0.7163 market=0.61 edge=0.1063 contracts=10

Four systems. All pointing roughly the same direction. The market is pricing the event at 61 cents. The ensemble says 71.6%. That is real edge, not noise.
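The edge and dollar-edge lines in that log are simple arithmetic. Here is a sketch, assuming each contract pays $1 at settlement and taking the logged grand_ensemble_prob as given. The function names are hypothetical.

```python
def edge(model_prob: float, market_price: float) -> float:
    """Model probability minus market-implied probability (price in dollars per $1 payout)."""
    return model_prob - market_price


def passes_dollar_edge(edge_value: float, contracts: int, min_dollars: float = 0.50) -> bool:
    """True if the expected edge across the whole position clears the minimum."""
    return abs(edge_value) * contracts >= min_dollars


e = edge(0.7163, 0.61)
print(round(e, 4))                           # 0.1063
print(round(abs(e) * 10, 3))                 # 1.063
print(passes_dollar_edge(e, contracts=10))   # True
```

The dollar check is what keeps the bot from firing on a mathematically positive edge that is too small to matter at the position size being considered.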

Now imagine only AIFS is pointing high while GFS and IFS are near 50%. The agreement filter kills that trade. Maybe AIFS is right. Maybe it found something the physics models missed. But I am not willing to bet real money on a single AI model going against two physics ensembles. Not with the verification record we have so far.

def check_agreement(
    source_probs: dict[str, float], min_agreement: int, threshold: float = 0.55
) -> tuple[bool, int]:
    """
    Returns (passes_filter, agreement_count).
    A source 'agrees' if its probability is at least threshold for YES
    or at most (1 - threshold) for NO, consistent with the trade direction.
    """
    # Trade direction follows the blended probability, not any single source.
    direction = "yes" if source_probs["grand_ensemble"] >= 0.50 else "no"
    agreement_count = 0
    for source, prob in source_probs.items():
        if source == "grand_ensemble":
            continue  # the blend itself does not get a vote
        if direction == "yes" and prob >= threshold:
            agreement_count += 1
        elif direction == "no" and prob <= (1.0 - threshold):
            agreement_count += 1
    return agreement_count >= min_agreement, agreement_count

Simple. The elegance is in what it prevents, not what it does.
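A quick usage sketch covering both cases discussed above. The function is restated compactly so the snippet runs on its own. The first input uses the probabilities from the log excerpt; the second is a made-up scenario where only AIFS points high.

```python
def check_agreement(source_probs, min_agreement, threshold=0.55):
    """Same logic as the filter above, restated so this snippet is self-contained."""
    direction = "yes" if source_probs["grand_ensemble"] >= 0.50 else "no"
    count = 0
    for source, prob in source_probs.items():
        if source == "grand_ensemble":
            continue
        if direction == "yes" and prob >= threshold:
            count += 1
        elif direction == "no" and prob <= (1.0 - threshold):
            count += 1
    return count >= min_agreement, count


# Probabilities from the log excerpt: all four sources clear the 0.55 threshold.
consensus = {"gfs": 0.71, "aigefs": 0.68, "ifs": 0.74, "aifs": 0.72, "grand_ensemble": 0.7163}

# Hypothetical dissent case: AIFS alone is high, the rest sit near 50%.
lone_aifs = {"gfs": 0.50, "aigefs": 0.52, "ifs": 0.51, "aifs": 0.80, "grand_ensemble": 0.568}

print(check_agreement(consensus, min_agreement=3))  # (True, 4)
print(check_agreement(lone_aifs, min_agreement=3))  # (False, 1)
```

In the second case the blended probability still leans YES, but only one source clears the threshold, so the trade never reaches the edge check.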


The Calibration Question

One thing I want to flag about AI weather models that the headline skill scores do not capture well: reliability diagrams and calibration.

A model can have excellent RMSE and still be poorly calibrated. Calibration asks: when the model says 70% probability of exceeding a threshold, does that event actually happen 70% of the time? If it only happens 55% of the time, the model is overconfident and you will systematically overprice YES contracts.
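The check behind a reliability diagram can be sketched in a few lines: bin the forecast probabilities, then compare each bin's average forecast with how often the event actually happened. The function name and the sample data are illustrative, built to mirror the 70% versus 55% case.

```python
from typing import List, Sequence, Tuple


def reliability_table(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean forecast prob, observed frequency, count) for each non-empty bin."""
    prob_sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, outcome in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the top bin
        prob_sums[b] += p
        hits[b] += outcome
        counts[b] += 1
    return [
        (prob_sums[b] / counts[b], hits[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Made-up sample: 20 forecasts of 0.70, of which only 11 verified.
probs = [0.70] * 20
outcomes = [1] * 11 + [0] * 9

for mean_prob, observed_freq, count in reliability_table(probs, outcomes):
    print(f"forecast={mean_prob:.2f} observed={observed_freq:.2f} n={count}")
```

A well-calibrated model has observed frequency close to the mean forecast in every bin. Bins with few samples are noisy, so the count column matters as much as the other two.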

ECMWF published spread-skill analysis for AIFS-ENS in their evaluation work. The ensemble spread is reasonably well-calibrated at short to medium range, but like many ensemble systems it tends to become under-dispersive at longer leads. Under-dispersive means the ensemble members cluster too tightly, producing false confidence.
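A spread-skill check can be sketched as the ratio of average ensemble spread to the RMSE of the ensemble mean across many forecast cases. This simplified version ignores finite-ensemble-size corrections, and the numbers are invented.

```python
import math
from typing import Sequence


def spread_skill_ratio(
    ensembles: Sequence[Sequence[float]], observations: Sequence[float]
) -> float:
    """Mean ensemble spread divided by RMSE of the ensemble mean. Below 1 is under-dispersive."""
    variance_sum = 0.0
    squared_error_sum = 0.0
    for members, obs in zip(ensembles, observations):
        m = len(members)
        mean = sum(members) / m
        variance_sum += sum((x - mean) ** 2 for x in members) / (m - 1)  # sample variance
        squared_error_sum += (mean - obs) ** 2
    n = len(observations)
    spread = math.sqrt(variance_sum / n)
    rmse = math.sqrt(squared_error_sum / n)
    return spread / rmse


# Two made-up forecast cases: members sit within 1 degree, the mean misses by 2.
cases = [[9.0, 10.0, 11.0], [19.0, 20.0, 21.0]]
observed = [12.0, 18.0]
print(spread_skill_ratio(cases, observed))  # 0.5
```

A ratio near 1 means the spread is an honest estimate of the error. A ratio well below 1, as in this toy case, is the under-dispersive pattern described above.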

For Kalshi temperature markets, which mostly settle within 1-5 days of trade entry, I am operating in the range where AIFS-ENS calibration is strongest. If I were trying to trade 10-day forecasts I would weight it differently.

NOAA has begun publishing verification statistics for AIGEFS/GraphCast against radiosondes and surface observations. Early numbers are consistent with the DeepMind published evaluation. I watch that verification page. The AI model track records are short. They are getting longer every month.


What ECMWF Itself Says About the Future

ECMWF published a roadmap called "Strategy 2021-2030" and has updated their AI model plans several times since. The direction is unambiguous: they are moving toward a hybrid system where AI and physics-based components coexist and reinforce each other.

The ECMWF blog post from November 2024 on "The path to an AI-enhanced IFS" describes work on using AI models to improve initial conditions, using physics models to constrain AI forecasts, and running AI ensembles as a complement to the IFS ensemble rather than a replacement.

This is the right answer. The people who built the IFS know better than anyone where it struggles. They also know better than anyone what AI can and cannot do yet. The fact that they are building a hybrid rather than replacing the physics model outright tells you something important about where the actual skill ceiling is.

For my purposes, this hybrid future is already here. I am running it on a RackNerd VPS for $97.


One Honest Limitation

AIFS-ENS is not the same as the full ECMWF IFS ensemble when it comes to extreme events. A deep cut-off low producing heavy precipitation, an anomalously strong ridge, a rapid cyclogenesis event: these are exactly the situations where physics models have been specifically tuned and validated over decades. They are also exactly the situations that AI models trained on historical reanalysis data can miss.

Extreme events are rare in the training data by definition. The model has seen fewer examples of them and may not capture the dynamics correctly. This is a known limitation in the literature and ECMWF has been explicit about it.

For temperature markets on Kalshi, which are mostly asking about high or low temperature at a single station, I am less exposed to this problem. I am not trading precipitation events. But I want to be honest about what I know the AI models do worse at. Anyone who tells you AI weather models are uniformly better than physics models is not reading the evaluation papers.


ECMWF spent 50 years building a physics model that became the best in the world. Then, in five years, they built an AI model that challenges it at medium range. The physics model is not going away. The AI model is not going away. Running both, with proper weighting and a filter that requires consensus, is the most defensible approach I know of.

That is what v2.1 does. The sources are all public. The papers are all findable. The data is free. The edge is in using it systematically when most traders are not using it at all.