TL;DR / Key Takeaways

NOAA's AIGEFS (Project EAGLE, built on Google DeepMind's GraphCast architecture) became operational in December 2025, moving from experimental status to live operational output.
Published verification data shows GraphCast-class models outperforming GFS on 500 hPa geopotential height at lead times beyond 72 hours, with the gap widening at 5-7 days.
The Weather Bot v2.1 pulls AIGEFS directly from NOAA's public AWS S3 bucket, the same source NOAA uses operationally, with no authentication required.
The Predict & Profit ensemble weights ECMWF IFS highest (0.30) and treats AIGEFS as a confirmation source (0.25), a deliberate choice based on published skill scores, not preference.

In December 2025, NOAA moved its AI-enhanced global forecast system from experimental to operational. If you trade Kalshi temperature markets using a single-model forecast, that transition matters more than you probably realize.

This post covers what AIGEFS actually is, what the published verification numbers say, and how those numbers influenced the ensemble weighting in Weather Bot v2.1.

What AIGEFS Is and Where It Came From

NOAA's AIGEFS is the operational name for what internally was called Project EAGLE. The underlying architecture is Google DeepMind's GraphCast, a graph neural network trained on 39 years of ERA5 reanalysis data (1979 to 2017) at 0.25-degree resolution.

GraphCast was published in Science in November 2023. DeepMind's paper showed it outperforming ECMWF's deterministic HRES on 90% of the 1380 verification targets tested, including 10-meter wind speed, 2-meter temperature, and 500 hPa geopotential height at multiple lead times. That result got NOAA's attention.

Project EAGLE took GraphCast and adapted it for operational use, adding a 31-member ensemble by perturbing initial conditions (similar in concept to how GEFS, the ensemble counterpart to GFS, generates ensemble spread, though the perturbation strategy differs). NOAA ran it in parallel with its physics-based systems for most of 2025 before declaring it operational in December.

The operational output publishes to a public AWS S3 bucket: noaa-nws-graphcastgfs-pds.s3.amazonaws.com. No authentication. No rate limiting. GRIB2 format with standard .idx index files.

What the Verification Data Actually Shows

Let me be specific, because "AI beats physics model" is a claim that gets thrown around loosely and deserves scrutiny.

The DeepMind Science paper (Lam et al., 2023) compared GraphCast against ECMWF HRES (the deterministic high-resolution model, not the ensemble IFS) on ERA5 verification targets. Key headline results:

500 hPa geopotential height (Z500): GraphCast outperforms HRES at all lead times beyond roughly 3 days (72 hours), with the skill gap widening through day 7 and day 10.
850 hPa temperature (T850): GraphCast outperforms HRES beyond approximately 2-3 days.
2-meter temperature (T2m): Mixed. GraphCast performs comparably to HRES in the medium range but shows larger errors than HRES at short ranges (under 24 hours). This matters for temperature trading specifically.

The short-range T2m finding is worth sitting with. For a 24-hour temperature market on Kalshi, GraphCast's advantage over GFS may be modest. For a 3-to-5-day market, the advantage becomes more meaningful.

ECMWF published their own verification of GraphCast and similar AI models (their blog, January 2024, comparing against their own HRES). They confirmed the skill at medium range while noting that AI models tend to produce smoother, less extreme forecasts ("regression to the mean" behavior), which can underestimate tails in high-impact weather events.

That tail behavior is a direct concern for trading bots. A market asking "will the high exceed 95°F?" sits in the tail of the temperature distribution. A model that systematically smooths extremes will underestimate the probability of those outcomes.
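
To make the tail concern concrete, here is a minimal sketch with made-up numbers. It assumes the probability is estimated as the fraction of ensemble members above the threshold, and it mimics smoothing by pulling each member halfway toward the ensemble mean. The function names, the ten member values, and the 0.5 shrink factor are all hypothetical; none of them come from AIGEFS output or from the bot.

```python
import numpy as np

def exceedance_probability(member_highs_f, threshold_f):
    """Fraction of ensemble members whose forecast high exceeds the threshold."""
    highs = np.asarray(member_highs_f, dtype=float)
    return float(np.mean(highs > threshold_f))

def shrink_toward_mean(member_highs_f, factor):
    """Pull each member toward the ensemble mean (factor < 1 narrows the spread)."""
    highs = np.asarray(member_highs_f, dtype=float)
    return highs.mean() + factor * (highs - highs.mean())

# Hypothetical forecast highs (deg F) from a 10-member ensemble, mean 94.0.
members = [90.0, 91.0, 92.0, 93.0, 94.0, 94.0, 95.0, 96.0, 97.0, 98.0]

raw_p = exceedance_probability(members, 95.0)
smooth_p = exceedance_probability(shrink_toward_mean(members, 0.5), 95.0)
print(raw_p, smooth_p)
```

With these numbers the raw ensemble puts 3 of 10 members above 95°F (0.30), while the smoothed version leaves only 2 (0.20), even though the ensemble mean is unchanged.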

NOAA's own verification reports for AIGEFS through early 2026 show consistent improvement over the operational GFS at 3-to-7-day lead times on Z500 and T850, with more variable results for T2m. Those reports are available through NOAA's Environmental Modeling Center (EMC) verification pages, though they update irregularly.

How This Affects Ensemble Weighting

Weather Bot v2.1 runs four sources with the following weights:

ECMWF IFS: 0.30
GFS: 0.25
NOAA AIGEFS: 0.25
ECMWF AIFS-ENS: 0.20
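
One simple way to turn those weights into a composite probability is a weighted average of each source's probability for the market outcome. The sketch below assumes that kind of linear pooling, and it also assumes the weights are renormalized when a source is missing. The function name, dictionary keys, and sample probabilities are illustrative; this is a sketch under those assumptions, not the bot's actual code.

```python
# Weights from the list above; the key names are hypothetical labels.
WEIGHTS = {
    "ecmwf_ifs": 0.30,
    "gfs": 0.25,
    "aigefs": 0.25,
    "ecmwf_aifs_ens": 0.20,
}

def composite_probability(source_probs, weights=WEIGHTS):
    """Weighted average of per-source probabilities.

    Sources that are missing or None are dropped and the remaining
    weights are renormalized (an assumption, not documented behavior).
    """
    available = {
        name: p for name, p in source_probs.items()
        if name in weights and p is not None
    }
    if not available:
        raise ValueError("no usable sources")
    total_weight = sum(weights[name] for name in available)
    return sum(weights[name] * p for name, p in available.items()) / total_weight

# Hypothetical per-source probabilities that a city's high exceeds a threshold.
probs = {"ecmwf_ifs": 0.70, "gfs": 0.55, "aigefs": 0.65, "ecmwf_aifs_ens": 0.60}
composite = composite_probability(probs)
print(round(composite, 4))
```

Because the four weights sum to 1.0, the renormalization only matters when a source fails to load for a cycle.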

The ECMWF IFS weight is highest. That is a deliberate choice. The IFS ensemble (51 members) has the longest and most rigorously documented verification record of any operational global ensemble. ECMWF consistently leads NOAA's GFS on ensemble mean RMSE and ensemble spread calibration. That has been true for most of the past decade across independent verification studies, including the WMO's global NWP index.

AIGEFS gets the same weight as GFS (0.25), not because I think they are equivalent but because AIGEFS is newer and its operational verification record is shorter. The published skill numbers are promising. The operational track record is about six months. Weighting a newer system equal to a 30-year-old ensemble that has been tuned extensively is a reasonable prior. I am not discounting it. I am not overweighting it either.

ECMWF AIFS-ENS (0.20) is ECMWF's own AI-based ensemble, built on ECMWF's in-house AIFS neural network architecture. It went operational at ECMWF in mid-2025. It gets the lowest weight for the same reason AIGEFS is not weighted above GFS: a short operational track record, even though the underlying skill scores are strong.

The agreement filter in the bot (--min-agreement 3) means at least 3 of the 4 sources need to agree directionally before a trade fires. A single dissenting source does not block a trade, but a 2-2 split does, no matter what the weighted probability says. The weights determine the composite probability. The agreement filter is the kill switch.
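
The filter logic can be sketched as a simple vote count. This assumes "direction" means which side of the market-implied probability each source lands on, which is one plausible reading rather than a confirmed detail of the bot. The function names and sample numbers are hypothetical.

```python
def directional_votes(source_probs, market_price):
    """+1 if a source is above the market-implied probability, -1 if below, 0 if equal."""
    votes = {}
    for name, p in source_probs.items():
        if p > market_price:
            votes[name] = 1
        elif p < market_price:
            votes[name] = -1
        else:
            votes[name] = 0
    return votes

def passes_agreement(source_probs, market_price, min_agreement=3):
    """True when at least min_agreement sources land on the same side of the market."""
    votes = list(directional_votes(source_probs, market_price).values())
    above = sum(1 for v in votes if v > 0)
    below = sum(1 for v in votes if v < 0)
    return max(above, below) >= min_agreement

# Hypothetical probabilities against a market priced at 0.50.
three_to_one = {"ecmwf_ifs": 0.70, "gfs": 0.45, "aigefs": 0.65, "ecmwf_aifs_ens": 0.60}
two_to_two = {"ecmwf_ifs": 0.70, "gfs": 0.45, "aigefs": 0.40, "ecmwf_aifs_ens": 0.60}

print(passes_agreement(three_to_one, 0.50))
print(passes_agreement(two_to_two, 0.50))
```

Keeping the vote count separate from the weighted average means a heavily weighted source cannot push a trade through on its own.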

Pulling AIGEFS From the Operational S3 Bucket

The bot's AIGEFS module fetches directly from NOAA's public S3 bucket using GRIB2 byte-range requests. The .idx files list byte offsets for each variable, so we only download the variables we need rather than pulling a full GRIB file.

import requests
import eccodes
import numpy as np

AIGEFS_BUCKET = "https://noaa-nws-graphcastgfs-pds.s3.amazonaws.com"

def fetch_aigefs_member(cycle_str: str, member: int, lead_hour: int, variable: str,
                        level: str = "2 m above ground") -> np.ndarray:
    """
    Fetch a single field from AIGEFS ensemble output.
    cycle_str: e.g. '2026051800' (YYYYMMDDCC)
    member: 0-30 (31 total members)
    lead_hour: forecast lead in hours
    variable: name as it appears in the .idx file, e.g. 'TMP'
              (assumes wgrib2-style index names; the ecCodes shortName '2t' will not match)
    level: level string as it appears in the .idx file
    Returns: the full grid as a flat array (nearest gridpoint lookup done separately)
    """
    idx_url = (
        f"{AIGEFS_BUCKET}/graphcastgfs.{cycle_str[:8]}/{cycle_str[8:10]}/"
        f"graphcastgfs.t{cycle_str[8:10]}z.pgrb2.0p25.f{lead_hour:03d}.{member:02d}.idx"
    )

    idx_resp = requests.get(idx_url, timeout=30)
    idx_resp.raise_for_status()

    byte_start = None
    byte_end = None

    lines = idx_resp.text.strip().split("\n")
    for i, line in enumerate(lines):
        # idx fields: message number, byte offset, date, variable, level, forecast
        parts = line.split(":")
        if len(parts) < 5:
            continue  # skip malformed lines
        # Exact match on variable and level; a substring test can hit the wrong record.
        if parts[3] == variable and parts[4] == level:
            byte_start = int(parts[1])
            if i + 1 < len(lines):
                # The message ends one byte before the next one starts.
                byte_end = int(lines[i + 1].split(":")[1]) - 1
            break

    if byte_start is None:
        raise ValueError(f"Variable {variable} ({level}) not found in AIGEFS index for member {member}")

    # The last message in the file has no successor, so leave the range open-ended.
    if byte_end is not None:
        headers = {"Range": f"bytes={byte_start}-{byte_end}"}
    else:
        headers = {"Range": f"bytes={byte_start}-"}
    grib_url = idx_url.replace(".idx", "")
    grib_resp = requests.get(grib_url, headers=headers, timeout=60)
    grib_resp.raise_for_status()

    # ecCodes needs a real file handle and cannot read from io.BytesIO,
    # so decode the message directly from the downloaded bytes.
    msg_id = eccodes.codes_new_from_message(grib_resp.content)
    try:
        values = eccodes.codes_get_array(msg_id, "values")
    finally:
        eccodes.codes_release(msg_id)

    # Gridpoint selection is handled upstream by lat/lon index lookup.
    # Return full array here; caller extracts the relevant point.
    return values
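
Since the function returns the whole grid, the caller still has to pick out a station's gridpoint. Here is a minimal sketch of that lookup. It assumes a global 0.25-degree regular lat/lon grid stored row by row from 90°N southward, with longitudes starting at 0°E and increasing eastward, and it assumes temperatures arrive in Kelvin. Those layout assumptions should be checked against the GRIB message's own grid metadata. The helper names are hypothetical.

```python
import numpy as np

def nearest_gridpoint_index(lat, lon, resolution=0.25):
    """Row and column of the nearest point on the assumed north-to-south, 0E-origin grid."""
    n_lat = int(round(180 / resolution)) + 1   # 721 rows at 0.25 degrees
    n_lon = int(round(360 / resolution))       # 1440 columns at 0.25 degrees
    row = int(round((90.0 - lat) / resolution))
    row = min(max(row, 0), n_lat - 1)
    # Wrap negative (western) longitudes into the 0-360 range.
    col = int(round((lon % 360.0) / resolution)) % n_lon
    return row, col, n_lat, n_lon

def point_value(values, lat, lon, resolution=0.25):
    """Extract the nearest-gridpoint value from a flat field array."""
    row, col, n_lat, n_lon = nearest_gridpoint_index(lat, lon, resolution)
    grid = np.asarray(values).reshape(n_lat, n_lon)
    return float(grid[row, col])

def kelvin_to_fahrenheit(kelvin):
    """Convert Kelvin to degrees Fahrenheit."""
    return (kelvin - 273.15) * 9.0 / 5.0 + 32.0

# Synthetic field whose value equals its flat index, to check the lookup.
fake_field = np.arange(721 * 1440, dtype=float)
sample = point_value(fake_field, 40.75, -74.0)
print(sample, kelvin_to_fahrenheit(273.15))
```

Nearest-gridpoint lookup is the simplest option; a 0.25-degree cell is far coarser than a single ASOS station, so interpolation or bias correction may be worth considering.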

One thing that burned me early: NOAA's NOMADS server (nomads.ncep.noaa.gov) rate-limited the VM after roughly 868 parallel GRIB requests during initial testing. NOMADS is not designed for bulk programmatic access at the volume a trading bot generates during a scan cycle. The S3 bucket has no such limit. It is the same data. Use S3.
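
Even when pulling from S3, a fixed cap on parallel requests keeps a scan cycle predictable. The sketch below shows one way to do that with a thread pool. The fetch_all helper, the worker count of 8, and the task tuple layout are assumptions for illustration; fetch_fn stands in for any per-member fetch function.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(tasks, fetch_fn, max_workers=8):
    """Run fetch_fn(*task) for every task with at most max_workers requests in flight.

    Results come back in the same order as tasks. An exception raised by
    any fetch propagates to the caller when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda task: fetch_fn(*task), tasks))

# Hypothetical task list: one (cycle, member, lead_hour, variable) tuple per member.
example_tasks = [("2026051800", member, 24, "TMP") for member in range(31)]
```

Calling fetch_all(example_tasks, some_fetch_function) would then fetch all 31 members without ever exceeding eight concurrent requests.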

The GRIB filter endpoint filter_aigefs_0p25.pl that NOMADS advertises for AIGEFS does not exist. I found that out by getting 404s for an hour. The S3 approach is cleaner anyway.

What This Means Practically for Kalshi Temperature Trades

Kalshi's temperature markets resolve against official NOAA ASOS station readings. Most markets I trade are for major cities at 24-to-72-hour lead times. That is the short end of the range where AIGEFS shows measurable skill improvement over the raw GFS (the published gains are clearest from about day 3 onward), and it is not a range where AIGEFS is shown to beat the ECMWF IFS ensemble.

The practical implication: AIGEFS adds the most value as a disagreement detector. When AIGEFS diverges from GFS on direction, it is more likely that GFS has it wrong than in the pre-2025 world when both were physics-based with similar architectures. When AIGEFS agrees with ECMWF IFS and disagrees with GFS, and AIFS-ENS sides with that pair, the agreement filter has three votes against one. That is a strong signal to stand down on a GFS-driven trade, or take the other side.

Convergence across all four sources, the two physics-based systems (IFS and GFS) and the two AI systems alike, is the strongest confidence signal the bot generates. That is where the edge is sharpest.

What I Am Still Watching

The tail smoothing issue has not gone away. AIGEFS, like all neural network forecast systems trained on mean-squared-error-type objectives, tends to produce probabilistic output that underestimates extreme values. ECMWF has been working on this with their AI systems. NOAA has less published data on whether AIGEFS shows the same bias.

For markets asking about extreme temperatures (record highs, extended heat events), I apply additional skepticism to AI model probabilities. The physics-based GFS handles tails better in certain synoptic patterns, partly because it is explicitly solving the equations of atmospheric motion rather than interpolating from training data patterns.

I am also watching NOAA's EMC verification pages for the first full summer cycle of operational AIGEFS data. Summer convective patterns are different from the large-scale mid-latitude flow regimes that dominate the training distribution. The verification literature on GraphCast is heavier on winter and shoulder-season data. Summer 2026 will be the first real test of operational AIGEFS in peak convective season.

The December 2025 AIGEFS operational declaration is a real milestone, not marketing. The verification data supports it. The weight I gave it in the ensemble reflects where the evidence actually sits today, not where I expect it to sit in two years. When the operational track record extends and the summer verification data comes in, those weights will get revisited.

The data is public. The S3 bucket is free. The edge is in using it systematically when most traders are still looking at a single GFS run on Weather.com.

NOAA's GraphCast Goes Operational: What It Means for Weather Trading Bots