
AWS S3 as a Free Weather Data Pipeline: How NOAA Publishes Forecast Data Without Rate Limits

TL;DR / Key Takeaways

  • NOAA publishes its operational AIGEFS (GraphCast-based) forecast data to a public AWS S3 bucket with no authentication and no rate limits.
  • The GRIB2 .idx index file trick lets you download only the variables you need instead of pulling full multi-gigabyte GRIB files.
  • NOMADS, NOAA's legacy server, rate-limited my VM at 868 parallel requests and the AIGEFS filter endpoint doesn't even exist there.
  • The Predict & Profit Weather Bot v2.1 pulls from noaa-nws-graphcastgfs-pds.s3.amazonaws.com using this exact method as part of its 164-member grand ensemble.

I spent an embarrassing amount of time trying to get AIGEFS data out of NOMADS before I accepted the obvious: NOAA moved on and didn't really tell anyone.

The NOMADS server at nomads.ncep.noaa.gov is how most people learn to pull NOAA model data. There are tutorials everywhere. The GRIB filter interface is documented. It works fine for GFS if your access pattern is light. But when I tried to pull AIGEFS data at any volume, two things happened in quick succession.

First, the filter endpoint filter_aigefs_0p25.pl returned 404. It does not exist for AIGEFS. The GRIB filter system at NOMADS was never set up for this model. Second, after my VM attempted 868 parallel GRIB requests across forecast hours and ensemble members, NOMADS rate-limited the entire IP. Zero successful fetches. The trading cycle failed silently.

The fix took about 20 minutes once I stopped looking at NOMADS entirely.


The Bucket

NOAA publishes AIGEFS data to a public AWS S3 bucket:

noaa-nws-graphcastgfs-pds.s3.amazonaws.com

No API key. No AWS credentials required. It's a public bucket that allows anonymous reads over HTTPS. It is not requester-pays: a requester-pays bucket rejects unauthenticated requests, and these go through without any. This is the same bucket NOAA uses operationally. What goes into production goes here. You're not pulling a delayed copy or a degraded mirror. You're pulling the actual data.

The path structure looks like this:

/graphcastgfs.YYYYMMDD/HH/atmos/graphcastgfs.tHHz.pgrb2s.0p25.fFFF.idx

Where YYYYMMDD is the model run date, HH is the cycle (00, 06, 12, 18), FFF is the forecast hour (000 through 240 in 6-hour steps for AIGEFS), and the .idx suffix is the index file.

The ensemble members live under a slightly different prefix, but the structure is consistent enough that you can template it cleanly in Python.


Why the .idx File Is the Whole Point

A full AIGEFS GRIB2 file for a single forecast hour and member can run 200MB or more depending on what variables are packed in. If you're running an ensemble of 31 members across multiple forecast hours, you are now downloading many gigabytes per cycle. That's not a pipeline, that's a disaster.
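To put rough numbers on that, here is a back-of-envelope sketch. It assumes the 868 requests from the failed NOMADS run each map to one full file of about 200MB, and it borrows the 143,456-byte record size from the sample index in the next section as a stand-in for one variable. Both figures are illustrative, not measured.

```python
FILES_PER_CYCLE = 868            # assumed: one file per request in a cycle
FULL_FILE_BYTES = 200 * 10**6    # "200MB or more" per full GRIB2 file
ONE_RECORD_BYTES = 143_456       # assumed: one variable, size taken from the sample index

def cycle_bytes(files: int, bytes_per_fetch: int) -> int:
    """Total bytes transferred in one cycle."""
    return files * bytes_per_fetch

full = cycle_bytes(FILES_PER_CYCLE, FULL_FILE_BYTES)
ranged = cycle_bytes(FILES_PER_CYCLE, ONE_RECORD_BYTES)

print(f"full files:   {full / 10**9:.1f} GB")    # 173.6 GB
print(f"one variable: {ranged / 10**6:.1f} MB")  # 124.5 MB
print(f"reduction:    {full / ranged:.0f}x")     # 1394x
```

Under those assumptions, the difference is roughly three orders of magnitude per cycle.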

The .idx file is a plain-text index of every variable in the corresponding GRIB2 file. Each line gives the byte offset where that variable's record starts. The offset on the next line, minus one, is where it ends. That gives you a byte range. HTTP supports byte-range requests via the Range header. You request only the bytes you need.

Here's what a few lines of a GRIB2 index file look like:

1:0:d=2025120100:TMP:2 m above ground:3 hour fcst:
2:143456:d=2025120100:UGRD:10 m above ground:3 hour fcst:
3:287112:d=2025120100:VGRD:10 m above ground:3 hour fcst:
4:430768:d=2025120100:TMP:surface:3 hour fcst:

Column two is the byte offset. To get TMP at 2m, you request bytes 0 through 143455. To get UGRD at 10m, you request bytes 143456 through 287111. The HTTP server at S3 handles this without complaint.
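As a quick offline check of that arithmetic, the sketch below turns the four sample lines into Range header values. The function name range_headers is mine, purely for illustration. The last record has no following offset, so it gets an open-ended range (bytes=N-), which HTTP defines as "from N to the end of the file".

```python
SAMPLE_IDX = """\
1:0:d=2025120100:TMP:2 m above ground:3 hour fcst:
2:143456:d=2025120100:UGRD:10 m above ground:3 hour fcst:
3:287112:d=2025120100:VGRD:10 m above ground:3 hour fcst:
4:430768:d=2025120100:TMP:surface:3 hour fcst:
"""

def range_headers(idx_text: str) -> dict[tuple[str, str], str]:
    """Map (variable, level) to the HTTP Range header value for that record."""
    records = []
    for line in idx_text.strip().splitlines():
        # Fields: record number, byte offset, date, variable, level, forecast
        _, offset, _, var, level, *_ = line.split(":")
        records.append((int(offset), var, level))

    headers = {}
    for i, (start, var, level) in enumerate(records):
        if i + 1 < len(records):
            # A record ends one byte before the next record begins
            headers[(var, level)] = f"bytes={start}-{records[i + 1][0] - 1}"
        else:
            # Last record: open-ended range runs to the end of the file
            headers[(var, level)] = f"bytes={start}-"
    return headers

for key, value in range_headers(SAMPLE_IDX).items():
    print(key, value)
# ('TMP', '2 m above ground') bytes=0-143455
# ('UGRD', '10 m above ground') bytes=143456-287111
# ('VGRD', '10 m above ground') bytes=287112-430767
# ('TMP', 'surface') bytes=430768-
```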


The Python Implementation

Here's the core of how the Weather Bot pulls a single variable from S3. This is a cleaned-up version of the actual production code.

import requests
import tempfile
import cfgrib

BUCKET_BASE = "https://noaa-nws-graphcastgfs-pds.s3.amazonaws.com"

def build_s3_url(run_date: str, cycle: str, member: int, fhour: int) -> tuple[str, str]:
    """
    Returns (grib_url, idx_url) for a given AIGEFS member and forecast hour.
    run_date: YYYYMMDD
    cycle: '00', '06', '12', '18'
    member: 0-30 (AIGEFS has 31 ensemble members)
    fhour: forecast hour (0, 6, 12, ... 240)
    """
    fhour_str = f"{fhour:03d}"
    member_str = f"{member:02d}"
    prefix = f"graphcastgfs.{run_date}/{cycle}/atmos/mem{member_str}"
    fname = f"graphcastgfs.t{cycle}z.pgrb2s.0p25.f{fhour_str}"
    grib_url = f"{BUCKET_BASE}/{prefix}/{fname}"
    idx_url = f"{grib_url}.idx"
    return grib_url, idx_url


def get_byte_range(idx_url: str, target_var: str, target_level: str) -> tuple[int, int | None] | None:
    """
    Parse the .idx file and return (start_byte, end_byte) for the target variable.
    end_byte is None when the variable is the last record in the file.
    Returns None if the index can't be fetched or the variable isn't found.
    """
    resp = requests.get(idx_url, timeout=15)
    if resp.status_code != 200:
        return None

    offsets = []
    for line in resp.text.strip().splitlines():
        # Index line format: record:offset:d=date:VAR:level:forecast:
        parts = line.split(":")
        if len(parts) < 6 or not parts[1].isdigit():
            continue
        offsets.append((int(parts[1]), parts[3], parts[4]))

    for i, (offset, var, level) in enumerate(offsets):
        if var == target_var and target_level in level:
            # A record ends one byte before the next record starts.
            # The last record has no successor, so leave the end open
            # instead of guessing at its size.
            end = offsets[i + 1][0] - 1 if i + 1 < len(offsets) else None
            return offset, end

    return None


def fetch_grib_variable(
    grib_url: str,
    idx_url: str,
    target_var: str,
    target_level: str,
    lat: float,
    lon: float
) -> float | None:
    """
    Fetch a single GRIB2 variable via byte-range request and extract
    the value at the nearest grid point to (lat, lon).
    """
    byte_range = get_byte_range(idx_url, target_var, target_level)
    if byte_range is None:
        return None

    start, end = byte_range
    # "bytes=N-" (no end) asks for everything from N to the end of the file
    range_value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    resp = requests.get(grib_url, headers={"Range": range_value}, timeout=30)

    # Require 206: a 200 means the Range header was ignored and the body
    # is the whole file, not the single record we asked for
    if resp.status_code != 206:
        return None

    # A temp directory also catches the index file cfgrib writes next to
    # the GRIB file, so nothing is left on disk after parsing
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = f"{tmp_dir}/field.grb2"
        with open(tmp_path, "wb") as f:
            f.write(resp.content)

        try:
            with cfgrib.open_dataset(tmp_path) as ds:
                da = ds[list(ds.data_vars)[0]]

                # Normalize longitude to 0-360 (assumes the grid uses 0-360)
                lon = lon % 360.0

                # Work on the coordinate arrays directly. Never rely on
                # truthiness (if da.latitude) or "or" with xarray objects
                lat_idx = int(abs(da.latitude.values - lat).argmin())
                lon_idx = int(abs(da.longitude.values - lon).argmin())

                return float(da.values[lat_idx, lon_idx])
        except Exception as e:
            print(f"GRIB parse error: {e}")
            return None

The comment about truthiness in that last block is not an accident. If you write if da.latitude, or use a Python or to fall back to a default, the check raises ValueError: The truth value of an array with more than one element is ambiguous. I hit this in production. It killed the AIGEFS module silently until I caught it in the logs. The fix is always an explicit check: if da.latitude is not None.


Boto3 vs Direct HTTPS

People ask whether to use boto3 or raw HTTPS for public buckets. The answer is raw HTTPS for this use case.

Boto3 is the right tool when you need AWS credentials, when you're uploading, or when you're doing bulk listing operations that benefit from the SDK's pagination handling. For reading from a public bucket by URL, boto3 adds overhead and a dependency without giving you anything in return.

Direct HTTPS with requests is faster to set up, easier to debug, and the URLs are stable enough to template. The byte-range trick works cleanly over HTTP. S3 returns a 206 Partial Content response for range requests, which requests handles without any special configuration.

One thing to be aware of: S3 occasionally returns a 503 on high-traffic buckets during peak model run windows (right after the 00Z and 12Z cycles post). Build in a retry with exponential backoff. Three retries with 2s, 4s, 8s delays handles nearly every transient error I've seen.

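import time

def fetch_with_retry(url: str, headers: dict, max_retries: int = 3) -> requests.Response | None:
    delay = 2
    # One initial attempt plus max_retries retries (waits of 2s, 4s, 8s)
    for attempt in range(max_retries + 1):
        try:
            resp = requests.get(url, headers=headers, timeout=30)
            if resp.status_code in (200, 206):
                return resp
            if resp.status_code != 503:
                # Anything other than a 503 (a 404, say) won't fix itself
                return None
        except requests.RequestException:
            pass  # network error: treat as transient and retry
        # Don't sleep after the final attempt
        if attempt < max_retries:
            time.sleep(delay)
            delay *= 2
    return None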

That's it. No boto3. No AWS SDK. No credentials. 868 out of 868 downloads succeed on the current bot.


Why NOMADS Fails at Scale

NOMADS isn't going away and it's not broken for all use cases. If you're pulling one or two GFS files interactively, it works fine. The problems start when you automate.

The rate limiting is not documented clearly. You don't get a 429 with a Retry-After header. You get timeouts, connection resets, and partial responses that corrupt your GRIB files in ways that are hard to detect without checksum validation. My VM hit the limit mid-cycle and the bot logged partial data as valid. That's worse than a clean failure.
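A cheap structural check would have caught that. The sketch below is not a checksum and not what the bot originally ran. It relies only on the GRIB2 framing: a message starts with the bytes GRIB, carries its edition number in byte 8 and its total length in bytes 9-16, and ends with 7777. It assumes the payload is exactly one message, which is what a single-record byte range returns. The helper names are hypothetical.

```python
def looks_like_complete_grib2(payload: bytes) -> bool:
    """
    Structural check for a single GRIB2 message: starts with 'GRIB',
    edition 2, declared length matches, ends with '7777'.
    Catches truncation, not corrupted values inside the message.
    """
    if len(payload) < 20:  # 16-byte indicator section + 4-byte end marker
        return False
    if payload[:4] != b"GRIB" or payload[7] != 2:
        return False
    declared_length = int.from_bytes(payload[8:16], "big")
    return declared_length == len(payload) and payload[-4:] == b"7777"


def fake_message(body: bytes) -> bytes:
    """Build a synthetic message with valid framing, for demonstration only."""
    total = 16 + len(body) + 4
    header = b"GRIB" + b"\x00\x00" + b"\x00" + b"\x02" + total.to_bytes(8, "big")
    return header + body + b"7777"


good = fake_message(b"x" * 100)
print(looks_like_complete_grib2(good))        # True
print(looks_like_complete_grib2(good[:-10]))  # False: truncated download
```

Running this on resp.content before handing it to the parser turns a silent partial download into a clean failure.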

The filter endpoint situation with AIGEFS is a separate issue. NOAA built AIGEFS and pushed it to S3 as the delivery mechanism. The NOMADS GRIB filter system was not updated to match. If you're looking for filter_aigefs_0p25.pl, it simply isn't there. The canonical source for AIGEFS is S3. That's it.

For any automated pipeline running on a schedule, S3 is the correct answer. It's built to serve many requests in parallel, it handled all 868 of mine cleanly, and it's the same data. There's no reason to use NOMADS for AIGEFS.


What the Bot Does With This

The Weather Bot v2.1 runs this S3 pipeline as one of four ensemble sources. After pulling the AIGEFS member data, it computes a probability distribution over the forecast temperature range and compares it against the other three sources: GFS via Open-Meteo, ECMWF IFS, and ECMWF AIFS-ENS.

The ENSEMBLE_COMPARE log line shows the per-source probability for every trade candidate. A trade only fires if at least 3 of the 4 sources agree on direction. AIGEFS gets a 0.25 weight in the grand ensemble, same as GFS. ECMWF IFS gets the highest weight at 0.30 because it consistently outperforms on global skill scores.
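A minimal sketch of that agreement gate follows. It is not the bot's code. Two things in it are my assumptions: the 0.20 weight for ECMWF AIFS-ENS (simply the remainder if the four weights sum to 1), and the definition of "direction" as whether a source's probability sits above or below a reference probability such as the market-implied one.

```python
WEIGHTS = {
    "AIGEFS": 0.25,
    "GFS": 0.25,
    "ECMWF IFS": 0.30,
    "ECMWF AIFS-ENS": 0.20,  # assumed: remainder so the weights sum to 1
}

def evaluate(probs: dict[str, float], reference: float, min_agree: int = 3):
    """
    probs: per-source probability of the outcome.
    reference: probability to compare against (assumed: market-implied).
    Returns (weighted_probability, direction), where direction is
    "above", "below", or None when fewer than min_agree sources agree.
    """
    weighted = sum(WEIGHTS[name] * p for name, p in probs.items())
    above = sum(1 for p in probs.values() if p > reference)
    below = sum(1 for p in probs.values() if p < reference)
    if above >= min_agree:
        direction = "above"
    elif below >= min_agree:
        direction = "below"
    else:
        direction = None  # split vote: no trade
    return weighted, direction


agree = {"AIGEFS": 0.70, "GFS": 0.65, "ECMWF IFS": 0.60, "ECMWF AIFS-ENS": 0.40}
split = {"AIGEFS": 0.70, "GFS": 0.60, "ECMWF IFS": 0.40, "ECMWF AIFS-ENS": 0.30}

print(evaluate(agree, reference=0.50))  # 3 of 4 above the reference
print(evaluate(split, reference=0.50))  # 2 vs 2, so direction is None
```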

The S3 pipeline is what made adding AIGEFS practical. The NOMADS dead end wasn't a temporary obstacle, it was a signal that NOMADS wasn't the intended delivery path for this data.


One More Thing on Data Freshness

NOAA posts AIGEFS data to S3 within roughly 3-4 hours of the model run initialization time. The 00Z cycle is usually available by 04:00-05:00 UTC. The bot checks availability by requesting the .idx file for the target run and treating a 200 response as confirmation the data is posted.
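Here is a hedged sketch of that availability check. The URL follows the path structure described earlier. Which forecast hour to probe is my assumption, and so are the function names and the 12-check cap. The status lookup is passed in as a function so the loop can be exercised without touching the network. In production it could be something like lambda url: requests.get(url, timeout=15).status_code.

```python
import time
from typing import Callable

BUCKET_BASE = "https://noaa-nws-graphcastgfs-pds.s3.amazonaws.com"

def idx_url_for(run_date: str, cycle: str, fhour: int = 0) -> str:
    """Index URL for one forecast hour of a run (fhour choice is an assumption)."""
    return (f"{BUCKET_BASE}/graphcastgfs.{run_date}/{cycle}/atmos/"
            f"graphcastgfs.t{cycle}z.pgrb2s.0p25.f{fhour:03d}.idx")

def wait_for_run(
    idx_url: str,
    get_status: Callable[[str], int],
    poll_seconds: float = 600,          # 10 minutes between checks
    max_checks: int = 12,               # assumed cap: give up after ~2 hours
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once get_status(idx_url) is 200, False after max_checks tries."""
    for check in range(max_checks):
        if get_status(idx_url) == 200:
            return True
        if check < max_checks - 1:      # no point sleeping after the last check
            sleep(poll_seconds)
    return False


# Demonstration with canned responses instead of real HTTP calls
statuses = iter([404, 404, 200])
ready = wait_for_run(
    idx_url_for("20251201", "00"),
    get_status=lambda url: next(statuses),
    sleep=lambda seconds: None,         # skip the real wait in the demo
)
print(ready)  # True, on the third check
```

Probing the last forecast hour you need, rather than hour 000, is the safer choice if files for a run are posted progressively.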

Don't poll the bucket on a tight loop. Check once every 10-15 minutes after the expected post time. Anonymous requests to a public bucket like this one are billed to the bucket owner, not to you, so reads are free from the data consumer's side. Still, there's no reason to hammer it.

The public data infrastructure NOAA has built here is genuinely good. Free, reliable, no authentication, operational quality data. Most retail traders don't know it exists. That asymmetry is the whole point.