Month 1 Results: The Honest Numbers From My Automated Weather Trading Bot
TL;DR / Key Takeaways
- The first month produced real sales, limited live trading data, and several operational lessons.
- Early bot performance included losses, which are published instead of hidden.
- Month 2 changes focus on stricter filters, improved ensemble agreement, and better execution discipline.
- The transparency report is meant to set realistic expectations for buyers and builders.
I built this whole brand around honesty. So let me do the thing most trading content creators never do: show you the actual numbers, including the bad ones.
This is the transparency post for the first real month running the Predict & Profit weather bot live on Kalshi prediction markets. Not a backtest. Not a simulation. Real trades, real money, real results.
I am not going to cherry-pick the good weeks and bury the bad ones. That is not how this works. If the system is worth anything, the honest numbers will show it. If it is not, you deserve to know that too.
The Setup
For context on what I was running: the bot trades Kalshi temperature contracts across roughly 14 US cities. It uses a 62-member HGEFS hybrid ensemble model to score each market on four factors: ensemble spread, confidence level, model vs market gap, and fee efficiency. Trades only get placed when all four filters clear their minimum thresholds.
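To make the four-factor gate concrete, here is a minimal sketch of the all-or-nothing check in Python. The field names and threshold values are illustrative assumptions, not the exact ones the bot ships with.

```python
from dataclasses import dataclass

# Illustrative thresholds -- the real values live in the bot's config.
MIN_EDGE = 0.10          # model probability vs market price gap
MIN_CONFIDENCE = 0.30    # ensemble agreement on the traded outcome
MAX_SPREAD = 4.0         # ensemble spread, degrees F
MAX_FEE_FRACTION = 0.25  # share of expected edge the fee may consume

@dataclass
class MarketScore:
    spread: float        # how much the ensemble members disagree
    confidence: float    # share of members backing the traded outcome
    edge: float          # model probability minus market-implied probability
    fee_fraction: float  # expected fee divided by expected edge

def passes_all_filters(score: MarketScore) -> bool:
    """A trade is only placed when every filter clears its threshold."""
    return (
        score.spread <= MAX_SPREAD
        and score.confidence >= MIN_CONFIDENCE
        and score.edge >= MIN_EDGE
        and score.fee_fraction <= MAX_FEE_FRACTION
    )
```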
The bot runs 24/7, checks for new GFS ensemble data when it updates, evaluates all active markets, and places orders automatically through the Kalshi API using RSA-PSS authentication. Every trade is logged to PostgreSQL including the entry probability, the model probability, the number of contracts, and the outcome.
That logging is what makes this post possible. Without it I would be guessing.
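For readers who want to replicate the logging, here is a rough sketch of what a trade log table and insert can look like with psycopg2. The table layout, column names, and connection string are assumptions for illustration; the actual schema ships with the source code.

```python
import psycopg2

# Assumed layout -- the real schema ships with the source code.
CREATE_TRADES = """
CREATE TABLE IF NOT EXISTS trades (
    id         SERIAL PRIMARY KEY,
    placed_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    city       TEXT NOT NULL,
    ticker     TEXT NOT NULL,
    entry_prob NUMERIC NOT NULL,  -- market-implied probability at entry
    model_prob NUMERIC NOT NULL,  -- ensemble probability at entry
    contracts  INTEGER NOT NULL,
    outcome    TEXT               -- 'win', 'loss', or NULL while open
);
"""

def log_trade(conn, city, ticker, entry_prob, model_prob, contracts):
    """Insert one placed trade; the outcome column is filled in at resolution."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO trades (city, ticker, entry_prob, model_prob, contracts) "
            "VALUES (%s, %s, %s, %s, %s)",
            (city, ticker, entry_prob, model_prob, contracts),
        )

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=weatherbot")  # placeholder connection string
    with conn, conn.cursor() as cur:
        cur.execute(CREATE_TRADES)
    log_trade(conn, "Atlanta", "EXAMPLE-TICKER", 0.42, 0.55, 10)
```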
The Numbers
| Metric type | What it measures | Why it matters | Month 2 response |
| --- | --- | --- | --- |
| Sales | Buyer demand for source code and ebook | Validates product interest | Keep publishing technical proof |
| Live trades | Early resolved bot outcomes | Tests execution and model filters | Tighten marginal trade requirements |
| Rejected trades | Opportunities the bot skipped | Shows filter discipline | Track whether filters are too strict or too loose |
Here is what the first month looked like.
Trades evaluated: several hundred across the full period. Most were rejected by the filters. That is expected and by design, as I covered in the post on why 90 percent of trades get filtered out.
Trades actually placed: in the low dozens. Some weeks had 3 to 4 trades. Some had zero. That variance is real and it is not a bug. Weather markets have good periods for the ensemble and bad ones, and the filters are designed to wait out the bad ones.
Win rate on completed trades: in the range of 60 to 65 percent. Not every win was a big winner and not every loss was a big loser. Prediction market contracts pay out based on whether the event resolves in your favor at expiration, so the math is relatively clean.
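For anyone checking that math: a binary contract bought at price p pays $1 if it resolves in your favor and $0 otherwise, so per-contract expected value against the model's probability is a one-liner. A sketch, with the fee treated as a flat per-contract deduction purely for illustration:

```python
def expected_value(model_prob: float, price: float, fee: float = 0.0) -> float:
    """Per-contract expected value in dollars for a YES bought at `price`.

    Win: collect $1 after paying `price`. Lose: forfeit `price`. The fee is
    modeled as a flat per-contract cost here, which is a simplification.
    """
    return model_prob * (1 - price) - (1 - model_prob) * price - fee

# Example: model says 62%, market price is 50 cents, 2 cents of fees.
# EV = 0.62 * 0.50 - 0.38 * 0.50 - 0.02 = 0.10 per contract.
print(expected_value(0.62, 0.50, 0.02))
```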
Net P&L for the month: modestly positive. Not retirement money. Not a loss. The bot made more than it lost, which after fees is the whole game at this stage.
I am being deliberately rough with these numbers and I want to explain why. The bot is still in active development. Publishing very specific numbers creates a record that new buyers will anchor to, and the performance in month two or month six may look different. What matters is the direction and the process, not the exact dollar figure in month one.
What Went Wrong
There were losing trades. Let me walk through the shape of a few of them.
The most common failure mode was a fast market move before the GFS data updated. The ensemble would show strong edge at the last update cycle. By the time the next cycle confirmed it, the market had already repriced and the edge was gone. The bot placed the trade based on the previous signal and lost.
This is a data latency problem. The GFS updates every six hours, but markets can move faster than that, especially when a major weather event is developing. Sometimes the bot is working with ensemble data that is nearly six hours old, and the market has already moved on.
The second failure mode was overconfidence in specific cities. Temperature forecast accuracy varies a lot by geography. A coastal city with complex sea breeze dynamics is harder to forecast than a midcontinent city with stable air masses. The bot does not know this. It treats Atlanta and San Francisco as equally forecastable. They are not.
Both of these are known problems and both are on the improvement list.
What Went Right
The fee filter worked exactly as designed. On contracts where the fee would have consumed more than a set percentage of the expected edge, the bot skipped the trade. Several of those would have been losers. I cannot know for certain because I did not take them, but the math on where those contracts resolved suggests the filter saved me money.
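Roughly, the fee filter compares the estimated fee to the expected edge in dollars and skips the trade when the ratio is too high. A sketch of that logic follows; Kalshi's published fee schedule scales with price times one-minus-price, but the exact rate, rounding, and cutoff below are assumptions to verify against the current schedule.

```python
import math

def estimated_fee(contracts: int, price: float, rate: float = 0.07) -> float:
    """Rough fee estimate in dollars. The rate and round-up-to-the-cent
    behavior are assumptions; check Kalshi's current fee table."""
    raw = rate * contracts * price * (1 - price)
    return math.ceil(raw * 100) / 100

def fee_efficient(model_prob: float, price: float, contracts: int,
                  max_fee_fraction: float = 0.25) -> bool:
    """Skip trades where the fee would eat too much of the expected edge."""
    edge_dollars = abs(model_prob - price) * contracts
    if edge_dollars <= 0:
        return False
    return estimated_fee(contracts, price) / edge_dollars <= max_fee_fraction
```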
The ensemble confidence filter also held up. The trades the bot rejected for confidence below 0.30 were exactly the marginal ones that only pay off if the resolution happens to go your way. In the cases I tracked manually, a meaningful share of those would have been losses. Skipping them was correct.
The automation itself worked without incident. No crashes, no missed trades due to technical failure, no API authentication issues. The PostgreSQL logging ran clean. That part is boring to describe but it matters. A bot that fails silently is a money pit.
What I Am Changing for Month 2
Three things.
First, I am adding a market recency check. If the market price has moved more than a set threshold since the last GFS update cycle, the bot treats the previous ensemble signal as stale and skips the trade. This directly addresses the latency problem.
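A minimal version of that recency check might look like this; the five-cent drift threshold and the idea of caching the market price at the last model cycle are assumptions for the sketch.

```python
MAX_PRICE_DRIFT = 0.05  # assumed threshold: 5 cents of movement since the last cycle

def signal_is_stale(price_at_last_cycle: float, current_price: float,
                    max_drift: float = MAX_PRICE_DRIFT) -> bool:
    """Treat the previous ensemble signal as stale when the market has already
    repriced by more than the allowed drift since the last model update."""
    return abs(current_price - price_at_last_cycle) > max_drift

# Example: the market moved from 0.44 to 0.58 while waiting on the next GFS run,
# so the old signal no longer reflects what the market knows.
assert signal_is_stale(0.44, 0.58)
```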
Second, I am building a city-level calibration layer. Over time the PostgreSQL logs will tell me which cities the bot performs better on. I can start adjusting position sizing by city based on historical win rate, not just ensemble signal strength.
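The first step is just a query over the resolved rows. A sketch against the assumed trades table from earlier; the sizing adjustment layered on top would come later, once the per-city sample sizes are meaningful.

```python
def city_win_rates(conn, min_trades: int = 10):
    """Per-city win rate from resolved rows in the (assumed) trades table.
    Cities with too few resolved trades are dropped to avoid noisy estimates."""
    query = """
        SELECT city,
               COUNT(*) AS resolved,
               AVG(CASE WHEN outcome = 'win' THEN 1.0 ELSE 0.0 END) AS win_rate
        FROM trades
        WHERE outcome IS NOT NULL
        GROUP BY city
        HAVING COUNT(*) >= %s
        ORDER BY win_rate DESC
    """
    with conn.cursor() as cur:
        cur.execute(query, (min_trades,))
        return cur.fetchall()
```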
Third, I am tightening the minimum ensemble edge from 0.10 to 0.12 for a test period to see if the higher bar improves win rate enough to offset the lower volume. This is the kind of tuning you can only do with real trade logs.
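That test can be dry-run against the existing logs before anything changes live. A rough sketch, with the tuple layout as an assumption about how the logged trades are pulled:

```python
def volume_and_win_rate(trades, min_edge):
    """Re-score resolved trades against a higher edge bar.
    `trades` is a list of (model_prob, entry_prob, outcome) tuples from the
    logs, where outcome is 'win' or 'loss'."""
    kept = [t for t in trades if abs(t[0] - t[1]) >= min_edge]
    if not kept:
        return 0, 0.0
    wins = sum(1 for t in kept if t[2] == "win")
    return len(kept), wins / len(kept)

# Compare the current 0.10 bar against the test 0.12 bar on the same logs:
# for bar in (0.10, 0.12):
#     n, wr = volume_and_win_rate(logged_trades, bar)
#     print(f"edge >= {bar:.2f}: {n} trades, {wr:.0%} win rate")
```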
The Bigger Picture
One month of live trading proves exactly one thing: the system works mechanically and produces a positive result in a limited sample. That is it. It does not prove the edge is permanent. It does not prove the win rate holds at scale. Weather markets are seasonal and the ensemble performs differently in summer versus winter.
I started this project because I wanted to know if a data-driven, automated approach could find real edges in prediction markets using free public data. After month one, the honest answer is: probably yes, and I need more data.
That is not hype. It is where I actually am.
If you want to build and run this system yourself, the full Python source code is $67 at predictandprofit.gumroad.com. You get the scoring engine, the API integration, the PostgreSQL logging schema, and the filter logic. Everything I ran this month.
Frequently Asked Questions
Q: What should be evaluated after the first month of a trading bot?
A: Evaluate sales, live trade outcomes, rejected trades, execution issues, and whether the model behaved as expected. One month of raw P&L is too small a sample to judge the system on its own.
Q: Why publish early losses?
A: Publishing losses prevents survivorship bias and gives buyers realistic expectations. It also forces the system owner to analyze mistakes instead of marketing around them.
Q: What changed after Month 1?
A: The system moved toward stricter ensemble agreement, better filters, and more disciplined evaluation. The goal was fewer marginal trades, not more activity.