Why I Added 62 Unit Tests to a Bot I Already Trusted
TL;DR / Key Takeaways
- A regime-change false positive and a strike consistency bug both caused real trading errors in the Kalshi econ bot before I caught them.
- Probabilistic trading logic is harder to test than CRUD code because "correct" is a distribution, not a single value.
- I added 28 tests for regime detection and 34 for strike consistency enforcement, totaling 62 new tests for two modules alone.
- The Predict & Profit Econ Bot ships with this full test suite so buyers can verify behavior before running with real money.
I trusted the econ bot. I had watched it run. I had read the code. I thought I knew what it was doing.
Then I started seeing trades I did not expect.
The bot was holding contradictory positions on the same CPI release: long YES and long NO on the same "above 3.2%" strike at the same time, which makes no sense as a bet structure. Separately, the regime-change detector was firing on small nowcast fluctuations and closing positions that should have stayed open.
Both bugs were in code I had already reviewed. Both were subtle. Both caused real financial behavior I did not intend.
That is when I stopped trusting my own read of the code and started writing tests.
Why Trading Bots Are a Worse Place to Skip Tests
Most software fails loudly. A web app throws a 500. A CLI crashes with a traceback. You see it, you fix it.
A trading bot fails quietly. It places the wrong orders. It skips valid opportunities. It closes winning positions early. The bot keeps running, logging looks clean, and you have no idea anything is wrong unless you audit every trade by hand.
I am not auditing every trade by hand. That defeats the entire point.
The bot runs on a VPS in Dallas while I am in Atlanta doing my day job at QTS. Nobody is watching it. The logs tell me what happened, but they cannot tell me what should have happened. Only a test suite can do that.
If the code is wrong and there are no tests, the wrong behavior becomes the definition of correct. You calibrate your expectations to the broken thing. That is a bad place to end up.
The Two Bugs That Triggered This
Bug one: regime detection false positive.
regime_change.py monitors the Cleveland Fed CPI nowcast. When the nowcast shifts significantly, the bot is supposed to close positions that are now on the wrong side of the market. Good feature. The threshold for "significant" was a 0.15 probability swing.
The problem was in how I was calculating the swing. I was comparing the current nowcast to the last value the bot had seen, not to the value recorded when the position was opened. A slow drift of 0.05 over three cycles looked like a 0.15 jump if the intermediate values had not been cached correctly.
The bot was closing positions it had held for two hours because the nowcast had moved 0.05 total but the comparison window was stale.
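One way a last-seen anchor can misfire is a dip followed by a recovery. The sketch below is a minimal illustration, not the bot's actual module: it borrows the RegimeChangeDetector, record_position_baseline, and should_close names from the test suite, assumes the swing is an absolute difference, and adds a hypothetical swing_from_last_seen method only to show what the wrong anchor would measure.

```python
class RegimeChangeDetector:
    """Illustrative sketch: anchors the swing on the nowcast at position open."""

    def __init__(self, threshold=0.15):
        self.threshold = threshold
        self._baselines = {}   # ticker -> nowcast when the position was opened
        self._last_seen = {}   # ticker -> most recent nowcast (comparison only)

    def record_position_baseline(self, ticker, nowcast):
        self._baselines[ticker] = nowcast
        self._last_seen[ticker] = nowcast

    def swing_from_last_seen(self, ticker, nowcast):
        """Hypothetical helper: what the wrong anchor would measure."""
        return abs(nowcast - self._last_seen[ticker])

    def should_close(self, ticker, nowcast):
        baseline = self._baselines.get(ticker)
        if baseline is None:
            return False  # no baseline recorded: do nothing rather than guess
        self._last_seen[ticker] = nowcast
        # Assumed: "swing" means absolute movement away from the baseline.
        return abs(nowcast - baseline) > self.threshold


detector = RegimeChangeDetector(threshold=0.15)
detector.record_position_baseline("CPI-23-A-3.2", 0.60)

detector.should_close("CPI-23-A-3.2", 0.52)  # dip of 0.08, below threshold
buggy_swing = detector.swing_from_last_seen("CPI-23-A-3.2", 0.68)
close_now = detector.should_close("CPI-23-A-3.2", 0.68)

print(round(buggy_swing, 2))  # 0.16 measured from last-seen: a false trigger
print(close_now)              # False: only 0.08 away from the 0.60 baseline
```

The nowcast never moved more than 0.08 from where the position was opened, yet the last-seen comparison reports a swing above the 0.15 threshold.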
Bug two: strike consistency failure.
The strike consistency enforcer is supposed to prevent the bot from holding logically contradictory positions on the same event. If CPI prints at 3.3%, then "above 3.2%" resolves YES and "above 3.4%" resolves NO. Holding both is not a hedge. It is incoherence.
The enforcer was checking for duplicates by strike ticker only, not by direction. Two positions with the same strike but opposite sides were passing the consistency check because the comparison was too narrow.
Neither bug was obvious in code review. Both were obvious the moment I wrote a test that described the expected behavior in plain terms and watched it fail.
What 28 Tests for Regime Detection Actually Cover
The regime detection module needs to answer one question: is this nowcast shift large enough and sustained enough to close my positions?
That sounds simple. It is not.
Here is a representative test:
import pytest
from unittest.mock import patch, MagicMock
from datetime import datetime, timedelta

from regime_change import RegimeChangeDetector


class TestRegimeDetector:
    def setup_method(self):
        self.detector = RegimeChangeDetector(threshold=0.15)

    def test_no_trigger_on_gradual_drift(self):
        """
        Slow drift across three cycles should NOT trigger a close.
        Each individual step is below threshold.
        Total movement could exceed threshold if measured wrong.
        The fourth cycle crosses the threshold from baseline and should close.
        """
        position_opened_at = 0.42  # nowcast when position was placed
        cycle_1 = 0.47  # +0.05 from open, no trigger
        cycle_2 = 0.51  # +0.09 from open, no trigger
        cycle_3 = 0.55  # +0.13 from open, no trigger
        cycle_4 = 0.58  # +0.16 from open, SHOULD trigger

        self.detector.record_position_baseline("CPI-23-A-3.2", position_opened_at)

        assert not self.detector.should_close("CPI-23-A-3.2", cycle_1)
        assert not self.detector.should_close("CPI-23-A-3.2", cycle_2)
        assert not self.detector.should_close("CPI-23-A-3.2", cycle_3)
        assert self.detector.should_close("CPI-23-A-3.2", cycle_4)

    def test_trigger_uses_position_baseline_not_last_seen(self):
        """
        The comparison anchor must be the nowcast at position open,
        not the most recent nowcast value.
        This is the exact bug that caused false positives in production.
        """
        self.detector.record_position_baseline("CPI-23-A-3.4", 0.60)

        # Nowcast dips, then snaps back above the baseline
        assert not self.detector.should_close("CPI-23-A-3.4", 0.52)  # -0.08

        # If the bot used last-seen (0.52) instead of baseline (0.60),
        # the next value of 0.68 would look like a +0.16 jump and trigger.
        # It is actually +0.08 from baseline, below threshold.
        result = self.detector.should_close("CPI-23-A-3.4", 0.68)
        assert not result  # 0.68 - 0.60 = 0.08, below 0.15 threshold

    def test_missing_baseline_does_not_close(self):
        """
        If no baseline was recorded for a position ticker,
        the detector should not close it. Fail safe, not fail open.
        """
        result = self.detector.should_close("CPI-23-A-UNKNOWN", 0.80)
        assert not result
The third test is the one I care most about: fail safe, not fail open. If the detector has no record of a position baseline, it does nothing. It does not close the trade on a guess. This is the same principle as every kill switch in the bot. When in doubt, do nothing.
The remaining 25 tests cover: simultaneous positions on different events, threshold edge cases at exactly 0.15, time-windowed detection, serialization of the baseline cache, and behavior on empty nowcast responses from the Cleveland Fed scraper.
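The empty-response case follows the same do-nothing rule. Here is a minimal sketch of that kind of guard, assuming a hypothetical NowcastReading record, a hypothetical usable_nowcast helper, and a six-hour freshness limit. None of these come from the bot's source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class NowcastReading:
    """Hypothetical container for one scraped nowcast value."""
    value: Optional[float]  # None when the scraper got an empty response
    fetched_at: datetime


def usable_nowcast(reading, now, max_age=timedelta(hours=6)):
    """Return the nowcast value only if it is present and fresh, else None."""
    if reading is None or reading.value is None:
        return None  # empty response: nothing to act on
    if now - reading.fetched_at > max_age:
        return None  # stale data: treat it as missing
    return reading.value


now = datetime(2024, 1, 10, 12, 0)  # arbitrary fixed clock for the example
fresh = NowcastReading(value=0.58, fetched_at=now - timedelta(minutes=30))
stale = NowcastReading(value=0.58, fetched_at=now - timedelta(hours=9))
empty = NowcastReading(value=None, fetched_at=now)

print(usable_nowcast(fresh, now))  # 0.58
print(usable_nowcast(stale, now))  # None
print(usable_nowcast(empty, now))  # None
```

A caller that gets None back skips the regime check for that cycle instead of handing a guess to should_close.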
What 34 Tests for Strike Consistency Cover
The strike consistency module has a harder job. It has to model the logical relationships between CPI strikes and decide whether a proposed new position contradicts something already held.
CPI above 3.2% and CPI above 3.4% are not independent. If above 3.4% resolves YES, then above 3.2% also resolves YES. The market treats them as separate contracts, but they are correlated bets on the same underlying print.
The enforcer needs to understand direction, not just ticker.
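One way to make "direction, not just ticker" concrete is to ask whether any CPI print exists that pays both legs. The sketch below illustrates that question rather than the enforcer's real logic: the Leg tuple and the winning_range and can_both_pay helpers are hypothetical names, and it assumes every contract has the form "above strike" with YES paying on a print strictly above the strike.

```python
from typing import NamedTuple


class Leg(NamedTuple):
    """Hypothetical position on an 'above <strike>' contract."""
    strike: float
    side: str  # "yes" pays if print > strike, "no" pays otherwise


def winning_range(leg):
    """Return (low, high): the leg pays when low < print <= high."""
    if leg.side == "yes":
        return (leg.strike, float("inf"))
    return (float("-inf"), leg.strike)


def can_both_pay(a, b):
    """True if some CPI print pays both legs at once."""
    low = max(winning_range(a)[0], winning_range(b)[0])
    high = min(winning_range(a)[1], winning_range(b)[1])
    return low < high


# Graduated thesis: any print above 3.4 pays both legs.
print(can_both_pay(Leg(3.2, "yes"), Leg(3.4, "yes")))  # True
# Same strike, opposite sides: one leg always loses.
print(can_both_pay(Leg(3.2, "yes"), Leg(3.2, "no")))   # False
# Betting below 3.2 and above 3.4: no print satisfies both.
print(can_both_pay(Leg(3.2, "no"), Leg(3.4, "yes")))   # False
```

The check depends on the side of each leg. A comparison that only looks at tickers cannot tell the first pair from the third.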
from strike_consistency import StrikeConsistencyEnforcer  # module name assumed


class TestStrikeConsistency:  # class name assumed for this excerpt
    def test_contradictory_sides_across_strikes(self):
        """
        Holding NO on 3.2% and YES on 3.4% simultaneously is contradictory.
        If CPI prints at 3.3%, both legs lose.
        If CPI prints at 3.5%, the 3.2% NO loses and the 3.4% YES wins.
        If CPI prints at 3.1%, the 3.2% NO wins and the 3.4% YES loses.
        No print pays both legs.
        This is NOT a spread. This is an incoherent book.
        Block it.
        """
        enforcer = StrikeConsistencyEnforcer()
        enforcer.add_position("CPI-23-A-3.2", side="no", contracts=5)
        result = enforcer.check_proposed("CPI-23-A-3.4", side="yes", contracts=3)
        assert result.allowed is False
        assert "contradictory" in result.reason.lower()

    def test_valid_spread_is_allowed(self):
        """
        Long YES on 3.2% and long YES on 3.4% is a valid graduated position.
        You believe CPI is above 3.2% and possibly above 3.4%.
        The higher strike is a more aggressive version of the same thesis.
        Allow it.
        """
        enforcer = StrikeConsistencyEnforcer()
        enforcer.add_position("CPI-23-A-3.2", side="yes", contracts=5)
        result = enforcer.check_proposed("CPI-23-A-3.4", side="yes", contracts=3)
        assert result.allowed is True

    def test_opposite_sides_same_strike_blocked(self):
        """
        Long YES and long NO on the same strike is a direct contradiction.
        This should never happen. Block it hard.
        """
        enforcer = StrikeConsistencyEnforcer()
        enforcer.add_position("CPI-23-A-3.2", side="yes", contracts=5)
        result = enforcer.check_proposed("CPI-23-A-3.2", side="no", contracts=3)
        assert result.allowed is False
The remaining 31 tests cover: multi-event isolation (FOMC positions should not affect CPI checks), position removal on settlement, the event position cap interaction, edge cases with partial fills, and the logging format so I can actually read what was blocked and why.
The Pattern for Testing Probabilistic Logic
Most unit test tutorials cover pure functions: input goes in, output comes out, check equality. Trading logic is messier. The output is a probability estimate, a decision under uncertainty, a comparison against a threshold.
A few patterns that work:
Test the decision boundary, not the probability. You do not care that the nowcast is 0.634521. You care that 0.60 to 0.76 triggers a close and 0.60 to 0.74 does not. Write tests at the boundary values, not random internals.
Test the failure modes first. The three most important test cases for any trading module: what happens when the API returns nothing, what happens when the data is stale, and what happens when the module has never seen this ticker before. These are the cases that crash bots at 2am. Write them before you write the happy path.
Name the tests after the behavior, not the function. test_trigger_uses_position_baseline_not_last_seen is useful. test_should_close_returns_false is useless. The test name is documentation. It is the thing that tells you, six months later, what you were protecting against.
Use comments that describe the market logic. Write comments that explain why a trade is incoherent, not just what the assertion checks. The bug will come back in a different form. The comment tells the next reader (you, at midnight) what invariant you are preserving.
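The decision-boundary pattern deserves one concrete illustration, because float arithmetic makes the exact-threshold case slippery: 0.75 - 0.60 evaluates to 0.15000000000000002 in Python. The sketch below uses a hypothetical swing_triggers helper that rounds the swing to six decimals before comparing. It assumes a swing must strictly exceed 0.15 to trigger and that downward swings count too. The source does not say how the bot treats either case.

```python
THRESHOLD = 0.15


def swing_triggers(baseline, current, threshold=THRESHOLD):
    """Hypothetical helper: True when the swing strictly exceeds the threshold."""
    # Round first so float noise cannot push an exact-threshold swing over the line.
    swing = round(abs(current - baseline), 6)
    return swing > threshold


# (baseline, current, expected decision)
boundary_cases = [
    (0.60, 0.74, False),  # 0.14 swing: just under
    (0.60, 0.75, False),  # exactly 0.15: assumed not to trigger
    (0.60, 0.76, True),   # 0.16 swing: just over
    (0.60, 0.44, True),   # downward swing of 0.16 (assumed to count)
]

for baseline, current, expected in boundary_cases:
    assert swing_triggers(baseline, current) is expected

# Without rounding, the exact-threshold case flips:
print(abs(0.75 - 0.60) > THRESHOLD)  # True, because the float result is 0.15000000000000002
```

In a pytest suite each tuple would normally be its own named case, for example through pytest.mark.parametrize, so a failure points at the exact boundary that moved.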
62 Tests Is Not the Point
The number is not impressive. Production codebases have thousands of tests. 62 is a small suite.
The point is that both bugs I described were real, they were in code I had reviewed, and they were causing the bot to do things I did not intend. The tests now catch either bug if it regresses. More importantly, the tests define what "correct" means, so that when I refactor the regime detector or touch the consistency enforcer, I know immediately if I broke something.
The bot runs without me. That is the design. But "runs without me" has to mean "runs correctly without me", not just "runs without crashing."
There is a difference. The test suite is how you enforce it.
If you buy the source code, the full suite ships with it. Run pytest tests/ before you configure a single API key. Watch it pass. Then you know what you are starting with.