The 3 AM Page: Why I'd Rather Build a System That Handles Its Own Failures
TL;DR / Key Takeaways
- Corporate on-call rotations socialize the cost of unreliable systems across engineers who did not build those systems and have limited power to fix them.
- Self-healing automation is not about avoiding responsibility — it is about building systems that define and handle failure explicitly at design time rather than reactively at 3 AM.
- Exponential backoff, structured logging, and watchdog processes are the engineering patterns that let a trading bot run unattended for weeks without human intervention.
- The goal of a well-built solo system is not that nothing ever fails. It is that when something fails, the system knows what to do before you do.
There is a particular silence at 3 AM when your phone goes off and you know before you even open your eyes what is happening. The PagerDuty alert. The service name. The brief moment where you map the service name to which team owns it, whether you are on rotation for it this week, and whether the runbook has any actual instructions or just a link to a Confluence page that was last updated in 2021.
I did this for years. Not because I was bad at building reliable systems, but because corporate engineering organizations use on-call rotations as a pressure release valve. When a system is fragile and nobody wants to prioritize fixing it properly, you put it on rotation. The rotation absorbs the failures so the service never gets marked as a P1 that requires real investment. The engineers on rotation fix the symptom every few weeks. The underlying problem persists until someone burns out badly enough that it becomes a staffing issue instead of a reliability issue.
I am not on rotation anymore. I have not been paged at 3 AM in over a year. My trading bot runs 24 hours a day on a headless Ubuntu VM, and when something goes wrong, it handles it.
What "Handle It" Actually Means
The self-healing part sounds more impressive than it is. It is not AI. It is not some sophisticated fault tolerance framework. It is a series of explicit decisions made at design time about what failure looks like and what the system should do when it encounters each type.
The Open-Meteo API returns an HTTP error. The Alpaca WebSocket drops its connection. The Kalshi order returns a partial fill. These are not unknown failure modes. They are predictable, enumerable, and solvable with patterns that have existed in software engineering for decades. The reason corporate systems still page humans for them is usually organizational, not technical.
My weather bot makes one external API call every six hours to fetch GFS ensemble data. If that call fails, I do not want to be woken up. I want the bot to wait two seconds and try again. If that fails, wait four seconds and try a third time. If all three attempts fail, skip this scoring cycle, log a structured warning to PostgreSQL, and wait for the next scheduled run.
# predictandprofit.io
import time
import logging
from typing import Callable, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")

def with_exponential_backoff(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    label: str = "operation",
) -> T:
    """
    Retry fn with exponential backoff.
    Raises RuntimeError, chained from the last exception, if all attempts fail.
    """
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            # No point sleeping after the final attempt; the caller decides next.
            retrying = attempt < max_attempts
            delay = base_delay * (2 ** (attempt - 1)) if retrying else 0.0
            logger.warning(
                "attempt_failed",
                extra={
                    "label": label,
                    "attempt": attempt,
                    "max_attempts": max_attempts,
                    "delay_seconds": delay,
                    "error": str(exc),
                },
            )
            if retrying:
                time.sleep(delay)
    raise RuntimeError(f"{label} failed after {max_attempts} attempts") from last_exc
That function lives in the utils module and wraps every external call the bot makes. When it logs a warning, it logs it as structured JSON. Not a string. Not a print statement. A dict that a monitoring query can aggregate, count, and surface if the failure rate crosses a threshold.
No page. No 3 AM alert. Just a row in a database that tells me what happened.
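To make that concrete, here is a minimal sketch of one way structured events can land in PostgreSQL. This is a sketch, not the exact handler that ships in the source: the class name, the psycopg2 dependency, and the column layout are illustrative, and the real bot_events schema has more fields than this.

import json
import logging

import psycopg2

# Standard LogRecord attributes; anything beyond these arrived via extra=.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}

class PostgresEventHandler(logging.Handler):
    """Illustrative handler: persist each structured record as a bot_events row."""

    def __init__(self, dsn: str) -> None:
        super().__init__()
        self.conn = psycopg2.connect(dsn)
        self.conn.autocommit = True

    def emit(self, record: logging.LogRecord) -> None:
        # Keep only the fields that were passed through extra={...}.
        payload = {k: v for k, v in record.__dict__.items() if k not in _STANDARD_ATTRS}
        try:
            with self.conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO bot_events (event_type, level, payload) VALUES (%s, %s, %s)",
                    (record.getMessage(), record.levelname, json.dumps(payload, default=str)),
                )
        except Exception:
            self.handleError(record)  # logging must never take the bot down

Attach something like it once at startup with logging.getLogger().addHandler(...), and every warning the backoff helper emits becomes a queryable row, with created_at filled in by a column default.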
The Alpaca Reconnection Problem
The Alpaca market data WebSocket is the most frequent source of recoverable failures in the dual-bot setup. WebSocket connections drop. This is expected behavior on any long-lived TCP connection exposed to real network conditions. Corporate network stacks solve this with load balancers, keepalives, and infrastructure teams whose job is to manage the layer below your application. On a home server on a residential ISP, you solve it in application code.
The reconnection logic is a loop. Connect. Subscribe to the symbols you care about. Process events. When the connection drops — and it will drop — catch the exception, log it, wait briefly, reconnect. The bot does not panic. It does not send me a message. It reconnects.
# predictandprofit.io
import asyncio
import logging
import time

from alpaca.data.live import StockDataStream

logger = logging.getLogger(__name__)

async def run_market_data_stream(symbols: list[str], handler, config: dict) -> None:
    """
    Persistent WebSocket loop with reconnection on failure.
    Runs indefinitely until the process is stopped.
    """
    consecutive_failures = 0
    max_consecutive = 10
    while True:
        started = time.monotonic()
        try:
            stream = StockDataStream(
                api_key=config["api_key"],
                secret_key=config["secret_key"],
            )
            stream.subscribe_quotes(handler, *symbols)
            logger.info("websocket_connected", extra={"symbol_count": len(symbols)})
            # _run_forever() is the coroutine behind the blocking run();
            # it returns or raises only when the connection dies.
            await stream._run_forever()
        except Exception as exc:
            # A session that survived for a while was a real connection;
            # only quick deaths count toward the escalation threshold.
            if time.monotonic() - started > 60:
                consecutive_failures = 0
            consecutive_failures += 1
            delay = min(2 ** consecutive_failures, 120)  # cap at 2 minutes
            logger.error(
                "websocket_disconnected",
                extra={
                    "error": str(exc),
                    "consecutive_failures": consecutive_failures,
                    "reconnect_delay_seconds": delay,
                },
            )
            if consecutive_failures >= max_consecutive:
                logger.critical(
                    "websocket_repeated_failure",
                    extra={"message": "Stopping reconnection loop after repeated failures"},
                )
                raise  # let systemd restart the entire process
            await asyncio.sleep(delay)
The consecutive_failures counter is the important part. Individual failures get a short delay and a reconnect, and a connection that survives for more than a minute resets the count. Repeated rapid failures get escalating delays and eventually raise out of the loop entirely. Systemd catches the process exit and restarts the bot from scratch. The PostgreSQL log has a complete record of every failure and its timestamp. When I look at it in the morning, I can see what happened without anyone having to tell me.
The corporate equivalent of this is an on-call engineer being paged, SSHing into a server, killing and restarting a process by hand, updating a ticket, and sending a Slack message saying "restarted, monitoring." The bot does all of that except the Slack message.
Systemd as the Outer Loop
The bot runs as a systemd service. This is the layer below the application code. If the Python process exits cleanly, systemd restarts it. If it exits with an error, systemd restarts it. If it crashes hard enough that the error handling inside does not fire, systemd restarts it.
The unit file for the Kalshi weather bot is about fifteen lines. The relevant part:
[Service]
ExecStart=/home/steve/.venv/bin/python /home/steve/predict_and_profit/main.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal
Restart=always means systemd does not care about exit codes. The process stops, systemd waits ten seconds, systemd starts it again. Combined with the application-level retry logic inside the bot, you have two layers of recovery before any human needs to get involved.
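For completeness, here is a sketch of what the top of that process can look like. run_bot is a placeholder for the real entry point, not the actual function in the source; the only job of this layer is to record an unhandled crash once and exit, so the systemd layer takes over.

import logging
import sys

logger = logging.getLogger(__name__)

def run_bot() -> None:
    """Placeholder for the real bot loop; retries and reconnection live inside."""
    raise NotImplementedError

def main() -> int:
    try:
        run_bot()
        return 0
    except Exception:
        # Last line of defense: log the crash, exit, let Restart=always do its job.
        logger.critical("unhandled_crash", exc_info=True)
        return 1

if __name__ == "__main__":
    sys.exit(main())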
I set this up once. I have not touched the unit file in months.
The Log Is the Postmortem
In a corporate on-call system, incidents generate tickets, which generate postmortems, which generate action items, which generate follow-up meetings to review whether the action items are done. The entire process exists because there is no single source of truth about what happened, so multiple people have to reconstruct it collaboratively.
When my bot has a failure event, the PostgreSQL bot_events table has the full record. Every retry attempt. Every reconnection. Every skipped scoring cycle. Every trade that was blocked because a filter was not satisfied. Every trade that executed. Timestamps accurate to the millisecond, in structured fields that can be queried with a single SQL statement.
-- Morning check: what happened while I was asleep?
SELECT
event_type,
COUNT(*) AS count,
MIN(created_at) AS first_seen,
MAX(created_at) AS last_seen
FROM bot_events
WHERE created_at >= NOW() - INTERVAL '8 hours'
GROUP BY event_type
ORDER BY count DESC;
That query gives me a complete picture of the last eight hours in under two seconds. No ticket. No postmortem meeting. No reconstructing a timeline from three different people's memory.
This is not because I am a better engineer than the people running those postmortems. It is because I designed the system from the start to answer the question "what happened?" without human involvement in the data collection step.
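Earlier I said a monitoring query can aggregate failures and surface them when the rate crosses a threshold. Here is a sketch of that morning check; the threshold values and the psycopg2 wiring are illustrative, not the shipped monitoring config.

import psycopg2

# Hypothetical alert thresholds per event type; tune to taste.
THRESHOLDS = {"attempt_failed": 20, "websocket_disconnected": 10}

def overnight_alerts(dsn: str, hours: int = 8) -> list[str]:
    """Return one line per event type that exceeded its threshold."""
    alerts = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT event_type, COUNT(*)
            FROM bot_events
            WHERE created_at >= NOW() - make_interval(hours => %s)
            GROUP BY event_type
            """,
            (hours,),
        )
        for event_type, count in cur.fetchall():
            limit = THRESHOLDS.get(event_type)
            if limit is not None and count > limit:
                alerts.append(f"{event_type}: {count} in last {hours}h (threshold {limit})")
    return alerts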
What It Cost Me
There is an honest version of this post, and the honest version admits that this system required real engineering investment upfront. The retry logic, the reconnection loop, the structured logging schema, the systemd configuration: none of it is complex individually, but designing it all to work together correctly took time and iteration.
Corporate on-call rotations persist not because organizations do not know how to build reliable systems, but because building reliable systems requires upfront investment that is hard to justify against quarterly delivery timelines. It is cheaper, in the short term, to put a human on rotation than to spend two weeks hardening the failure paths.
When you own the whole system yourself, the calculus inverts. There is no on-call budget. There is no rotation. If the system wakes you up, it is your problem to fix permanently, because you are the only person on the rotation and you would like to sleep through the night.
That constraint, as uncomfortable as it sounds, produces better engineering than any SLA review I ever sat in.
If you are a developer who is tired of being paged for other people's brittle systems and wants to build something you actually own end to end, the Predict & Profit source code includes the full watchdog, retry, and logging architecture described here.
Get the Source Code — $67 — use code REDDIT for 15% off.