Alpaca API TCP Timeouts: How I Fixed the Silent Connection Drops Killing My Stock Bot
TL;DR / Key Takeaways
- Alpaca's WebSocket streaming connection will silently drop after periods of low activity or network hiccups, and your bot will not crash — it will just stop receiving data.
- The standard websocket-client library does not auto-reconnect by default. You have to build that yourself.
- A background watchdog thread that checks the last-received heartbeat timestamp is the correct pattern for detecting a dead connection without blocking the main execution loop.
- Setting explicit ping_interval and ping_timeout values on the WebSocket connection prevents most silent drops before they happen.
When I first built the Alpaca stock bot, I tested it for a few hours and everything looked fine. Trades were executing, SQLite logs were clean, P&L looked reasonable. I let it run overnight and checked in the morning. Zero trades after 2 AM.
No error in the log. No exception. No crash. The process was running and healthy by every external measure. It had just quietly stopped doing anything useful.
This is the most dangerous kind of failure in an automated trading system: the one that looks like it is working.
What Actually Happened
The Alpaca bot connects to Alpaca's data streaming API over WebSocket to receive real-time price data and order updates. WebSocket is built on top of TCP. TCP connections have no inherent keepalive at the application layer unless you explicitly configure one. A connection that sits idle for long enough will be timed out by a network device somewhere between your server and Alpaca's infrastructure. Firewalls, NAT gateways, load balancers — any of them can do it.
When the TCP connection drops, the websocket-client library I was using does not raise an exception immediately. The read loop just stops producing messages. If you are not explicitly checking for silence, the bot sits there waiting for data that will never come.
After that first overnight failure I added logging to the data receive handler. On the next failure I could see exactly what happened: the last message timestamp was 2:07 AM, and nothing arrived after that for six hours. The bot did not notice.
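The check that exposed the failure is simple: stamp every incoming message with the current time, then compare that stamp to the clock. A minimal sketch (the function names here are illustrative, not from the bot's actual source):

```python
import time

last_received = time.time()

def on_message(ws, message):
    global last_received
    last_received = time.time()  # stamp every arrival

def seconds_silent() -> float:
    # how long the stream has gone without producing anything
    return time.time() - last_received
```

Logging `seconds_silent()` periodically is what turned "the bot did not notice" into a number I could alert on.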
Step 1: Explicit Ping Configuration
The simplest fix is to turn on WebSocket-level pings. The WebSocket protocol includes a built-in ping/pong mechanism. If you enable it, the client sends periodic ping frames and expects pong responses. If no pong comes back within the timeout window, the connection is considered dead and the library raises an exception.
```python
# predictandprofit.io
import websocket

def on_message(ws, message):
    # process message, update last_received timestamp
    pass

def on_error(ws, error):
    print(f"WebSocket error: {error}")

def on_close(ws, close_status_code, close_msg):
    print(f"Connection closed: {close_status_code} {close_msg}")

def on_open(ws):
    print("Connection established")

ws = websocket.WebSocketApp(
    "wss://stream.data.alpaca.markets/v2/iex",
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
    on_open=on_open,
)
ws.run_forever(
    ping_interval=30,  # send a ping every 30 seconds
    ping_timeout=10,   # if no pong within 10 seconds, close the connection
)
```
With ping_interval=30 and ping_timeout=10, a dead link is detected within about 40 seconds instead of sitting silent for hours. When the timeout triggers, on_error and then on_close fire, giving you a hook to reconnect.
This alone fixed the majority of overnight drops. But it does not handle the case where the connection is technically alive but the data stream has stopped — which happens when Alpaca's feed pauses during low-activity market hours.
Step 2: A Heartbeat Watchdog Thread
For full coverage, I run a background watchdog thread that monitors the timestamp of the last received message. If no message has arrived in a configurable window, the watchdog forces a reconnect regardless of whether the WebSocket reports itself as alive.
```python
import json
import threading
import time

import websocket

class AlpacaStreamingClient:
    def __init__(self, url: str, auth_headers: dict, reconnect_after_seconds: int = 120):
        self.url = url
        self.auth_headers = auth_headers
        self.reconnect_after_seconds = reconnect_after_seconds
        self.last_received = time.time()
        self.ws = None
        self._watchdog_thread = None
        self._running = False

    def _on_message(self, ws, message):
        self.last_received = time.time()
        # process message here

    def _on_error(self, ws, error):
        print(f"Stream error: {error}")

    def _on_close(self, ws, status, msg):
        print(f"Stream closed: {status}")
        if self._running:
            self._reconnect()

    def _on_open(self, ws):
        print("Stream connected")
        self._subscribe()

    def _subscribe(self):
        # send subscription message to Alpaca stream
        self.ws.send(json.dumps({
            "action": "subscribe",
            "trades": ["AAPL", "MSFT"],  # example tickers
        }))

    def _connect(self):
        self.ws = websocket.WebSocketApp(
            self.url,
            header=self.auth_headers,
            on_message=self._on_message,
            on_error=self._on_error,
            on_close=self._on_close,
            on_open=self._on_open,
        )
        self.ws.run_forever(ping_interval=30, ping_timeout=10)

    def _reconnect(self):
        time.sleep(5)  # brief pause before reconnect
        self.last_received = time.time()  # reset so the watchdog does not re-fire mid-reconnect
        print("Reconnecting to Alpaca stream...")
        self._connect()

    def _watchdog(self):
        while self._running:
            time.sleep(15)
            age = time.time() - self.last_received
            if age > self.reconnect_after_seconds:
                print(f"Watchdog: no data for {age:.0f}s. Forcing reconnect.")
                if self.ws:
                    self.ws.close()
                # on_close will trigger _reconnect

    def start(self):
        self._running = True
        self._watchdog_thread = threading.Thread(
            target=self._watchdog,
            daemon=True,
        )
        self._watchdog_thread.start()
        self._connect()

    def stop(self):
        self._running = False
        if self.ws:
            self.ws.close()
```
The watchdog checks every 15 seconds. If last_received is more than reconnect_after_seconds old, it closes the WebSocket. The on_close handler calls _reconnect(), which starts a fresh connection.
This covers the case where the TCP connection is alive and pings are succeeding but no actual trade data is arriving. That situation does not trigger the ping timeout — it looks like a healthy connection with no activity. The watchdog catches it.
Step 3: Exponential Backoff on Reconnects
The naive reconnect loop I showed above just waits 5 seconds and tries again. That is fine for a single failure. If Alpaca's streaming endpoint is having a bad moment and rejecting connections, you want to back off so you are not hammering the API every 5 seconds.
```python
import random
import time

def reconnect_with_backoff(connect_fn, max_attempts: int = 10):
    attempt = 0
    while attempt < max_attempts:
        try:
            connect_fn()
            return  # successful connection, exit
        except Exception as e:
            wait = min(2 ** attempt + random.uniform(0, 1), 120)
            print(f"Reconnect attempt {attempt + 1} failed: {e}. Waiting {wait:.1f}s.")
            time.sleep(wait)
            attempt += 1
    raise RuntimeError("Max reconnect attempts reached. Giving up.")
```
The backoff sequence: 1s, 2s, 4s, 8s, 16s, 32s, 64s, capped at 120s. Each step has a small random jitter to prevent multiple instances from all hammering the endpoint in sync. After 10 failures the bot raises and logs a critical error, which surfaces in the SQLite log and triggers an alert.
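The sequence (before jitter) is easy to verify in isolation:

```python
# Base waits for the first eight attempts: 2**attempt, capped at 120 seconds.
waits = [min(2 ** attempt, 120) for attempt in range(8)]
print(waits)  # [1, 2, 4, 8, 16, 32, 64, 120]
```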
The Specific Failure Mode I Saw in Practice
On my home network, the specific failure pattern was:
- No market activity between midnight and 6 AM.
- The streaming connection goes quiet — Alpaca sends no trade events because nothing is happening.
- After roughly 90 minutes of quiet, the NAT gateway in my router drops the TCP session from its connection table.
- Alpaca sends the next data point. The packet goes nowhere. No error on either side.
- From the bot's perspective: waiting for data. From Alpaca's perspective: the client dropped.
The ping mechanism only detects this if the application-layer ping fires before the NAT timeout. At ping_interval=30, a ping fires every 30 seconds and keeps the NAT table entry alive. This is why the interval matters: it has to be shorter than your router's NAT idle timeout. Most consumer routers default to 5 minutes; some enterprise gear goes as low as 2 minutes. 30 seconds covers both.
If you are running on AWS or a cloud VPS, the same problem exists at the VPC security group and load balancer layer. The fix is the same: keep pings frequent enough to prevent idle-timeout eviction.
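As an optional extra layer below the WebSocket pings, you can enable OS-level TCP keepalive on the underlying socket. This is a sketch, not part of the bot's actual source: websocket-client's run_forever accepts a sockopt argument, and the interval values below are assumptions you should tune to sit well under your NAT or load-balancer idle timeout.

```python
import socket

# SO_KEEPALIVE is portable; the probe-timing knobs below are Linux-specific.
keepalive_opts = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
if hasattr(socket, "TCP_KEEPIDLE"):
    keepalive_opts += [
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # idle seconds before first probe
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 15),  # seconds between probes
    ]

# Passed through when starting the connection, e.g.:
# ws.run_forever(sockopt=tuple(keepalive_opts), ping_interval=30, ping_timeout=10)
```

Treat this as belt-and-suspenders: the application-layer pings remain the primary defense, because they are visible in your own logs.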
What This Looks Like in the Logs
After adding the watchdog and ping configuration, the SQLite log now has a connection_events table:
```
2026-04-22 02:07:44 | PING_SENT
2026-04-22 02:07:44 | PONG_RECEIVED
2026-04-22 02:08:14 | PING_SENT
2026-04-22 02:08:14 | PONG_RECEIVED
...
```
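A minimal sketch of how such a table can be written (the table name comes from the log above; the log_event helper and the in-memory database are illustrative assumptions, not the bot's actual logging layer):

```python
import datetime
import sqlite3

conn = sqlite3.connect(":memory:")  # the real bot uses a file-backed database
conn.execute("CREATE TABLE IF NOT EXISTS connection_events (ts TEXT, event TEXT)")

def log_event(event: str) -> None:
    # timestamp format matches the log excerpt above
    ts = datetime.datetime.now().isoformat(sep=" ", timespec="seconds")
    conn.execute("INSERT INTO connection_events VALUES (?, ?)", (ts, event))
    conn.commit()

log_event("PING_SENT")
log_event("PONG_RECEIVED")
```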
The reconnect has only triggered twice in the past month. Both times it reconnected cleanly within 10 seconds, subscribed to the stream again, and resumed trading without manual intervention. The bot ran through both events without me knowing until I checked the logs the next day.
That is the goal. The failure should be boring and automatic, not a 2 AM alert requiring a manual restart.
The full Alpaca bot source code, including the streaming client, watchdog pattern, and SQLite logging layer, is part of the Predict & Profit dual-bot package.
Frequently Asked Questions
Q: Why use WebSocket instead of polling the REST API?
A: REST polling introduces latency proportional to your polling interval. For intraday trading decisions that depend on recent price action, WebSocket streaming gives you sub-second data delivery with no polling overhead.
Q: Does the watchdog thread cause race conditions?
A: The watchdog only reads last_received and calls ws.close(). The WebSocket library handles the close safely. The only shared mutable state is last_received, which is written by the message handler and read by the watchdog. In CPython, a single attribute assignment like this is atomic under the GIL, so no explicit lock is needed for this one value.
Q: What if the bot misses trades during a reconnect?
A: During the reconnect window (typically under 10 seconds), trade events are not received. For a bot trading on 5-minute or longer intervals, this is acceptable. For a high-frequency system with sub-second requirements, you would need a more sophisticated gap-fill mechanism, which is outside the scope of this setup.