Peer Swarm Heartbeat
Internal endpoint that tracks distributed OpenClaw gateway nodes across multiple machines, enabling the Intelligence Dashboard to aggregate swarm activity without direct LAN access.
Refs #3340
Problem
The AINative platform runs OpenClaw agent swarms on multiple physical machines (laptops, servers, cloud instances). Railway-hosted services cannot reach these machines over LAN because Railway's private network is isolated from local networks. The heartbeat system solves this by having each peer push its status to the centralized backend.
Architecture
+---------------------+       HTTPS POST        +--------------------+
| Peer Machine A      | ----------------------> | Railway Backend    |
| (OpenClaw Gateway)  |  /peer-swarm-heartbeat  | (FastAPI)          |
+---------------------+                         |                    |
                                                | peer_swarm_        |
+---------------------+       HTTPS POST        | heartbeat table    |
| Peer Machine B      | ----------------------> |                    |
| (OpenClaw Gateway)  |  /peer-swarm-heartbeat  +--------+-----------+
+---------------------+                                  |
                                                         | reads
                                                         v
                                                +--------------------+
                                                | Intelligence       |
                                                | Dashboard          |
                                                | (aggregated view)  |
                                                +--------------------+
Each peer machine runs a peer-swarm-watchdog cron every 5 minutes. The cron:
- Calls openclaw gateway call health --json on the local LAN gateway
- Extracts agent count and loop count from the response
- POSTs the data to the Railway backend via the public URL
The backend upserts the data into the peer_swarm_heartbeat table. The Intelligence Dashboard reads from this table to show aggregate swarm metrics.
Endpoint
POST /api/v1/internal/peer-swarm-heartbeat
Records (upserts) the latest heartbeat from a peer OpenClaw gateway node.
Status Code: 204 No Content on success
Authentication: Internal-only. The endpoint sits behind a Kong internal route, so no user-facing auth is required.
Request Body
| Field | Type | Required | Description |
|---|---|---|---|
| peer_id | string | Yes | Unique identifier for the peer node (e.g., tobys-macbook-air) |
| agents_active | integer | Yes | Number of active agents reported by the peer gateway. Must be >= 0. |
| improvement_loops | integer | No | Session/loop count from the peer gateway. Defaults to 0. Must be >= 0. |
| gateway_version | string or null | No | OpenClaw version string (e.g., 0.8.54). Defaults to null. |
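The rules in the table above can be sketched as a small validation function. This is a standard-library illustration only. The real endpoint is a FastAPI route and presumably validates with its own request model. The function name validate_heartbeat is hypothetical.

```python
from typing import Any, Optional


def _non_negative_int(name: str, value: Any) -> int:
    """Accept only integers >= 0, as the table requires for both counters."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer")
    if value < 0:
        raise ValueError(f"{name} must be >= 0")
    return value


def validate_heartbeat(payload: dict[str, Any]) -> dict[str, Any]:
    """Return a normalized heartbeat dict, or raise ValueError.

    Hypothetical helper: it mirrors the documented field rules, not the
    backend's actual request model.
    """
    peer_id = payload.get("peer_id")
    if not isinstance(peer_id, str):
        raise ValueError("peer_id is required and must be a string")
    if "agents_active" not in payload:
        raise ValueError("agents_active is required")

    gateway_version: Optional[str] = payload.get("gateway_version")
    if gateway_version is not None and not isinstance(gateway_version, str):
        raise ValueError("gateway_version must be a string or null")

    return {
        "peer_id": peer_id,
        "agents_active": _non_negative_int("agents_active", payload["agents_active"]),
        # Optional field: defaults to 0 when omitted.
        "improvement_loops": _non_negative_int(
            "improvement_loops", payload.get("improvement_loops", 0)
        ),
        # Optional field: defaults to None (JSON null) when omitted.
        "gateway_version": gateway_version,
    }
```

Passing the minimal body (only peer_id and agents_active) yields improvement_loops of 0 and gateway_version of None, matching the documented defaults.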
Database Table
CREATE TABLE peer_swarm_heartbeat (
    peer_id           TEXT PRIMARY KEY,
    agents_active     INTEGER NOT NULL DEFAULT 0,
    improvement_loops INTEGER NOT NULL DEFAULT 0,
    gateway_version   TEXT,
    last_seen_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_peer_swarm_heartbeat_last_seen
    ON peer_swarm_heartbeat(last_seen_at DESC);
The peer_id is the primary key. Each POST performs an INSERT ... ON CONFLICT (peer_id) DO UPDATE, so the table always holds one row per peer with the latest values.
Upsert Behavior
- If a peer with the given peer_id does not exist, a new row is inserted.
- If the peer already exists, all fields (agents_active, improvement_loops, gateway_version, last_seen_at) are overwritten with the new values.
- last_seen_at is always set to the current UTC timestamp by the server, not by the caller.
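The upsert rules above can be tried locally with a minimal sketch. It uses an in-memory SQLite database as a stand-in for the production PostgreSQL table, and stores last_seen_at as epoch seconds for simplicity. The helper names open_demo_db and record_heartbeat are hypothetical, and the statement is an approximation of what the backend runs, not a copy of it.

```python
import sqlite3
import time
from typing import Optional

# Same INSERT ... ON CONFLICT (peer_id) DO UPDATE shape as described above.
# SQLite (3.24+) and PostgreSQL both expose the incoming row as "excluded".
UPSERT_SQL = """
INSERT INTO peer_swarm_heartbeat
    (peer_id, agents_active, improvement_loops, gateway_version, last_seen_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (peer_id) DO UPDATE SET
    agents_active     = excluded.agents_active,
    improvement_loops = excluded.improvement_loops,
    gateway_version   = excluded.gateway_version,
    last_seen_at      = excluded.last_seen_at
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory table shaped like peer_swarm_heartbeat."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE peer_swarm_heartbeat (
            peer_id           TEXT PRIMARY KEY,
            agents_active     INTEGER NOT NULL DEFAULT 0,
            improvement_loops INTEGER NOT NULL DEFAULT 0,
            gateway_version   TEXT,
            last_seen_at      REAL NOT NULL
        )
        """
    )
    return conn


def record_heartbeat(
    conn: sqlite3.Connection,
    peer_id: str,
    agents_active: int,
    improvement_loops: int = 0,
    gateway_version: Optional[str] = None,
    now: Optional[float] = None,
) -> None:
    """Insert or overwrite the row for peer_id.

    The timestamp comes from the server side (here: the caller of this
    helper), never from the request payload.
    """
    seen_at = time.time() if now is None else now
    conn.execute(
        UPSERT_SQL,
        (peer_id, agents_active, improvement_loops, gateway_version, seen_at),
    )
    conn.commit()
```

Calling record_heartbeat twice with the same peer_id leaves a single row holding the second call's values, including the defaults for any optional fields the second call omitted.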
Staleness Detection
The system considers a peer offline if its last_seen_at timestamp is older than the staleness cutoff:
PEER_HEARTBEAT_MAX_AGE_SECS = 600 (10 minutes)
This value is configurable via the PEER_HEARTBEAT_MAX_AGE_SECS environment variable.
Since the watchdog cron runs every 5 minutes, a peer that misses two consecutive heartbeats is considered offline. When the Intelligence Dashboard queries the table, it applies a WHERE last_seen_at >= :cutoff filter, so stale peers are excluded from all aggregate metrics.
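The cutoff arithmetic can be sketched as follows. The environment variable name and the 600-second default come from this document. The function names are hypothetical and the backend may compute the cutoff differently.

```python
import os
from datetime import datetime, timedelta, timezone

DEFAULT_MAX_AGE_SECS = 600  # 10 minutes, the documented default


def max_age_secs() -> int:
    """Read the staleness window, falling back to the documented default."""
    return int(os.environ.get("PEER_HEARTBEAT_MAX_AGE_SECS", DEFAULT_MAX_AGE_SECS))


def staleness_cutoff(now: datetime) -> datetime:
    """Oldest last_seen_at that still counts as online."""
    return now - timedelta(seconds=max_age_secs())


def is_online(last_seen_at: datetime, now: datetime) -> bool:
    """Mirror the WHERE last_seen_at >= :cutoff filter for a single peer."""
    return last_seen_at >= staleness_cutoff(now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for minutes in (5, 10, 11):
        seen = now - timedelta(minutes=minutes)
        print(f"last seen {minutes} min ago -> online={is_online(seen, now)}")
```

With the default window and a 5-minute cron, a peer that sends one heartbeat and then misses the next two drops out of the totals as soon as its last heartbeat is more than 600 seconds old.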
What happens when a peer goes offline
- Its agents_active and improvement_loops are no longer included in the dashboard totals.
- The peer row remains in the table (it is not deleted). If the peer comes back online, the next heartbeat upsert updates the row and it becomes visible again.
Intelligence Dashboard Aggregation
The GET /api/v1/public/platform/intelligence endpoint reads peer swarm data through three helper functions:
_get_peer_swarm_stats(db)
Aggregates across all online peers:
SELECT COALESCE(SUM(agents_active), 0) AS total_agents,
COALESCE(SUM(improvement_loops), 0) AS total_loops,
COUNT(*) AS online_peers
FROM peer_swarm_heartbeat
WHERE last_seen_at >= :cutoff
The returned total_agents value (the sum of agents_active across online peers) is added to the dashboard's total agents_active count, alongside heartbeat-based and Agent Cloud counts. Likewise, total_loops is added to the dashboard's total improvement_loops.
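To see how the cutoff filter shapes the totals, the aggregate query can be run against sample rows. This sketch uses in-memory SQLite with epoch-second timestamps in place of PostgreSQL and TIMESTAMPTZ. The function names peer_swarm_stats and demo, and the sample peer old-laptop, are made up for illustration.

```python
import sqlite3

# The aggregate query from above, unchanged apart from layout.
STATS_SQL = """
SELECT COALESCE(SUM(agents_active), 0)     AS total_agents,
       COALESCE(SUM(improvement_loops), 0) AS total_loops,
       COUNT(*)                            AS online_peers
FROM peer_swarm_heartbeat
WHERE last_seen_at >= :cutoff
"""


def peer_swarm_stats(conn: sqlite3.Connection, now: float, max_age_secs: int = 600) -> dict:
    """Sum agents and loops over peers seen within the staleness window."""
    row = conn.execute(STATS_SQL, {"cutoff": now - max_age_secs}).fetchone()
    return {"total_agents": row[0], "total_loops": row[1], "online_peers": row[2]}


def demo() -> dict:
    """Two fresh peers and one stale peer; only the fresh ones are counted."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peer_swarm_heartbeat ("
        " peer_id TEXT PRIMARY KEY,"
        " agents_active INTEGER NOT NULL,"
        " improvement_loops INTEGER NOT NULL,"
        " last_seen_at REAL NOT NULL)"
    )
    now = 10_000.0
    conn.executemany(
        "INSERT INTO peer_swarm_heartbeat VALUES (?, ?, ?, ?)",
        [
            ("tobys-macbook-air", 12, 47, now - 60),   # 1 minute old: online
            ("build-server-01", 4, 0, now - 300),      # 5 minutes old: online
            ("old-laptop", 7, 99, now - 3600),         # 1 hour old: stale
        ],
    )
    return peer_swarm_stats(conn, now)


if __name__ == "__main__":
    print(demo())
```

The COALESCE calls matter when no peer is online: SUM over zero rows is NULL, and the query then reports 0 for both totals instead.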
_get_peer_swarm_events(db)
Generates per-agent event cards from the agent roster and run log, plus machine-level summary cards from the heartbeat table. Each online peer produces an AgentEvent card that shows:
- The peer name (formatted from peer_id, e.g., "Tobys Macbook Air")
- The number of active agents (e.g., "running 12 agents")
- How recently the heartbeat was received
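One plausible way to derive the card text listed above is sketched below. The exact formatting rules in platform_intelligence.py are not shown in this document, so the helper names, the hyphen-to-space title-casing rule, and the "Nm ago" wording are all assumptions.

```python
def format_peer_name(peer_id: str) -> str:
    """Turn a hyphenated peer_id into a display name.

    Assumed rule: replace hyphens with spaces and title-case each word,
    which reproduces the documented "Tobys Macbook Air" example.
    """
    return peer_id.replace("-", " ").title()


def format_age(age_seconds: float) -> str:
    """Describe how long ago the heartbeat arrived (hypothetical wording)."""
    if age_seconds < 60:
        return "just now"
    return f"{int(age_seconds // 60)}m ago"


def peer_summary(peer_id: str, agents_active: int, age_seconds: float) -> str:
    """Combine the three card elements into one line of text."""
    return (
        f"{format_peer_name(peer_id)} running {agents_active} agents "
        f"({format_age(age_seconds)})"
    )


if __name__ == "__main__":
    print(peer_summary("tobys-macbook-air", 12, 125))
```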
_get_tasks_completed_today(db)
The tasks_completed_today metric on the dashboard includes peer swarm data:
SELECT COALESCE(SUM(improvement_loops), 0)
FROM peer_swarm_heartbeat
WHERE last_seen_at >= :cutoff
Each improvement_loops count represents completed watchdog/cron cycles, so this supplements the formal agent_run_log task count.
Example Requests
Basic heartbeat
curl -X POST https://ainative-browser-builder.up.railway.app/api/v1/internal/peer-swarm-heartbeat \
-H "Content-Type: application/json" \
-d '{
"peer_id": "tobys-macbook-air",
"agents_active": 12,
"improvement_loops": 47,
"gateway_version": "0.8.54"
}'
Response: 204 No Content
Minimal heartbeat (no optional fields)
curl -X POST https://ainative-browser-builder.up.railway.app/api/v1/internal/peer-swarm-heartbeat \
-H "Content-Type: application/json" \
-d '{
"peer_id": "build-server-01",
"agents_active": 4
}'
Response: 204 No Content
Querying the dashboard to see aggregated data
curl -s https://ainative-browser-builder.up.railway.app/api/v1/public/platform/intelligence \
| python3 -m json.tool
The response includes stats.agents_active and stats.improvement_loops, both of which incorporate peer swarm data from all online peers.
Watchdog Cron Setup
The peer-swarm-watchdog is typically configured as a system cron on each machine:
*/5 * * * * /path/to/peer-swarm-watchdog.sh
The script collects local gateway health data and POSTs it to the Railway backend. Since the local machine can reach the LAN gateway (which Railway cannot), this push-based design bridges the network gap.
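The watchdog's three steps could look roughly like the following Python sketch. The contents of peer-swarm-watchdog.sh are not shown here, so this is an illustration rather than the actual script. In particular, the health-response keys agent_count, loop_count and version are hypothetical placeholders, as is using the hostname as peer_id. Only the CLI command, the URL and the request fields come from this document.

```python
import json
import socket
import subprocess
import urllib.request

HEARTBEAT_URL = (
    "https://ainative-browser-builder.up.railway.app"
    "/api/v1/internal/peer-swarm-heartbeat"
)


def read_gateway_health() -> dict:
    """Step 1: query the local gateway and parse its JSON output."""
    result = subprocess.run(
        ["openclaw", "gateway", "call", "health", "--json"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return json.loads(result.stdout)


def build_payload(peer_id: str, health: dict) -> dict:
    """Step 2: extract the counts and shape them into the request body.

    ASSUMPTION: "agent_count", "loop_count" and "version" are placeholder
    keys. Adjust them to the real shape of the health response.
    """
    return {
        "peer_id": peer_id,
        "agents_active": int(health.get("agent_count", 0)),
        "improvement_loops": int(health.get("loop_count", 0)),
        "gateway_version": health.get("version"),
    }


def build_request(payload: dict) -> urllib.request.Request:
    """Step 3 (preparation): a JSON POST to the public backend URL."""
    return urllib.request.Request(
        HEARTBEAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def main() -> None:
    """Run one watchdog cycle; intended to be invoked from cron."""
    # ASSUMPTION: the hostname doubles as peer_id.
    payload = build_payload(socket.gethostname(), read_gateway_health())
    with urllib.request.urlopen(build_request(payload), timeout=15) as resp:
        if resp.status != 204:
            raise RuntimeError(f"unexpected status {resp.status}")
```

A cron entry would then call main() from a small wrapper script. Any failure raises, so cron's own logging or mail captures a broken gateway or an unreachable backend.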
Related Files
| File | Purpose |
|---|---|
| src/backend/app/api/internal/peer_swarm_heartbeat.py | Endpoint implementation |
| src/backend/app/api/v1/endpoints/platform_intelligence.py | Dashboard aggregation logic |
| scripts/sync-production-schema.py | Table creation (search for peer_swarm_heartbeat) |