Peer Swarm Heartbeat

Internal endpoint that tracks distributed OpenClaw gateway nodes across multiple machines, enabling the Intelligence Dashboard to aggregate swarm activity without direct LAN access.

Refs #3340


Problem

The AINative platform runs OpenClaw agent swarms on multiple physical machines (laptops, servers, cloud instances). Railway-hosted services cannot reach these machines over LAN because Railway's private network is isolated from local networks. The heartbeat system solves this by having each peer push its status to the centralized backend.

Architecture

+---------------------+        HTTPS POST        +--------------------+
| Peer Machine A      | -----------------------> | Railway Backend    |
| (OpenClaw Gateway)  |  /peer-swarm-heartbeat   | (FastAPI)          |
+---------------------+                          |                    |
                                                 | peer_swarm_        |
+---------------------+        HTTPS POST        | heartbeat table    |
| Peer Machine B      | -----------------------> |                    |
| (OpenClaw Gateway)  |  /peer-swarm-heartbeat   +--------+-----------+
+---------------------+                                   |
                                                          | reads
                                                          v
                                                 +--------------------+
                                                 | Intelligence       |
                                                 | Dashboard          |
                                                 | (aggregated view)  |
                                                 +--------------------+

Each peer machine runs a peer-swarm-watchdog cron every 5 minutes. The cron:

  1. Calls openclaw gateway call health --json on the local LAN gateway
  2. Extracts agent count and loop count from the response
  3. POSTs the data to the Railway backend via the public URL

The backend upserts the data into the peer_swarm_heartbeat table. The Intelligence Dashboard reads from this table to show aggregate swarm metrics.


Endpoint

POST /api/v1/internal/peer-swarm-heartbeat

Records (upserts) the latest heartbeat from a peer OpenClaw gateway node.

Status Code: 204 No Content on success

Authentication: Internal-only. The endpoint sits behind a Kong internal route, so no user-facing authentication is required.

Request Body

Field             | Type        | Required | Description
------------------+-------------+----------+------------------------------------------------------------------------
peer_id           | string      | Yes      | Unique identifier for the peer node (e.g., tobys-macbook-air)
agents_active     | integer     | Yes      | Number of active agents reported by the peer gateway. Must be >= 0.
improvement_loops | integer     | No       | Session/loop count from the peer gateway. Defaults to 0. Must be >= 0.
gateway_version   | string|null | No       | OpenClaw version string (e.g., 0.8.54). Defaults to null.
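
The constraints in the table can be written down as a small validator. The following is a minimal sketch using only the standard library; it is not the endpoint's actual request model, and `HeartbeatPayload` and `parse_heartbeat` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class HeartbeatPayload:
    """Hypothetical in-memory form of the request body described above."""
    peer_id: str
    agents_active: int
    improvement_loops: int = 0
    gateway_version: Optional[str] = None


def _non_negative_int(name: str, value: Any) -> int:
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be an integer")
    if value < 0:
        raise ValueError(f"{name} must be >= 0")
    return value


def parse_heartbeat(body: Mapping[str, Any]) -> HeartbeatPayload:
    """Validate a decoded JSON body and apply the documented defaults."""
    peer_id = body.get("peer_id")
    if not isinstance(peer_id, str):
        raise ValueError("peer_id is required and must be a string")
    if "agents_active" not in body:
        raise ValueError("agents_active is required")
    version = body.get("gateway_version")
    if version is not None and not isinstance(version, str):
        raise ValueError("gateway_version must be a string or null")
    return HeartbeatPayload(
        peer_id=peer_id,
        agents_active=_non_negative_int("agents_active", body["agents_active"]),
        improvement_loops=_non_negative_int(
            "improvement_loops", body.get("improvement_loops", 0)
        ),
        gateway_version=version,
    )
```

Calling `parse_heartbeat({"peer_id": "build-server-01", "agents_active": 4})` fills in `improvement_loops=0` and `gateway_version=None`, matching the defaults listed in the table.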

Database Table

CREATE TABLE peer_swarm_heartbeat (
    peer_id           TEXT PRIMARY KEY,
    agents_active     INTEGER NOT NULL DEFAULT 0,
    improvement_loops INTEGER NOT NULL DEFAULT 0,
    gateway_version   TEXT,
    last_seen_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_peer_swarm_heartbeat_last_seen
    ON peer_swarm_heartbeat(last_seen_at DESC);

The peer_id is the primary key. Each POST performs an INSERT ... ON CONFLICT (peer_id) DO UPDATE, so the table always holds one row per peer with the latest values.

Upsert Behavior

  • If a peer with the given peer_id does not exist, a new row is inserted.
  • If the peer already exists, all fields (agents_active, improvement_loops, gateway_version, last_seen_at) are overwritten with the new values.
  • last_seen_at is always set to the current UTC timestamp by the server, not by the caller.
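
The upsert rules above can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the production database (an assumption made so the example runs anywhere; SQLite 3.24 or newer accepts the same `ON CONFLICT ... DO UPDATE` clause). `make_db` and `record_heartbeat` are hypothetical helper names, and the timestamp column is stored as text because SQLite has no `TIMESTAMPTZ` type.

```python
import sqlite3

# The timestamp comes from the database (CURRENT_TIMESTAMP is UTC in SQLite),
# never from the caller, mirroring the server-side rule described above.
UPSERT_SQL = """
INSERT INTO peer_swarm_heartbeat
    (peer_id, agents_active, improvement_loops, gateway_version, last_seen_at)
VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)
ON CONFLICT (peer_id) DO UPDATE SET
    agents_active     = excluded.agents_active,
    improvement_loops = excluded.improvement_loops,
    gateway_version   = excluded.gateway_version,
    last_seen_at      = excluded.last_seen_at
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the peer_swarm_heartbeat table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE peer_swarm_heartbeat (
            peer_id           TEXT PRIMARY KEY,
            agents_active     INTEGER NOT NULL DEFAULT 0,
            improvement_loops INTEGER NOT NULL DEFAULT 0,
            gateway_version   TEXT,
            last_seen_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def record_heartbeat(conn, peer_id, agents_active,
                     improvement_loops=0, gateway_version=None):
    """Insert a new peer row, or overwrite every field of the existing one."""
    conn.execute(
        UPSERT_SQL,
        (peer_id, agents_active, improvement_loops, gateway_version),
    )
    conn.commit()
```

Recording two heartbeats for the same `peer_id` leaves a single row holding the second set of values, including the defaults for any optional fields the second call omitted.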

Staleness Detection

The system considers a peer offline if its last_seen_at timestamp is older than the staleness cutoff:

PEER_HEARTBEAT_MAX_AGE_SECS = 600  (10 minutes)

This value is configurable via the PEER_HEARTBEAT_MAX_AGE_SECS environment variable.

Since the watchdog cron runs every 5 minutes, a peer that misses two consecutive heartbeats is considered offline. When the Intelligence Dashboard queries the table, it applies a WHERE last_seen_at >= :cutoff filter, so stale peers are excluded from all aggregate metrics.

What happens when a peer goes offline

  • Its agents_active and improvement_loops are no longer included in the dashboard totals.
  • The peer row remains in the table (it is not deleted). If the peer comes back online, the next heartbeat upsert updates the row and it becomes visible again.
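
The staleness rule reduces to one timestamp comparison. Here is a minimal sketch of that check; the function names are hypothetical, and the only values taken from this page are the environment variable name, its 600-second default, and the `>=` comparison against the cutoff.

```python
import os
from datetime import datetime, timedelta
from typing import Optional

# Default documented above; the environment variable overrides it.
DEFAULT_MAX_AGE_SECS = 600


def configured_max_age_secs() -> int:
    """Read PEER_HEARTBEAT_MAX_AGE_SECS, falling back to the default."""
    return int(os.environ.get("PEER_HEARTBEAT_MAX_AGE_SECS", DEFAULT_MAX_AGE_SECS))


def staleness_cutoff(now: datetime, max_age_secs: Optional[int] = None) -> datetime:
    """Return the oldest last_seen_at that still counts as online."""
    if max_age_secs is None:
        max_age_secs = configured_max_age_secs()
    return now - timedelta(seconds=max_age_secs)


def is_online(last_seen_at: datetime, now: datetime,
              max_age_secs: Optional[int] = None) -> bool:
    """Mirror the dashboard filter: last_seen_at >= cutoff."""
    return last_seen_at >= staleness_cutoff(now, max_age_secs)
```

With the default cutoff, a peer last seen 9 minutes ago is online and one last seen 11 minutes ago is not. Nothing is deleted in either case; the check only decides whether a row is counted.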

Intelligence Dashboard Aggregation

The GET /api/v1/public/platform/intelligence endpoint reads peer swarm data through the following helper functions:

_get_peer_swarm_stats(db)

Aggregates across all online peers:

SELECT COALESCE(SUM(agents_active), 0)     AS total_agents,
       COALESCE(SUM(improvement_loops), 0) AS total_loops,
       COUNT(*)                            AS online_peers
FROM peer_swarm_heartbeat
WHERE last_seen_at >= :cutoff

The returned total_agents is added to the dashboard's total agents_active count (alongside heartbeat-based and Agent Cloud counts), and total_loops is added to the dashboard's total improvement_loops.
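
To make the aggregation semantics concrete, the sketch below mirrors the query in plain Python over in-memory rows. It is an illustration only, not the helper's implementation; `PeerRow` and `aggregate_online_peers` are hypothetical names, while the result keys follow the column aliases in the SQL.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, NamedTuple


class PeerRow(NamedTuple):
    """One row of peer_swarm_heartbeat, reduced to the aggregated columns."""
    peer_id: str
    agents_active: int
    improvement_loops: int
    last_seen_at: datetime


def aggregate_online_peers(rows: Iterable[PeerRow], cutoff: datetime) -> Dict[str, int]:
    """Sum agents and loops over peers seen at or after the cutoff."""
    online = [row for row in rows if row.last_seen_at >= cutoff]
    return {
        # sum() of an empty list is 0, which matches COALESCE(SUM(...), 0).
        "total_agents": sum(row.agents_active for row in online),
        "total_loops": sum(row.improvement_loops for row in online),
        "online_peers": len(online),
    }


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
sample_rows = [
    PeerRow("tobys-macbook-air", 12, 47, now - timedelta(minutes=2)),
    PeerRow("build-server-01", 4, 0, now - timedelta(minutes=3)),
    PeerRow("stale-peer", 7, 5, now - timedelta(minutes=30)),  # excluded
]
stats = aggregate_online_peers(sample_rows, cutoff=now - timedelta(seconds=600))
```

In this sample the stale peer's 7 agents and 5 loops are left out, so only the two recent peers contribute to the totals.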

_get_peer_swarm_events(db)

Generates per-agent event cards from the agent roster and run log, plus machine-level summary cards from the heartbeat table. Each online peer produces an AgentEvent card that shows:

  • The peer name (formatted from peer_id, e.g., "Tobys Macbook Air")
  • The number of active agents (e.g., "running 12 agents")
  • How recently the heartbeat was received
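
As a rough illustration of the machine-level card text, the sketch below formats a peer name and a one-line summary. Only the name formatting ("Tobys Macbook Air") and the "running 12 agents" phrasing come from this page; the function names, the age wording, and the overall sentence layout are assumptions.

```python
from datetime import datetime, timedelta, timezone


def format_peer_name(peer_id: str) -> str:
    """Turn a hyphenated peer_id into a display name."""
    return peer_id.replace("-", " ").title()


def describe_age(last_seen_at: datetime, now: datetime) -> str:
    """Describe how recently the heartbeat arrived (wording is assumed)."""
    seconds = max(0, int((now - last_seen_at).total_seconds()))
    if seconds < 60:
        return "just now"
    return f"{seconds // 60} min ago"


def peer_summary(peer_id: str, agents_active: int,
                 last_seen_at: datetime, now: datetime) -> str:
    """Build a one-line summary for a single online peer."""
    name = format_peer_name(peer_id)
    age = describe_age(last_seen_at, now)
    return f"{name} running {agents_active} agents ({age})"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
summary = peer_summary("tobys-macbook-air", 12, now - timedelta(minutes=3), now)
```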

_get_tasks_completed_today(db)

The tasks_completed_today metric on the dashboard includes peer swarm data:

SELECT COALESCE(SUM(improvement_loops), 0)
FROM peer_swarm_heartbeat
WHERE last_seen_at >= :cutoff

Each improvement_loops count represents completed watchdog/cron cycles, so this supplements the formal agent_run_log task count.


Example Requests

Basic heartbeat

curl -X POST https://ainative-browser-builder.up.railway.app/api/v1/internal/peer-swarm-heartbeat \
-H "Content-Type: application/json" \
-d '{
"peer_id": "tobys-macbook-air",
"agents_active": 12,
"improvement_loops": 47,
"gateway_version": "0.8.54"
}'

Response: 204 No Content

Minimal heartbeat (no optional fields)

curl -X POST https://ainative-browser-builder.up.railway.app/api/v1/internal/peer-swarm-heartbeat \
-H "Content-Type: application/json" \
-d '{
"peer_id": "build-server-01",
"agents_active": 4
}'

Response: 204 No Content

Querying the dashboard to see aggregated data

curl -s https://ainative-browser-builder.up.railway.app/api/v1/public/platform/intelligence \
| python3 -m json.tool

The response includes stats.agents_active and stats.improvement_loops, both of which incorporate peer swarm data from all online peers.


Watchdog Cron Setup

The peer-swarm-watchdog is typically configured as a system cron on each machine:

*/5 * * * * /path/to/peer-swarm-watchdog.sh

The script collects local gateway health data and POSTs it to the Railway backend. Since the local machine can reach the LAN gateway (which Railway cannot), this push-based design bridges the network gap.
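
The watchdog's three steps could look roughly like the following Python sketch. It is not the actual peer-swarm-watchdog script. The health command and the endpoint URL are taken from this page, but the JSON keys read from the health response (`agents_active`, `improvement_loops`, `version`) and the use of the hostname as `peer_id` are assumptions, since the response format is not documented here.

```python
import json
import socket
import subprocess
import urllib.request

HEARTBEAT_URL = (
    "https://ainative-browser-builder.up.railway.app"
    "/api/v1/internal/peer-swarm-heartbeat"
)


def read_gateway_health() -> dict:
    """Step 1: run the health command on the local gateway and decode it."""
    result = subprocess.run(
        ["openclaw", "gateway", "call", "health", "--json"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return json.loads(result.stdout)


def build_payload(health: dict, peer_id: str) -> dict:
    """Step 2: extract the counts into the documented request body."""
    # ASSUMPTION: these keys are placeholders; the real health response
    # may use different names or nesting.
    return {
        "peer_id": peer_id,
        "agents_active": int(health.get("agents_active", 0)),
        "improvement_loops": int(health.get("improvement_loops", 0)),
        "gateway_version": health.get("version"),
    }


def build_request(payload: dict, url: str = HEARTBEAT_URL) -> urllib.request.Request:
    """Step 3a: prepare the JSON POST without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def main() -> int:
    """Step 3b: send the heartbeat; the endpoint answers 204 on success."""
    # ASSUMPTION: the machine hostname is used as peer_id.
    payload = build_payload(read_gateway_health(), peer_id=socket.gethostname())
    with urllib.request.urlopen(build_request(payload), timeout=15) as response:
        return response.status
```

A cron entry like the one above would then invoke a small wrapper that calls `main()`. Keeping payload construction separate from the network call makes the extraction logic testable without a running gateway.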


File                                                      | Purpose
----------------------------------------------------------+-------------------------------------------------
src/backend/app/api/internal/peer_swarm_heartbeat.py      | Endpoint implementation
src/backend/app/api/v1/endpoints/platform_intelligence.py | Dashboard aggregation logic
scripts/sync-production-schema.py                         | Table creation (search for peer_swarm_heartbeat)