Kimberlite Agent Protocol
This document describes the protocol for communication between Kimberlite cluster agents and control plane systems.
Overview
The agent protocol enables:
- Health monitoring: Agents report node status via heartbeats
- Configuration management: Control planes push configuration updates
- Observability: Metrics and logs are streamed from agents
- Administration: Control planes can issue administrative commands
- Authentication: Secure agent identity verification
- Flow control: Backpressure handling for high-volume data
- Health checks: On-demand health verification
Transport
The protocol uses WebSocket connections with JSON-encoded messages. Agents connect to the control plane endpoint and maintain a persistent connection.
Agent                  Control Plane
|                                  |
|-------- WebSocket Connect ------>|
|                                  |
|<------- AuthChallenge -----------|
|-------- AuthResponse ----------->|
|                                  |
|<------- HeartbeatRequest --------|
|-------- Heartbeat -------------->|
|                                  |
|-------- MetricsBatch ----------->|
|-------- LogsBatch -------------->|
|                                  |
|<------- ConfigUpdate ------------|
|-------- ConfigAck -------------->|
|                                  |
|<------- AdminCommand ------------|
|-------- ControlAck ------------->|
|                                  |
|<------- HealthCheck -------------|
|-------- HealthCheckResponse ---->|
|                                  |
|<------- FlowControl -------------|
|                                  |
Message Types
Agent → Control Plane
Heartbeat
Periodic status update sent by the agent.
Fields:
| Field | Type | Description |
|---|---|---|
| node_id | string | Unique identifier for the node |
| status | enum | healthy, degraded, unhealthy, starting, stopping |
| role | enum | leader, follower, candidate, learner |
| resources | object | Current resource utilization |
| replication | object? | Replication status (followers only) |
| buffer_stats | array | Buffer statistics for backpressure monitoring |
Replication object:
| Field | Type | Description |
|---|---|---|
| leader_id | string | ID of the leader being replicated from |
| lag_ms | u64 | Replication lag in milliseconds |
| pending_entries | u64 | Number of entries waiting to replicate |
Buffer stats object:
| Field | Type | Description |
|---|---|---|
| stream_type | enum | heartbeats, metrics, logs, all |
| state | enum | empty, normal, high, critical |
| pending_items | u64 | Items currently buffered |
| capacity | u64 | Maximum buffer capacity |
| dropped_count | u64 | Items dropped since last report |
| oldest_item_age_ms | u64 | Age of the oldest buffered item in milliseconds |
MetricsBatch
Batch of collected metric samples.
Metric sample fields:
| Field | Type | Description |
|---|---|---|
| name | string | Metric name (e.g., kmb.writes.total) |
| value | f64 | Metric value |
| timestamp_ms | u64 | Unix timestamp in milliseconds |
| labels | array | Optional key-value pairs |
LogsBatch
Batch of log entries.
Log entry fields:
| Field | Type | Description |
|---|---|---|
| timestamp_ms | u64 | Unix timestamp in milliseconds |
| level | enum | trace, debug, info, warn, error |
| message | string | Log message |
| fields | array | Optional structured fields |
ConfigAck
Acknowledgment of a configuration update.
Fields:
| Field | Type | Description |
|---|---|---|
| version | u64 | Configuration version being acknowledged |
| success | bool | Whether configuration was applied |
| error | string? | Error message if failed |
ControlAck
Acknowledgment of a control message (AdminCommand, etc.).
Fields:
| Field | Type | Description |
|---|---|---|
| message_id | u64 | ID of the message being acknowledged |
| success | bool | Whether the command succeeded |
| error | string? | Error message if failed |
| result | string? | JSON-encoded command-specific result |
| duration_ms | u64 | Time taken to execute the command, in milliseconds |
AuthResponse
Response to an authentication challenge.
Credential types:
| Type | Fields | Description |
|---|---|---|
| bearer | token | JWT or API key |
| pre_shared_key | key_id, signature | HMAC signature of challenge |
| certificate | fingerprint | SHA-256 certificate fingerprint |
HealthCheckResponse
Response to a health check request.
Fields:
| Field | Type | Description |
|---|---|---|
| request_id | u64 | Correlation ID from request |
| status | enum | healthy, degraded, unhealthy |
| checks | array | Individual check results |
| duration_ms | u64 | Time taken for all checks, in milliseconds |
Control Plane → Agent
ConfigUpdate
Push new configuration to the agent.
Fields:
| Field | Type | Description |
|---|---|---|
| message_id | u64? | Optional correlation ID for ack |
| version | u64 | Configuration version |
| config | string | JSON-encoded configuration |
| checksum | string | Integrity checksum |
The agent should verify the checksum before applying the configuration and respond with a ConfigAck.
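The verify-then-apply rule can be sketched in Rust as follows. This is a minimal illustration, not the crate's API: the struct layouts simply mirror the field tables in this document, and `checksum_of` and `apply` are hypothetical hooks supplied by the caller, because the checksum algorithm is not specified here.

```rust
/// Fields of ConfigUpdate as listed in the table above.
pub struct ConfigUpdate {
    pub message_id: Option<u64>,
    pub version: u64,
    pub config: String,
    pub checksum: String,
}

/// Fields of ConfigAck as listed earlier in this document.
#[derive(Debug, PartialEq)]
pub struct ConfigAck {
    pub version: u64,
    pub success: bool,
    pub error: Option<String>,
}

/// Verify the checksum, apply the configuration, and build the acknowledgment.
/// `checksum_of` and `apply` are hypothetical hooks; the real checksum
/// algorithm is not stated in this document.
pub fn handle_config_update<C, A>(update: &ConfigUpdate, checksum_of: C, apply: A) -> ConfigAck
where
    C: Fn(&str) -> String,
    A: FnOnce(&str) -> Result<(), String>,
{
    let actual = checksum_of(&update.config);
    if actual != update.checksum {
        // A configuration that fails the integrity check is never applied.
        return ConfigAck {
            version: update.version,
            success: false,
            error: Some(format!(
                "checksum mismatch: expected {}, computed {}",
                update.checksum, actual
            )),
        };
    }
    match apply(&update.config) {
        Ok(()) => ConfigAck { version: update.version, success: true, error: None },
        Err(reason) => ConfigAck { version: update.version, success: false, error: Some(reason) },
    }
}
```

Returning a failed ConfigAck on a mismatch, rather than dropping the message silently, lets the control plane see which version was rejected and why.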
AdminCommand
Execute an administrative command.
Available commands:
| Command | Fields | Description |
|---|---|---|
| take_snapshot | - | Trigger a state snapshot |
| compact_log | up_to_offset | Compact log up to offset |
| step_down | - | Step down from leader role |
| transfer_leadership | target_node_id | Transfer leadership to the target node |
| pause_replication | - | Pause replication for maintenance |
| resume_replication | - | Resume replication |
The agent should respond with a ControlAck containing the message_id.
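One way to model the command set and its acknowledgment in Rust is sketched below. The enum and struct mirror the tables in this document, but the names, the `u64` type chosen for `up_to_offset`, and the `execute` hook are assumptions made for illustration rather than the crate's actual definitions.

```rust
use std::time::Instant;

/// Commands from the table above. The integer type of `up_to_offset` is assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum AdminCommand {
    TakeSnapshot,
    CompactLog { up_to_offset: u64 },
    StepDown,
    TransferLeadership { target_node_id: String },
    PauseReplication,
    ResumeReplication,
}

/// Fields of ControlAck as listed earlier in this document.
#[derive(Debug, Clone, PartialEq)]
pub struct ControlAck {
    pub message_id: u64,
    pub success: bool,
    pub error: Option<String>,
    pub result: Option<String>,
    pub duration_ms: u64,
}

/// Run `execute` (a hypothetical hook) and wrap its outcome in a ControlAck
/// that echoes the `message_id` of the command being acknowledged.
pub fn acknowledge<F>(message_id: u64, command: &AdminCommand, execute: F) -> ControlAck
where
    F: FnOnce(&AdminCommand) -> Result<Option<String>, String>,
{
    let started = Instant::now();
    let outcome = execute(command);
    let duration_ms = started.elapsed().as_millis() as u64;
    // Success carries an optional JSON result; failure carries an error message.
    let (success, result, error) = match outcome {
        Ok(result) => (true, result, None),
        Err(reason) => (false, None, Some(reason)),
    };
    ControlAck { message_id, success, error, result, duration_ms }
}
```

Measuring the duration around the hook keeps `duration_ms` consistent across commands without each command handler having to time itself.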
HeartbeatRequest
Request an immediate heartbeat from the agent.
Shutdown
Request graceful shutdown.
AuthChallenge
Authentication challenge sent after connection.
Fields:
| Field | Type | Description |
|---|---|---|
| challenge | string | Random challenge for PSK auth |
| supported_methods | array | Supported auth methods |
| expires_in_ms | u64 | Time until the challenge expires, in milliseconds |
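As an illustration of how an agent might act on these fields, the sketch below picks an authentication method and checks whether the challenge is still valid. It assumes that the entries of `supported_methods` use the credential type names from the AuthResponse table (bearer, pre_shared_key, certificate) and that `expires_in_ms` is counted from the moment the agent receives the challenge; neither detail is confirmed by this document. Computing the HMAC signature for pre-shared-key authentication is left out because it needs a cryptography library.

```rust
use std::time::{Duration, Instant};

/// Fields of AuthChallenge as listed in the table above.
pub struct AuthChallenge {
    pub challenge: String,
    pub supported_methods: Vec<String>,
    pub expires_in_ms: u64,
}

/// Pick the first method in the agent's preference order that the control
/// plane also lists in `supported_methods`.
pub fn choose_method<'a>(challenge: &AuthChallenge, preferred: &[&'a str]) -> Option<&'a str> {
    preferred
        .iter()
        .copied()
        .find(|p| challenge.supported_methods.iter().any(|m| m.as_str() == *p))
}

/// True while the challenge may still be answered. The expiry is measured
/// from `received_at`, the instant the agent read the challenge (assumption).
pub fn is_fresh(challenge: &AuthChallenge, received_at: Instant, now: Instant) -> bool {
    now.duration_since(received_at) < Duration::from_millis(challenge.expires_in_ms)
}
```

An agent that finds no common method, or that would answer after the expiry, can close the connection and retry instead of sending a response that is bound to be rejected.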
FlowControl
Backpressure signal for high-volume data.
Signal types:
| Signal | Fields | Description |
|---|---|---|
| resume | - | Resume normal transmission |
| slow_down | min_interval_ms | Reduce transmission rate |
| pause | - | Stop transmission |
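The three signals map naturally onto a small pacing state on the agent side. The following Rust sketch is one possible interpretation: `Pacer` is a hypothetical helper, and treating `slow_down` as a lower bound on the send interval that also lifts an earlier `pause` is an assumption, since the document does not define how signals combine.

```rust
use std::time::Duration;

/// Signals from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FlowSignal {
    Resume,
    SlowDown { min_interval_ms: u64 },
    Pause,
}

/// Hypothetical per-stream pacing state kept by the agent.
#[derive(Debug)]
pub struct Pacer {
    base_interval: Duration,
    min_interval: Option<Duration>,
    paused: bool,
}

impl Pacer {
    pub fn new(base_interval: Duration) -> Self {
        Self { base_interval, min_interval: None, paused: false }
    }

    /// Update the pacing state from a FlowControl signal.
    pub fn apply(&mut self, signal: FlowSignal) {
        match signal {
            FlowSignal::Resume => {
                self.paused = false;
                self.min_interval = None;
            }
            FlowSignal::SlowDown { min_interval_ms } => {
                // Assumption: slow_down also lifts an earlier pause.
                self.paused = false;
                self.min_interval = Some(Duration::from_millis(min_interval_ms));
            }
            FlowSignal::Pause => self.paused = true,
        }
    }

    /// Interval to wait between batches, or None while transmission is paused.
    pub fn send_interval(&self) -> Option<Duration> {
        if self.paused {
            return None;
        }
        Some(match self.min_interval {
            // Never send faster than the control plane asked for.
            Some(min) => self.base_interval.max(min),
            None => self.base_interval,
        })
    }
}
```

Taking the maximum of the base interval and `min_interval_ms` means a `slow_down` can only reduce the rate; a value below the agent's normal interval has no effect.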
HealthCheck
Request a health check from the agent.
Check types:
| Type | Description |
|---|---|
| liveness | Basic process health |
| storage | Storage subsystem health |
| replication | Replication status |
| resources | Disk/memory availability |
| all | All available checks |
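A rough Rust sketch of answering a HealthCheck is shown below. It assumes that `all` expands to the four individual checks, that the overall status is the worst individual result, and that each entry in `checks` pairs a check type with its status; the document does not spell out any of these, and `run_check` is a hypothetical hook.

```rust
use std::time::Instant;

/// Check types from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckType {
    Liveness,
    Storage,
    Replication,
    Resources,
    All,
}

/// Ordered from best to worst so that `max` yields the worst status.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum HealthStatus {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Fields of HealthCheckResponse; the element shape of `checks` is assumed.
pub struct HealthCheckResponse {
    pub request_id: u64,
    pub status: HealthStatus,
    pub checks: Vec<(CheckType, HealthStatus)>,
    pub duration_ms: u64,
}

/// Expand `All` into the individual checks; other types pass through unchanged.
pub fn expand(requested: CheckType) -> Vec<CheckType> {
    match requested {
        CheckType::All => vec![
            CheckType::Liveness,
            CheckType::Storage,
            CheckType::Replication,
            CheckType::Resources,
        ],
        single => vec![single],
    }
}

/// Run the requested checks through `run_check` and report the worst result.
pub fn respond<F>(request_id: u64, requested: CheckType, run_check: F) -> HealthCheckResponse
where
    F: Fn(CheckType) -> HealthStatus,
{
    let started = Instant::now();
    let checks: Vec<(CheckType, HealthStatus)> =
        expand(requested).into_iter().map(|c| (c, run_check(c))).collect();
    let status = checks.iter().map(|(_, s)| *s).max().unwrap_or(HealthStatus::Healthy);
    let duration_ms = started.elapsed().as_millis() as u64;
    HealthCheckResponse { request_id, status, checks, duration_ms }
}
```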
Connection Lifecycle
Authentication
After the WebSocket connection is established:
- Control plane sends AuthChallenge
- Agent responds with AuthResponse containing credentials
- Control plane validates credentials
- On success, connection transitions to authenticated state
Initial Handshake
- Agent connects to WebSocket endpoint
- Authentication exchange (see above)
- Control plane sends HeartbeatRequest
- Agent responds with Heartbeat
- Connection is established
Steady State
- Agent sends Heartbeat every 10 seconds (configurable)
- Agent batches and sends MetricsBatch every 5 seconds
- Agent batches and sends LogsBatch every 5 seconds
- Control plane pushes ConfigUpdate as needed
- Agent includes buffer_stats in heartbeats for backpressure monitoring
- Control plane sends FlowControl when overwhelmed
Reconnection with Exponential Backoff
If the connection drops, agents should:
- Wait with exponential backoff (initial: 1s, max: 60s, multiplier: 2.0)
- Add jitter (±25%) to avoid a thundering herd of simultaneous reconnects
- Reconnect and perform the full authentication handshake
- Resume normal operation
Backoff configuration: the initial delay, maximum delay, multiplier, and jitter listed above are set through BackoffConfig.
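The delay calculation implied by these parameters can be sketched as follows. `BackoffParams` and `reconnect_delay` are hypothetical names used only for this example and are not the crate's BackoffConfig; the random input is passed in by the caller so the function stays deterministic, and applying jitter after the 60-second cap is an assumption.

```rust
use std::time::Duration;

/// Hypothetical parameter set mirroring the values listed above.
#[derive(Debug, Clone, Copy)]
pub struct BackoffParams {
    pub initial: Duration,
    pub max: Duration,
    pub multiplier: f64,
    /// Jitter as a fraction: 0.25 means plus or minus 25%.
    pub jitter: f64,
}

impl Default for BackoffParams {
    fn default() -> Self {
        Self {
            initial: Duration::from_secs(1),
            max: Duration::from_secs(60),
            multiplier: 2.0,
            jitter: 0.25,
        }
    }
}

/// Delay before reconnect attempt `attempt` (0-based).
/// `unit_random` is a value in [0.0, 1.0] supplied by the caller's RNG.
pub fn reconnect_delay(p: &BackoffParams, attempt: u32, unit_random: f64) -> Duration {
    // Exponential growth, capped at the maximum before jitter is applied.
    let base = p.initial.as_secs_f64() * p.multiplier.powi(attempt.min(63) as i32);
    let capped = base.min(p.max.as_secs_f64());
    // Map unit_random from [0, 1] onto [1 - jitter, 1 + jitter].
    let factor = 1.0 + p.jitter * (2.0 * unit_random.clamp(0.0, 1.0) - 1.0);
    Duration::from_secs_f64(capped * factor)
}
```

With the default values the un-jittered delays run 1s, 2s, 4s, and so on up to 60s. Because jitter is applied after the cap in this sketch, an individual delay can reach 75s; clamping after jitter instead would keep every delay at or below 60s.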
Connection States
| State | Description |
|---|---|
| disconnected | Not connected |
| backoff | Waiting before reconnect |
| connecting | Connection in progress |
| connected | Connected, not authenticated |
| authenticated | Ready for normal operation |
| closing | Graceful shutdown in progress |
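These states suggest a small state machine. The Rust sketch below is one plausible reading: the states come from the table, but the `Event` names and the transition rules are assumptions, since the document lists the states without naming what triggers each change.

```rust
/// States from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectionState {
    Disconnected,
    Backoff,
    Connecting,
    Connected,
    Authenticated,
    Closing,
}

/// Hypothetical events; the source names the states but not the triggers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    ConnectRequested,
    BackoffElapsed,
    SocketOpened,
    AuthSucceeded,
    ShutdownRequested,
    ConnectionLost,
}

/// Return the next state, or None if the event is not valid in `state`.
pub fn next_state(state: ConnectionState, event: Event) -> Option<ConnectionState> {
    use ConnectionState::*;
    use Event::*;
    match (state, event) {
        (Disconnected, ConnectRequested) => Some(Connecting),
        (Backoff, BackoffElapsed) => Some(Connecting),
        (Connecting, SocketOpened) => Some(Connected),
        (Connected, AuthSucceeded) => Some(Authenticated),
        (Authenticated, ShutdownRequested) => Some(Closing),
        // A drop during graceful shutdown ends the session for good.
        (Closing, ConnectionLost) => Some(Disconnected),
        // Any other drop schedules a reconnect attempt.
        (Connecting | Connected | Authenticated, ConnectionLost) => Some(Backoff),
        _ => None,
    }
}
```

Returning None for unexpected events, instead of silently ignoring them, makes protocol violations such as a second AuthSucceeded easy to detect and log.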
Health Monitoring
The control plane monitors agent health using these thresholds:
| Metric | Warning | Critical |
|---|---|---|
| Heartbeat timeout | - | 30 seconds |
| Replication lag | 5 seconds | 30 seconds |
| Disk usage | 80% | 95% |
| Memory usage | 85% | 95% |
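To make the thresholds concrete, here is a short Rust sketch of how a control plane might grade a node. The `NodeObservation` struct and function names are hypothetical, and treating each threshold as inclusive (a value equal to the threshold triggers it) is an assumption.

```rust
/// Ordered from least to most severe so that `max` yields the worst grade.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Normal,
    Warning,
    Critical,
}

/// Hypothetical snapshot of the values the thresholds apply to.
#[derive(Debug, Clone, Copy)]
pub struct NodeObservation {
    pub seconds_since_heartbeat: f64,
    /// None for nodes that report no replication status.
    pub replication_lag_ms: Option<u64>,
    pub disk_used_percent: f64,
    pub memory_used_percent: f64,
}

/// Grade one value; thresholds are treated as inclusive (assumption).
fn grade(value: f64, warning: Option<f64>, critical: f64) -> Severity {
    if value >= critical {
        Severity::Critical
    } else if warning.map_or(false, |w| value >= w) {
        Severity::Warning
    } else {
        Severity::Normal
    }
}

/// Worst severity across the four metrics in the table above.
pub fn assess(obs: &NodeObservation) -> Severity {
    let lag_secs = obs.replication_lag_ms.map_or(0.0, |ms| ms as f64 / 1000.0);
    [
        grade(obs.seconds_since_heartbeat, None, 30.0), // no warning level
        grade(lag_secs, Some(5.0), 30.0),
        grade(obs.disk_used_percent, Some(80.0), 95.0),
        grade(obs.memory_used_percent, Some(85.0), 95.0),
    ]
    .iter()
    .copied()
    .max()
    .unwrap_or(Severity::Normal)
}
```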
Using the Protocol
Rust Crate
The kimberlite-agent-protocol crate provides typed definitions:
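use kimberlite_agent_protocol::Heartbeat; // assumed import path; the original import list was lost
// Populate the fields listed under Heartbeat above: node_id, status, role, resources, replication, buffer_stats.
let heartbeat = Heartbeat { /* field values elided in the source */ };
let json = serde_json::to_string(&heartbeat)?; // assumes the types implement serde::Serialize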
Other Languages
The protocol uses standard JSON, so any language with JSON support can implement an agent or control plane. The type definitions in this document serve as the canonical specification.
Versioning
The protocol version is negotiated during the WebSocket handshake via the Sec-WebSocket-Protocol header:
Sec-WebSocket-Protocol: kimberlite-agent-protocol-v1
Breaking changes will increment the version number.
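A control plane that supports several versions could select one from the offered subprotocols along these lines. This Rust sketch assumes that future versions keep the `kimberlite-agent-protocol-vN` naming pattern and that the highest mutually supported version wins; only `v1` is confirmed by this document. The comma-separated format of the header value follows the WebSocket specification.

```rust
/// Prefix shared by all subprotocol tokens (pattern assumed beyond v1).
const SUBPROTOCOL_PREFIX: &str = "kimberlite-agent-protocol-v";

/// Parse one subprotocol token such as "kimberlite-agent-protocol-v1".
pub fn parse_version(token: &str) -> Option<u32> {
    let digits = token.trim().strip_prefix(SUBPROTOCOL_PREFIX)?;
    // Accept plain decimal digits only (rejects "", "+1", "1.0", and so on).
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Pick the highest version that appears both in the comma-separated
/// Sec-WebSocket-Protocol header value and in `supported`.
pub fn negotiate(header_value: &str, supported: &[u32]) -> Option<u32> {
    header_value
        .split(',')
        .filter_map(parse_version)
        .filter(|v| supported.contains(v))
        .max()
}

/// Token to echo back in the handshake response for the chosen version.
pub fn subprotocol_name(version: u32) -> String {
    format!("{}{}", SUBPROTOCOL_PREFIX, version)
}
```

If `negotiate` returns None, the server has no version in common with the agent and can reject the handshake rather than proceed with an unknown protocol.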