Monitoring
Monitor Kimberlite clusters in production.
Overview
Kimberlite provides comprehensive observability through:
- Prometheus metrics - Performance and health metrics
- Structured logging - JSON logs with request tracing
- OpenTelemetry tracing - Distributed request traces
- Health checks - HTTP endpoints for load balancers
Prometheus Metrics
Kimberlite exposes Prometheus-compatible metrics on the configured metrics endpoint (default: :9090).
Accessing Metrics
# Scrape metrics
# Example output
# TYPE kmb_log_entries_total counter
# TYPE kmb_log_bytes_total counter
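The scrape output uses the Prometheus text exposition format, which is simple enough to read programmatically for quick checks. The following is a minimal sketch, not part of Kimberlite: the `parse_metrics` helper and the sample values are hypothetical, and the `/metrics` path mentioned in the comment is an assumption based on Prometheus convention.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabeled samples from Prometheus text exposition format."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like '<name>[{labels}] <value> [timestamp]'.
        name, _, rest = line.partition(" ")
        if "{" in name:
            # Labeled series are skipped to keep this sketch short.
            continue
        samples[name] = float(rest.split()[0])
    return samples


# Hypothetical payload; the numbers are made up for illustration.
# Against a live node the text could be fetched with urllib.request.urlopen,
# assuming the endpoint serves metrics at http://<host>:9090/metrics.
SAMPLE = """\
# TYPE kmb_log_entries_total counter
kmb_log_entries_total 12345
# TYPE kmb_log_bytes_total counter
kmb_log_bytes_total 1048576
"""

metrics = parse_metrics(SAMPLE)
print(metrics["kmb_log_entries_total"])
```

For anything beyond ad hoc checks, a real Prometheus server or client library is the better tool, since it handles labels, histograms, and escaping.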
Core Metrics
Log Metrics:
| Metric | Type | Description |
|---|---|---|
| kmb_log_entries_total | Counter | Total log entries written |
| kmb_log_bytes_total | Counter | Total log bytes written |
| kmb_log_size_bytes | Gauge | Current log size |
Consensus Metrics:
| Metric | Type | Description |
|---|---|---|
| kmb_consensus_commits_total | Counter | Total committed entries |
| kmb_consensus_view | Gauge | Current consensus view number |
| kmb_consensus_leader | Gauge | Current leader node ID |
| kmb_consensus_view_changes_total | Counter | Total view changes |
Projection Metrics:
| Metric | Type | Description |
|---|---|---|
| kmb_projection_applied_position | Gauge | Last applied log position |
| kmb_projection_lag | Gauge | Log position lag (head - applied) |
Query Performance:
| Metric | Type | Description |
|---|---|---|
| kmb_query_duration_seconds | Histogram | Query latency distribution |
| kmb_write_duration_seconds | Histogram | Write latency distribution |
| kmb_requests_total | Counter | Total requests by method and status |
| kmb_requests_in_flight | Gauge | Currently processing requests |
Tenant Metrics:
| Metric | Type | Description |
|---|---|---|
| kmb_active_tenants | Gauge | Number of active tenants |
| kmb_tenant_log_entries_total | Counter | Entries per tenant (labeled by tenant_id) |
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'kimberlite'
    static_configs:
      - targets:
          - 'kimberlite-node1:9090'
          - 'kimberlite-node2:9090'
          - 'kimberlite-node3:9090'
    relabel_configs:
      - source_labels:
        target_label: instance
Example Queries
Throughput:
# Write throughput (ops/sec)
rate(kmb_log_entries_total[1m])
# Write bandwidth (MiB/sec)
rate(kmb_log_bytes_total[1m]) / 1024 / 1024
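To make the throughput queries concrete, here is a simplified sketch of what a per-second rate over counter samples amounts to. It is an illustration under stated assumptions: `simple_rate` is a hypothetical helper, the sample values are invented, and the real PromQL `rate()` additionally extrapolates to the edges of the range window, which this sketch omits.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Average per-second increase of a counter across the given samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter dropped, so it was reset (for example by a restart).
            # Everything counted since the reset is treated as new increase.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical kmb_log_entries_total samples taken at a 15s scrape interval.
steady = [(0, 100), (15, 250), (30, 400), (45, 550), (60, 700)]
print(simple_rate(steady))  # entries per second over the 60s span
```

Because resets are absorbed this way, counters such as `kmb_log_entries_total` should always be wrapped in `rate()` or `increase()` rather than graphed raw.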
Latency:
# P95 write latency
histogram_quantile(0.95, rate(kmb_write_duration_seconds_bucket[5m]))
# P99 query latency
histogram_quantile(0.99, rate(kmb_query_duration_seconds_bucket[5m]))
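The latency quantiles above are estimates interpolated from histogram buckets, not exact percentiles. The sketch below mirrors the classic-histogram interpolation that Prometheus documents, simplified for illustration: the function, the bucket boundaries, and the counts are all hypothetical, and it assumes the bucket list includes the `+Inf` bucket.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)  # ascending by upper bound; +Inf sorts last
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the best
                # available answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative bucket counts for write latency, in seconds.
example = [(0.01, 50), (0.05, 90), (0.1, 98), (0.5, 100), (math.inf, 100)]
print(histogram_quantile(0.99, example))
```

In this example the P99 falls in the wide 0.1s to 0.5s bucket, so the estimate depends heavily on the even-spread assumption. Bucket boundaries close to the latency target give more trustworthy quantiles.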
Health:
# Projection lag (should be near 0)
kmb_projection_lag
# View changes (spikes indicate instability)
rate(kmb_consensus_view_changes_total[5m])
# Leader stability (should be constant)
kmb_consensus_leader
Structured Logging
Kimberlite emits structured JSON logs, one object per line, for machine parsing.
Log Levels
| Level | Usage | Typical Volume |
|---|---|---|
| ERROR | Errors requiring attention | <1/min |
| WARN | Unexpected but recoverable | <10/min |
| INFO | Normal operational events | 10-100/min |
| DEBUG | Detailed debug information | 100-1000/min |
| TRACE | Very verbose (development only) | >1000/min |
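One way to compare a node against the typical volumes in the table is to count log lines per level. The following is a rough sketch only: the `level` and `msg` field names and the sample lines are assumptions for illustration, since the exact log schema is not shown on this page.

```python
import json
from collections import Counter
from typing import Iterable


def count_levels(lines: Iterable[str]) -> Counter:
    """Count JSON log lines by their (assumed) 'level' field."""
    counts: Counter = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Keep going on malformed lines, but make them visible.
            counts["UNPARSEABLE"] += 1
            continue
        if isinstance(record, dict):
            level = str(record.get("level", "UNKNOWN")).upper()
        else:
            level = "UNKNOWN"
        counts[level] += 1
    return counts


# Hypothetical log lines; real field names may differ.
sample_lines = [
    '{"level": "INFO", "msg": "commit applied"}',
    '{"level": "WARN", "msg": "slow write"}',
    '{"level": "info", "msg": "commit applied"}',
    "not json",
]
print(count_levels(sample_lines))
```

Dividing each count by the length of the sampled time window gives a per-minute figure that can be checked against the table.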
Configuration
# Set log level via environment variable
# Per-module logging
# JSON output (default)
Log Aggregation
With Promtail + Loki:
# promtail-config.yml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kimberlite
    static_configs:
      - targets:
          - localhost
        labels:
          job: kimberlite
          __path__: /var/log/kimberlite/*.log
With Fluent Bit:
[INPUT]
    Name tail
    Path /var/log/kimberlite/*.log
    Parser json
[OUTPUT]
    Name es
    Host elasticsearch
    Port 9200
    Index kimberlite
Distributed Tracing
Enable OpenTelemetry tracing for request-level observability:
# config.toml
[telemetry]
tracing_enabled = true
tracing_endpoint = "http://jaeger:14268/api/traces"
Trace Spans
Kimberlite automatically creates spans for:
- kmb.write - Client write path (Prepare → Commit → Apply)
- kmb.query - Query execution
- kmb.consensus.prepare - Consensus prepare phase
- kmb.consensus.commit - Consensus commit phase
- kmb.repair - Log repair operations
- kmb.view_change - View change protocol
Jaeger Configuration
# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "14268:14268" # HTTP collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
Access the Jaeger UI at http://localhost:16686.
Health Checks
Kimberlite provides HTTP health check endpoints for load balancers:
Liveness Check
# 200 OK: Process is running
# 503 Service Unavailable: Process is shutting down
Use for: Kubernetes liveness probes, restart decisions
Readiness Check
# 200 OK: Replica is ready to serve traffic
# 503 Service Unavailable: Replica is not ready (recovering, view change)
Use for: Kubernetes readiness probes, load balancer targets
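For environments without Kubernetes probes, the status codes above can be interpreted by a small external checker. This is a sketch under assumptions: the function names are hypothetical, and the endpoint URL in the comment is a placeholder because the actual health check paths are not shown on this page.

```python
import urllib.error
import urllib.request


def classify_readiness(status: int) -> str:
    """Map a readiness status code to a routing decision."""
    if status == 200:
        return "ready"  # replica can serve traffic
    if status == 503:
        return "not_ready"  # recovering or in a view change
    return "unknown"


def probe(url: str, timeout: float = 2.0) -> str:
    """Query a health endpoint and classify the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_readiness(resp.status)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is still available.
        return classify_readiness(err.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return "unreachable"


# Example call with a placeholder URL (the real path is an assumption):
# print(probe("http://kimberlite-node1:9090/<readiness-path>"))
print(classify_readiness(503))
```

Treating "unreachable" separately from "not_ready" is a deliberate choice here: a replica that answers 503 is alive and may recover by itself, while one that does not answer at all is a candidate for a liveness-driven restart.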
Status Endpoint
Alerting
Recommended Alerts
Critical Alerts (page immediately):
# Cluster lost quorum
- alert: ClusterNoQuorum
  expr: sum(kmb_consensus_leader) == 0
  for: 30s
  annotations:
    summary: "Cluster has no leader"
# High error rate
- alert: HighErrorRate
  expr: rate(kmb_requests_total{status="error"}[5m]) > 10
  for: 5m
  annotations:
    summary: "Error rate > 10/sec"
# Projection lag growing
- alert: ProjectionLagHigh
  expr: kmb_projection_lag > 1000
  for: 5m
  annotations:
    summary: "Projection lagging by >1000 entries"
Warning Alerts (investigate):
# Frequent view changes
# (increase() counts events over the window; rate() would be per second)
- alert: FrequentViewChanges
  expr: increase(kmb_consensus_view_changes_total[10m]) > 1
  for: 15m
  annotations:
    summary: "More than 1 view change per 10 minutes"
# High write latency
- alert: HighWriteLatency
  expr: histogram_quantile(0.99, rate(kmb_write_duration_seconds_bucket[5m])) > 0.1
  for: 10m
  annotations:
    summary: "P99 write latency > 100ms"
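Thresholds on counters are easy to misread because PromQL's `rate()` reports a per-second average, while `increase()` reports a count over the whole window. The arithmetic below is a small worked example with hypothetical helper names, useful for checking that an alert expression and its summary describe the same threshold.

```python
def per_second_threshold(events: float, window_seconds: float) -> float:
    """Convert 'N events per window' into a rate() threshold (events/sec)."""
    return events / window_seconds


def events_per_window(rate_per_second: float, window_seconds: float) -> float:
    """Convert a rate() threshold back into events per window."""
    return rate_per_second * window_seconds


# "1 view change per 10 minutes" expressed as a per-second rate:
print(per_second_threshold(1, 10 * 60))  # about 0.00167 per second

# A rate() threshold of 0.1 expressed as events per 10 minutes:
print(events_per_window(0.1, 10 * 60))  # about 60 events
```

When the intent is a count of events, writing the rule with `increase()` over the matching window usually keeps the expression and its summary in agreement.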
Dashboards
Grafana Dashboard
Import the official Kimberlite dashboard into Grafana.
Key Panels:
- Write throughput (ops/sec)
- P50/P95/P99 latency
- Active tenants
- Projection lag
- View change history
- Error rates
CLI Dashboard
Use the vopr dashboard command for live metrics during testing:
# Web dashboard
# Terminal dashboard
Performance Profiling
CPU Profiling
# Install pprof
# Profile for 30 seconds
# Generate flamegraph
Memory Profiling
# Heap snapshot
Compliance Auditing
For HIPAA/SOC 2 compliance, enable audit log export:
# Export audit log for date range
See Audit Trails for audit log queries.
Monitoring Best Practices
1. Set Up Alerts
# Alert on user-visible impact, not on resource usage
- alert: HighErrorRate # ✅ Good (users are affected)
- alert: CPUHigh # ❌ Bad (a possible cause, not user-visible impact)
2. Monitor Projection Lag
# Should stay near 0 in steady state
kmb_projection_lag < 100
3. Track View Change Rate
# Frequent view changes indicate network issues or crashes
rate(kmb_consensus_view_changes_total[1h]) < 0.01
4. Watch Write Latency
# P99 latency should stay under SLA
histogram_quantile(0.99, rate(kmb_write_duration_seconds_bucket[5m])) < 0.1
5. Log Everything
# Ship logs to centralized aggregation
Troubleshooting Metrics
| Symptom | Metric to Check | Likely Cause |
|---|---|---|
| Slow writes | kmb_write_duration_seconds P99 | Disk I/O, network latency |
| Frequent view changes | kmb_consensus_view_changes_total | Network partition, node crashes |
| Growing projection lag | kmb_projection_lag | CPU bottleneck, slow queries |
| No leader | kmb_consensus_leader == 0 | Cluster lost quorum |
See Troubleshooting Guide for detailed debugging.
Related Documentation
- Configuration Guide - Configure telemetry endpoints
- Deployment Guide - Deploy monitoring stack
- Troubleshooting Guide - Debug production issues
- Instrumentation Design - Technical details
Key Takeaway: Monitor write latency, projection lag, and view change rate. Alert on loss of quorum and high error rates. Export logs for compliance auditing.