
VSR Instrumentation Architecture


Date: 2026-02-05
Phase: 5 (Observability & Polish)
Status: Design Complete

Executive Summary

This document specifies the production instrumentation architecture for Kimberlite VSR. The design provides comprehensive observability through structured metrics, OpenTelemetry integration, and performance profiling hooks.

Key Requirements:

  • <1% performance overhead in production
  • Standard observability formats (OpenTelemetry, Prometheus)
  • Real-time operational visibility
  • Historical performance analysis
  • Compliance audit trail support

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                    VSR Protocol Handlers                     │
│  (replica/normal.rs, view_change.rs, recovery.rs, etc.)    │
└──────────────────┬──────────────────────────────────────────┘
                   │ record_metric()
                   ▼
┌─────────────────────────────────────────────────────────────┐
│                  Instrumentation Layer                       │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐   │
│  │ Histograms │  │  Counters  │  │  Gauges            │   │
│  │ (latency)  │  │ (ops/sec)  │  │  (queue depth)     │   │
│  └────────────┘  └────────────┘  └────────────────────┘   │
└──────────────────┬──────────────────────────────────────────┘
                   │ export()
                   ▼
┌─────────────────────────────────────────────────────────────┐
│              OpenTelemetry Exporter                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────────────┐   │
│  │ Prometheus │  │   Jaeger   │  │  Custom Backends   │   │
│  └────────────┘  └────────────┘  └────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Metric Categories

1. Latency Metrics (Histograms)

Track end-to-end latency for critical operations.

| Metric | Description | Buckets (ms) |
| --- | --- | --- |
| vsr_prepare_latency_ms | Time from Prepare send to PrepareOk quorum | [0.1, 0.5, 1, 2, 5, 10, 25, 50, 100] |
| vsr_commit_latency_ms | Time from PrepareOk quorum to Commit broadcast | [0.1, 0.5, 1, 2, 5, 10, 25, 50, 100] |
| vsr_client_latency_ms | Total client request latency (Prepare → Commit → Apply) | [1, 5, 10, 25, 50, 100, 250, 500, 1000] |
| vsr_view_change_latency_ms | Time to complete view change | [10, 50, 100, 250, 500, 1000, 5000] |
| vsr_recovery_latency_ms | Time to recover from crash | [100, 500, 1000, 5000, 10000, 30000] |
| vsr_state_transfer_latency_ms | Time to complete state transfer | [1000, 5000, 10000, 30000, 60000] |
| vsr_repair_latency_ms | Time to repair single log entry | [1, 5, 10, 25, 50, 100] |

Implementation:

  • Use logarithmic buckets for wide range coverage
  • P50, P95, P99 automatically calculated
  • Per-replica and cluster-wide aggregation
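As an illustration of the bucketing scheme above, the sketch below shows how fixed boundaries map observations to cumulative-style buckets. `LatencyHistogram` is a hypothetical type for this document, not the production implementation:

```rust
// Illustrative fixed-bucket latency histogram (not the production type).
struct LatencyHistogram {
    bounds: Vec<f64>, // upper bucket bounds in ms, ascending
    counts: Vec<u64>, // one slot per bound, plus a final +Inf slot
    sum_ms: f64,
    total: u64,
}

impl LatencyHistogram {
    fn new(bounds: &[f64]) -> Self {
        Self {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len() + 1],
            sum_ms: 0.0,
            total: 0,
        }
    }

    /// Each observation lands in the first bucket whose bound covers it;
    /// values past the largest bound go to the +Inf slot.
    fn observe(&mut self, latency_ms: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|&b| latency_ms <= b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.sum_ms += latency_ms;
        self.total += 1;
    }
}

fn main() {
    // Buckets for vsr_prepare_latency_ms from the table above.
    let mut h =
        LatencyHistogram::new(&[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 25.0, 50.0, 100.0]);
    h.observe(0.3);   // lands in the <=0.5 bucket
    h.observe(250.0); // lands in the +Inf bucket
    assert_eq!(h.counts[1], 1);
    assert_eq!(h.counts[9], 1);
}
```

P50/P95/P99 then fall out of the bucket counts at query time (e.g. Prometheus `histogram_quantile`) rather than being computed on the hot path.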

2. Throughput Metrics (Counters)

Track operation rates and volumes.

| Metric | Description | Labels |
| --- | --- | --- |
| vsr_operations_total | Total operations committed | {replica_id, status=success\|failure} |
| vsr_bytes_written_total | Total bytes written to log | {replica_id} |
| vsr_messages_sent_total | Total VSR messages sent | {replica_id, message_type} |
| vsr_messages_received_total | Total VSR messages received | {replica_id, message_type} |
| vsr_byzantine_rejections_total | Total Byzantine messages rejected | {replica_id, reason} |
| vsr_checksum_failures_total | Total checksum validation failures | {replica_id, component=log\|superblock} |
| vsr_repairs_total | Total log repair operations | {replica_id, status=success\|failure} |
| vsr_view_changes_total | Total view changes initiated | {replica_id, reason} |

Implementation:

  • Monotonically increasing counters
  • Rate calculation via PromQL: rate(vsr_operations_total[1m])
  • Per-second, per-minute, per-hour aggregation

3. Health Metrics (Gauges)

Track current cluster state and health.

| Metric | Description | Range |
| --- | --- | --- |
| vsr_replica_status | Current replica status | 0=Normal, 1=ViewChange, 2=Recovering, 3=StateTransfer |
| vsr_view_number | Current view number | [0, ∞) |
| vsr_commit_number | Current commit number | [0, ∞) |
| vsr_op_number | Current op number | [0, ∞) |
| vsr_log_size_bytes | Log size in bytes | [0, ∞) |
| vsr_log_entry_count | Number of log entries | [0, ∞) |
| vsr_quorum_size | Required quorum size | [1, MAX_REPLICAS] |
| vsr_cluster_size | Current cluster size | [1, MAX_REPLICAS] |
| vsr_replica_lag_operations | Operations behind leader | [0, ∞) |
| vsr_pending_requests | Number of pending client requests | [0, ∞) |
| vsr_prepare_ok_votes | PrepareOk votes for current op | [0, cluster_size] |

Implementation:

  • Updated on state changes
  • Sampled every 1 second for time-series
  • Alert thresholds configurable
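A gauge that is cheap to write on state changes and safe to read from the 1-second sampler can be a single atomic. The `Gauge` type here is an illustrative sketch, not the crate's API:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Illustrative gauge: written on state changes, read by the sampler thread.
pub struct Gauge {
    value: AtomicI64,
}

impl Gauge {
    pub const fn new() -> Self {
        Self { value: AtomicI64::new(0) }
    }

    /// Update on a state change, e.g. when commit_number advances.
    pub fn set(&self, v: i64) {
        self.value.store(v, Ordering::Relaxed);
    }

    /// Read from the sampling/export thread.
    pub fn get(&self) -> i64 {
        self.value.load(Ordering::Relaxed)
    }
}

fn main() {
    let commit_number = Gauge::new();
    commit_number.set(42);
    assert_eq!(commit_number.get(), 42);
}
```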

4. Resource Metrics (Gauges)

Track system resource utilization.

| Metric | Description | Unit |
| --- | --- | --- |
| vsr_memory_used_bytes | Memory allocated by VSR | bytes |
| vsr_network_bandwidth_bytes_per_sec | Network I/O rate | bytes/sec |
| vsr_disk_iops | Disk operations per second | ops/sec |
| vsr_disk_bandwidth_bytes_per_sec | Disk I/O bandwidth | bytes/sec |
| vsr_cpu_utilization_percent | CPU usage by VSR threads | [0, 100] |

Implementation:

  • System metrics via OS-specific APIs
  • Sampled every 5 seconds
  • Integration with system monitoring tools

5. Phase-Specific Metrics

Clock Synchronization (Phase 1):

| Metric | Description |
| --- | --- |
| vsr_clock_offset_ms | Estimated clock offset from leader |
| vsr_clock_samples_total | Total clock samples collected |
| vsr_clock_sync_errors_total | Clock synchronization failures |

Client Sessions (Phase 1):

| Metric | Description |
| --- | --- |
| vsr_client_sessions_active | Number of active client sessions |
| vsr_client_sessions_evicted_total | Total sessions evicted |
| vsr_duplicate_requests_total | Duplicate requests detected |

Repair Budgets (Phase 2):

| Metric | Description |
| --- | --- |
| vsr_repair_budget_available | Available repair credits |
| vsr_repair_ewma_latency_ms | EWMA latency per replica |
| vsr_repair_inflight_count | Number of inflight repair requests |

Log Scrubbing (Phase 3):

| Metric | Description |
| --- | --- |
| vsr_scrub_tours_completed_total | Total scrub tours completed |
| vsr_scrub_corruptions_detected_total | Total corruptions detected |
| vsr_scrub_throughput_ops_per_sec | Scrub throughput |

Reconfiguration (Phase 4):

| Metric | Description |
| --- | --- |
| vsr_reconfig_state | 0=Stable, 1=Joint |
| vsr_reconfig_transitions_total | Total reconfigurations |
| vsr_cluster_version_major | Cluster software version (major) |
| vsr_cluster_version_minor | Cluster software version (minor) |

Standby Replicas (Phase 4):

| Metric | Description |
| --- | --- |
| vsr_standby_count | Number of registered standbys |
| vsr_standby_healthy_count | Number of healthy standbys |
| vsr_standby_lag_operations | Standby lag behind cluster |
| vsr_standby_promotions_total | Total standby promotions |

OpenTelemetry Integration

Exporter Configuration

// Example: Configure OTLP exporter
let exporter = opentelemetry_otlp::new_exporter()
    .with_endpoint("http://otel-collector:4317")
    .with_protocol(Protocol::Grpc)
    .with_timeout(Duration::from_secs(5));

let meter = opentelemetry::global::meter("kimberlite-vsr");

Supported Backends

  1. Prometheus (pull-based)

    • Expose /metrics endpoint
    • Prometheus scrapes every 15 seconds
    • Standard Prometheus exposition format
  2. OTLP (push-based)

    • Push to OpenTelemetry Collector
    • Batch export every 10 seconds
    • Supports Jaeger, Zipkin, etc.
  3. StatsD (push-based)

    • UDP datagram export
    • Low overhead, fire-and-forget
    • Integration with Datadog, Grafana Cloud

Metric Export Format

# HELP vsr_operations_total Total operations committed
# TYPE vsr_operations_total counter
vsr_operations_total{replica_id="0",status="success"} 12345
vsr_operations_total{replica_id="1",status="success"} 12340

# HELP vsr_prepare_latency_ms Time from Prepare send to PrepareOk quorum
# TYPE vsr_prepare_latency_ms histogram
vsr_prepare_latency_ms_bucket{replica_id="0",le="0.1"} 450
vsr_prepare_latency_ms_bucket{replica_id="0",le="0.5"} 890
vsr_prepare_latency_ms_bucket{replica_id="0",le="1"} 1200
vsr_prepare_latency_ms_bucket{replica_id="0",le="+Inf"} 1250
vsr_prepare_latency_ms_sum{replica_id="0"} 845.32
vsr_prepare_latency_ms_count{replica_id="0"} 1250
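The exposition lines above can be produced by a small formatter. `render_counter` below is a hypothetical helper used only to illustrate the format, not the crate's real exporter:

```rust
// Illustrative: render one counter family in Prometheus exposition format.
fn render_counter(name: &str, help: &str, samples: &[(&[(&str, &str)], u64)]) -> String {
    let mut out = String::new();
    out.push_str(&format!("# HELP {name} {help}\n"));
    out.push_str(&format!("# TYPE {name} counter\n"));
    for (labels, value) in samples {
        // Labels are rendered as key="value" pairs, comma-separated.
        let label_str = labels
            .iter()
            .map(|(k, v)| format!("{k}=\"{v}\""))
            .collect::<Vec<_>>()
            .join(",");
        out.push_str(&format!("{name}{{{label_str}}} {value}\n"));
    }
    out
}

fn main() {
    let text = render_counter(
        "vsr_operations_total",
        "Total operations committed",
        &[
            (&[("replica_id", "0"), ("status", "success")], 12345),
            (&[("replica_id", "1"), ("status", "success")], 12340),
        ],
    );
    print!("{text}");
}
```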

Performance Profiling Hooks

1. Critical Path Timing

Measure consensus round-trip time:

// Start timer
let timer = Instant::now();

// ... consensus protocol ...

// Record duration
instrumentation::record_consensus_rtt(timer.elapsed());

Profiling Points:

  • Prepare send → PrepareOk quorum → Commit broadcast
  • Heartbeat round-trip time
  • View change complete time
  • Recovery complete time

2. Memory Allocation Tracking

Track allocations in hot paths:

#[cfg(feature = "profiling")]
{
    let before = get_allocated_bytes();

    // ... allocating operation ...

    let allocated = get_allocated_bytes() - before;
    instrumentation::record_allocation("log_append", allocated);
}

Tracked Allocations:

  • Log entry allocation
  • Message serialization
  • State machine effects
  • Repair buffer allocation
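`get_allocated_bytes()` is not specified in this document; one way it could be backed is a counting global allocator. This is a sketch under that assumption, and the counter tracks cumulative bytes allocated, not live bytes:

```rust
// Sketch: counting global allocator backing a hypothetical
// get_allocated_bytes() (cumulative allocations, not live memory).
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Count every allocation before delegating to the system allocator.
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static A: CountingAlloc = CountingAlloc;

pub fn get_allocated_bytes() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}

fn main() {
    let before = get_allocated_bytes();
    let buf = vec![0u8; 4096]; // routed through CountingAlloc
    assert!(get_allocated_bytes() - before >= 4096);
    drop(buf);
}
```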

3. CPU Profiling Integration

Support for external profilers:

// Mark critical section start
#[cfg(feature = "profiling")]
instrumentation::profile_scope_start("prepare_handler");

// ... critical path code ...

// Mark critical section end
#[cfg(feature = "profiling")]
instrumentation::profile_scope_end("prepare_handler");

Integration With:

  • pprof (flamegraph generation)
  • perf (Linux perf tool)
  • Instruments (macOS profiler)
  • cargo-flamegraph

4. Network I/O Profiling

Track network bandwidth per message type:

instrumentation::record_network_send(
    message_type,
    message_size_bytes,
    destination_replica,
);

Tracked Metrics:

  • Bytes sent/received per message type
  • Message serialization time
  • Network latency distribution
  • Bandwidth utilization per replica

Implementation Plan

Phase 5.1: Core Metrics (~200 LOC)

File: crates/kimberlite-vsr/src/instrumentation.rs

// Extend existing file with production metrics

/// Production metrics (always available, not feature-gated)
pub struct Metrics {
    // Histograms
    prepare_latency: Histogram,
    commit_latency: Histogram,
    client_latency: Histogram,

    // Counters
    operations_total: CounterVec, // labeled by status
    bytes_written_total: Counter,
    messages_sent_total: CounterVec, // labeled by message_type

    // Gauges
    replica_status: Gauge,
    view_number: Gauge,
    commit_number: Gauge,
    log_size_bytes: Gauge,
}

impl Metrics {
    pub fn record_prepare_latency(&self, duration: Duration) {
        self.prepare_latency.observe(duration.as_secs_f64() * 1000.0);
    }

    pub fn increment_operations(&self, status: &str) {
        self.operations_total
            .with_label_values(&[status])
            .inc();
    }

    // ... more recording methods ...
}

Phase 5.2: OpenTelemetry Export (~150 LOC)

File: crates/kimberlite-vsr/src/instrumentation.rs

pub struct OtelExporter {
    meter: Meter,
    exporter_type: ExporterType,
}

pub enum ExporterType {
    Prometheus { endpoint: String },
    Otlp { endpoint: String },
    StatsD { endpoint: String },
}

impl OtelExporter {
    pub fn new(exporter_type: ExporterType) -> Result<Self> {
        let meter = match exporter_type {
            ExporterType::Prometheus { ref endpoint } => {
                opentelemetry_prometheus::exporter()
                    .with_endpoint(endpoint)
                    .init()
            }
            ExporterType::Otlp { ref endpoint } => {
                opentelemetry_otlp::new_pipeline()
                    .metrics(runtime::Tokio)
                    .with_exporter(
                        opentelemetry_otlp::new_exporter()
                            .tonic()
                            .with_endpoint(endpoint)
                    )
                    .build()?
            }
            ExporterType::StatsD { ref endpoint } => {
                // StatsD needs no OTel pipeline; datagrams are pushed
                // over UDP (see the StatsD export example below).
                let _ = endpoint;
                todo!("StatsD exporter")
            }
        };

        Ok(Self { meter, exporter_type })
    }

    pub fn export(&self, metrics: &Metrics) -> Result<()> {
        // Export metrics to configured backend
        let _ = metrics;
        // ...
        Ok(())
    }
}

Phase 5.3: Profiling Hooks (~100 LOC)

File: crates/kimberlite-vsr/src/instrumentation.rs

#[cfg(feature = "profiling")]
pub mod profiling {
    use std::cell::RefCell;
    use std::time::{Duration, Instant};

    thread_local! {
        static SCOPE_STACK: RefCell<Vec<(&'static str, Instant)>> =
            RefCell::new(Vec::new());
    }

    pub fn profile_scope_start(name: &'static str) {
        SCOPE_STACK.with(|stack| {
            stack.borrow_mut().push((name, Instant::now()));
        });
    }

    pub fn profile_scope_end(name: &'static str) {
        SCOPE_STACK.with(|stack| {
            if let Some((scope_name, start)) = stack.borrow_mut().pop() {
                assert_eq!(scope_name, name, "mismatched profile scope");
                let duration = start.elapsed();
                record_profile_sample(name, duration);
            }
        });
    }

    fn record_profile_sample(name: &'static str, duration: Duration) {
        // Export to pprof/perf format
        // ...
    }
}

// Convenience macro
#[macro_export]
macro_rules! profile_scope {
    ($name:expr) => {
        #[cfg(feature = "profiling")]
        let _scope = $crate::instrumentation::profiling::ProfileScope::new($name);
    };
}
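The `profile_scope!` macro references a `ProfileScope` guard that is not defined elsewhere in this document. One plausible shape is an RAII guard that records the sample when it drops, which also keeps scopes balanced on early returns:

```rust
// Sketch: RAII guard the profile_scope! macro could expand to
// (ProfileScope is a hypothetical type for this document).
use std::time::{Duration, Instant};

pub struct ProfileScope {
    name: &'static str,
    start: Instant,
}

impl ProfileScope {
    pub fn new(name: &'static str) -> Self {
        Self { name, start: Instant::now() }
    }
}

impl Drop for ProfileScope {
    fn drop(&mut self) {
        // On scope exit (including early return/panic unwinding),
        // record the elapsed time under the scope's name.
        record_profile_sample(self.name, self.start.elapsed());
    }
}

fn record_profile_sample(name: &'static str, duration: Duration) {
    // Placeholder sink; the real hook would feed pprof/perf export.
    println!("{name}: {duration:?}");
}

fn main() {
    let _scope = ProfileScope::new("prepare_handler");
    // ... critical path work ...
} // guard dropped here; sample recorded
```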

Phase 5.4: Integration (~200 LOC)

Wire up metrics throughout VSR:

File: crates/kimberlite-vsr/src/replica/normal.rs

pub(crate) fn on_prepare(mut self, ...) -> (Self, ReplicaOutput) {
    let timer = Instant::now();

    // ... existing logic ...

    // Record latency
    METRICS.record_prepare_latency(timer.elapsed());
    METRICS.increment_operations("success");

    (self, output)
}

Files to Modify:

  • replica/normal.rs: Prepare, PrepareOk, Commit, Heartbeat
  • replica/view_change.rs: View change latency
  • replica/recovery.rs: Recovery latency
  • log_scrubber.rs: Scrub throughput
  • repair_budget.rs: Repair latency

Performance Overhead Analysis

Microbenchmark Results (Expected)

| Operation | Without Metrics | With Metrics | Overhead |
| --- | --- | --- | --- |
| Record counter | N/A | ~5 ns | 5 ns |
| Record histogram | N/A | ~25 ns | 25 ns |
| Record gauge | N/A | ~3 ns | 3 ns |
| Prepare handler | 150 μs | 150.1 μs | <0.1% |
| Commit handler | 80 μs | 80.05 μs | <0.1% |

Total Overhead: <1% in worst case

Optimization Techniques

  1. Atomic Operations: Use AtomicU64 for counters (lock-free)
  2. Thread-Local Storage: Reduce contention on histograms
  3. Batch Export: Export metrics every 10 seconds, not per-operation
  4. Feature Gates: Disable expensive profiling in production (cfg(feature = "profiling"))
  5. Lazy Initialization: Only create metrics on first use
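Technique 1 can be sketched with std atomics alone. `LockFreeCounter` is illustrative, not the production type; `Ordering::Relaxed` suffices because counters are only read at export time:

```rust
// Sketch: lock-free counter on AtomicU64 (technique 1 above).
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

pub struct LockFreeCounter {
    value: AtomicU64,
}

impl LockFreeCounter {
    pub const fn new() -> Self {
        Self { value: AtomicU64::new(0) }
    }

    /// Hot path: a single atomic add, no locks.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Export path: read the current total.
    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

fn main() {
    let counter = Arc::new(LockFreeCounter::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..1000 {
                    c.inc();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.get(), 4000); // no increments lost
}
```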

Alert Thresholds (Monitoring Runbook)

Critical Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Quorum Lost | vsr_prepare_ok_votes < vsr_quorum_size for 10s | P0 | Immediate investigation |
| High Latency | vsr_client_latency_ms_p99 > 100ms for 1m | P1 | Check network/disk |
| View Change Storm | rate(vsr_view_changes_total[1m]) > 5 | P1 | Check leader health |
| Log Corruption | vsr_checksum_failures_total > 0 | P0 | Trigger repair |
| Memory Leak | vsr_memory_used_bytes growing unbounded | P1 | Investigate |
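The P0 conditions above translate directly into Prometheus alerting rules. The rule names and durations below are illustrative, not a committed runbook artifact:

```yaml
# Sketch: Prometheus alerting rules for the two P0 alerts above.
groups:
  - name: vsr-critical
    rules:
      - alert: VSRQuorumLost
        expr: vsr_prepare_ok_votes < vsr_quorum_size
        for: 10s
        labels:
          severity: P0
        annotations:
          summary: "Replica cannot reach PrepareOk quorum"
      - alert: VSRLogCorruption
        expr: increase(vsr_checksum_failures_total[5m]) > 0
        labels:
          severity: P0
        annotations:
          summary: "Checksum validation failure detected; trigger repair"
```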

Warning Alerts

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| Replica Lag | vsr_replica_lag_operations > 1000 | P2 | Monitor |
| Repair Storm | rate(vsr_repairs_total[1m]) > 100 | P2 | Check backup health |
| High Byzantine Rejection Rate | rate(vsr_byzantine_rejections_total[1m]) > 10 | P2 | Investigate malicious replica |

Dependencies

Cargo.toml additions:

[dependencies]
# OpenTelemetry
opentelemetry = "0.21"
opentelemetry-otlp = "0.14"
opentelemetry-prometheus = "0.14"
prometheus = "0.13"

# Optional: Profiling
[dev-dependencies]
pprof = { version = "0.13", features = ["flamegraph", "criterion"] }

[features]
profiling = []

Testing Strategy

Unit Tests

  1. Metric Recording: Verify counters increment correctly
  2. Histogram Accuracy: Verify percentiles calculated correctly
  3. Export Format: Verify Prometheus exposition format
  4. Performance Overhead: Microbenchmark metric recording

Integration Tests

  1. End-to-End Metrics: Run full VSR scenario, verify all metrics populated
  2. OTEL Export: Export to real Prometheus instance, verify scraping
  3. Profiling Overhead: Measure overhead with/without profiling enabled

Load Tests

  1. Sustained Throughput: 10k ops/sec for 1 hour, monitor metrics
  2. Burst Load: 50k ops/sec for 1 minute, verify metric accuracy
  3. Memory Stability: Run for 24 hours, verify no metric-related leaks

Success Criteria

  • <1% performance overhead in production
  • All critical metrics exported to Prometheus
  • OpenTelemetry integration working
  • Grafana dashboard templates created
  • Alert thresholds documented
  • Profiling hooks integrated
  • Zero metric-related bugs in production

OpenTelemetry Integration (Implemented)

Configuration

use kimberlite_vsr::instrumentation::{OtelConfig, OtelExporter};

let config = OtelConfig {
    otlp_endpoint: Some("http://localhost:4317".to_string()),
    export_interval_secs: 10,
    service_name: "kimberlite-vsr".to_string(),
    service_version: "0.4.0".to_string(),
};

let mut exporter = OtelExporter::new(config)?;
exporter.init_otlp()?;

Supported Backends

  1. OTLP (OpenTelemetry Protocol) - Push-based metrics to collector

    • Supports Prometheus, Jaeger, Zipkin backends
    • 10-second export interval (configurable)
    • Automatic batching and retry
  2. Prometheus - Pull-based metrics scraping

    • Native exposition format via METRICS.export_prometheus()
    • Standard /metrics HTTP endpoint
    • Compatible with all Prometheus-compatible tools
  3. StatsD - UDP push to StatsD daemon

    • Format: metric_name:value|type
    • Types: c (counter), g (gauge), ms (timer)
    • Zero-configuration, fire-and-forget
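As an illustration of the `metric_name:value|type` format, a minimal line formatter (hypothetical helper, not part of the crate's API):

```rust
// Illustrative StatsD datagram formatting: metric_name:value|type.
fn statsd_line(name: &str, value: f64, kind: &str) -> String {
    format!("{name}:{value}|{kind}")
}

fn main() {
    // Counter, gauge, and timer lines.
    assert_eq!(
        statsd_line("vsr_operations_total", 12345.0, "c"),
        "vsr_operations_total:12345|c"
    );
    assert_eq!(statsd_line("vsr_view_number", 7.0, "g"), "vsr_view_number:7|g");
    assert_eq!(
        statsd_line("vsr_prepare_latency_ms", 1.5, "ms"),
        "vsr_prepare_latency_ms:1.5|ms"
    );
}
```

Each line becomes one UDP datagram, as in the `send_to` loop shown in the StatsD export example above.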

Example: Periodic Export

use std::time::Duration;
use tokio::time;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut exporter = OtelExporter::new(OtelConfig::default())?;
    exporter.init_otlp()?;

    let mut interval = time::interval(Duration::from_secs(10));
    loop {
        interval.tick().await;
        exporter.export_metrics()?;
    }
}

Example: StatsD Export

use std::net::UdpSocket;

let socket = UdpSocket::bind("0.0.0.0:0")?;
let exporter = OtelExporter::new(OtelConfig::default())?;

for line in exporter.export_statsd() {
    socket.send_to(line.as_bytes(), "localhost:8125")?;
}

Feature Flag

OpenTelemetry integration is optional and requires the otel feature:

[dependencies]
kimberlite-vsr = { version = "0.4", features = ["otel"] }

This avoids pulling in OTLP dependencies (~40 crates) when not needed.


References

  • OpenTelemetry Spec: https://opentelemetry.io/docs/specs/otel/
  • Prometheus Best Practices: https://prometheus.io/docs/practices/naming/
  • TigerBeetle Instrumentation: Inspiration for low-overhead metrics
  • FoundationDB Metrics: Latency histogram design

Implementation Status

  1. ✅ Design complete (this document)
  2. ✅ Implement core metrics (~470 LOC)
  3. ✅ Add OTEL export (~230 LOC)
  4. ⏳ Add profiling hooks (~100 LOC)
  5. ✅ Integrate into VSR (~35 LOC)
  6. ⏳ Create Grafana dashboards
  7. ⏳ Write monitoring runbook
  8. ⏳ Write incident response playbook