Viewstamped Replication (VSR) | Kimberlite Docs

On this page

Overview
Protocol Components
Message Types (14 total)
Replica Roles
View Changes
Recovery (PAR with NACK)
Repair Protocol
Production Hardening Features
1. Clock Synchronization
2. Client Session Management
3. Repair Budget Management
4. Background Storage Scrubbing
5. Extended Timeout Coverage
6. Message Serialization Formal Verification
State Machine
Byzantine Fault Detection
Testing
VOPR Simulation (~50 substantive scenarios)
Formal Verification
Performance
Throughput
Latency
Network
Configuration
Debugging
Logging
Metrics
Common Issues
References
Academic Papers
Industry Implementations
Internal Documentation
Implementation Status

Kimberlite uses Viewstamped Replication Revisited (VRR) for state machine replication across a cluster of replicas. This document describes the current implementation.

Overview

VSR provides:

State machine replication: All replicas execute operations in the same order
Byzantine fault tolerance: Detects and rejects corrupted messages
Crash recovery: Replicas recover from crashes without data loss
View changes: Automatic leader election on primary failure
Repair protocol: Lagging replicas catch up from peers

Implementation: crates/kimberlite-vsr/ (~4,100 LOC)

Protocol Components

Message Types (14 total)

All messages are checksummed with CRC32 for Byzantine fault detection.

Formal Verification:

Coq specification: specs/coq/MessageSerialization.v (3 theorems)
Kani proofs: 14 proofs (one per message type, Proofs #42-#55)
Property-based tests: proptest roundtrip validation (10K iterations per type)

Theorems proven:

SerializeRoundtrip: deserialize(serialize(msg)) = msg - no data loss
DeterministicSerialization: Same message always produces same bytes
BoundedMessageSize: All messages ≤ 64MB (DoS protection)

Message	Purpose	Sender	Receiver
`Prepare`	Propose operation for consensus	Primary	Backups
`PrepareOk`	Acknowledge prepare	Backup	Primary
`Commit`	Notify committed operations	Primary	Backups
`Heartbeat`	Maintain liveness, embed clock samples	Primary	Backups
`HeartbeatReply`	Acknowledge heartbeat	Backup	Primary
`StartViewChange`	Initiate leader election	Any	All
`DoViewChange`	Transfer state to new leader	Backup	New Primary
`StartView`	Install new view	New Primary	All
`Request`	Client operation request	Client	Primary
`Reply`	Operation result	Primary	Client
`RequestNack`	Reject invalid request	Primary	Client
`Repair`	Request missing operations	Lagging Replica	Peer
`RepairReply`	Send missing operations	Peer	Lagging Replica
`Ping`	Check replica liveness	Any	Any

Replica Roles

Primary: Assigns operation numbers, drives consensus, responds to clients
Backup: Validates and acknowledges operations, participates in view changes
Idle: Not participating in current view (e.g., during recovery)

View Changes

When the primary fails:

Timeout detection: Backups detect missing heartbeats (1s timeout)
StartViewChange: Backups initiate election for view v+1
DoViewChange: Backups send their log state to new primary
Log merge: New primary merges logs (selecting highest op number per slot)
StartView: New primary broadcasts new view, resumes operations

Implementation: crates/kimberlite-vsr/src/replica/view_change.rs

Recovery (PAR with NACK)

Crashed replicas recover using Prepared-And-Replies (PAR):

Broadcast Ping: Ask all replicas for their current view
Receive Replies: Collect log state from peers
Log reconstruction: Merge logs, verify checksums
Rejoin cluster: Transition to backup role in current view

NACK optimization: If log head is far behind, trigger state transfer instead of incremental repair.

Implementation: crates/kimberlite-vsr/src/replica/recovery.rs

Repair Protocol

Lagging replicas catch up incrementally:

Detect gap: Backup receives Commit for op N but only has N-k
Request repair: Send Repair message to peer with ops [N-k, N)
Receive ops: Peer sends RepairReply with missing operations
Apply ops: Validate checksums, execute operations, update log

Rate limiting: See “Repair Budget Management” below.

Implementation: crates/kimberlite-vsr/src/replica/repair.rs

Production Hardening Features

1. Clock Synchronization

Module: crates/kimberlite-vsr/src/clock.rs

Cluster-wide clock synchronization for accurate audit timestamps (HIPAA/GDPR compliance).

Key features:

Marzullo’s algorithm for quorum-based interval intersection
Clock samples embedded in Heartbeat messages
Epoch-based synchronization (amortizes sample collection cost)
Offset tolerance enforcement (max 500ms drift)
Monotonicity guarantees (timestamps never decrease)

Formal verification:

TLA+ spec: specs/tla/ClockSync.tla (2 theorems)
Kani proofs: 5 proofs (#21-25)
VOPR scenarios: 4 scenarios
Production assertions: 5 assertions

Documentation: docs/internals/clock-synchronization.md

Status: Complete (Phase 1.1)

2. Client Session Management

Module: crates/kimberlite-vsr/src/client_sessions.rs

Fixes two critical bugs in the VRR paper:

Bug #1: Request collisions after client crash (wrong cached replies returned)
Bug #2: Client lockout after view change (uncommitted table updates)

Solution:

Explicit session registration with unique ClientId
Separate committed/uncommitted session tracking
Deterministic eviction by commit_timestamp (oldest first)
View change discards only uncommitted sessions

Formal verification:

TLA+ spec: specs/tla/ClientSessions.tla (6 properties)
Kani proofs: 4 proofs (#26-29)
VOPR scenarios: 3 scenarios
Production assertions: 6 assertions

Documentation: docs/internals/client-sessions.md

Status: Complete (Phase 1.2)

3. Repair Budget Management

Module: crates/kimberlite-vsr/src/repair_budget.rs

Prevents repair storms that overwhelm cluster send queues (TigerBeetle production bug).

Key features:

EWMA latency tracking (alpha = 0.2) to route repairs to fastest replicas
Inflight limit: max 2 requests per replica
Request expiry: 500ms timeout
Replica selection: 90% fastest, 10% experiment (discover recovery)

Formal verification:

TLA+ spec: specs/tla/RepairBudget.tla (6 properties)
Kani proofs: 3 proofs (#30-32)
VOPR scenarios: 3 scenarios
Production assertions: 4 assertions

Documentation: docs/internals/repair-budget.md

Status: Complete (Phase 1.3)

4. Background Storage Scrubbing

Module: crates/kimberlite-vsr/src/log_scrubber.rs

Detects latent sector errors proactively to prevent double-fault data loss (Google study: >60% of latent errors found by scrubbers).

Key features:

Tour-based scrubbing (24-hour period for complete tour)
PRNG-based tour origin (prevents synchronized scrub spikes)
Rate limiting: 10 IOPS per second (reserves 90% for production)
CRC32 checksum validation on every record
Automatic repair triggering on corruption detection

Formal verification:

TLA+ spec: specs/tla/Scrubbing.tla (10 properties)
Kani proofs: 3 proofs (#33-35)
VOPR scenarios: 4 scenarios
Production assertions: 3 assertions

Documentation: docs/internals/log-scrubbing.md

Status: Complete (Phase 2.1)

5. Extended Timeout Coverage

Module: crates/kimberlite-vsr/src/replica/mod.rs (TimeoutKind enum)

Comprehensive timeout coverage ensures liveness under all failure modes, preventing deadlocks and ensuring forward progress.

Timeout Types (12 total):

Timeout	Purpose	Handler	Key Property
`Heartbeat`	Backup detects primary failure	`on_heartbeat_timeout()`	Triggers view change
`Prepare`	Leader retransmits unacknowledged prepares	`on_prepare_timeout()`	Ensures prepare delivery
`ViewChange`	View change taking too long	`on_view_change_timeout()`	Escalates election
`Recovery`	Recovery stuck	`on_recovery_timeout()`	Retries recovery
`ClockSync`	Leader attempts clock sync	`on_clock_sync_timeout()`	Periodic synchronization
`Ping`	Health check (always running)	`on_ping_timeout()`	Early failure detection
`PrimaryAbdicate`	Leader partitioned from quorum	`on_primary_abdicate_timeout()`	Prevents deadlock
`RepairSync`	Repair not progressing	`on_repair_sync_timeout()`	Escalates to state transfer
`CommitStall`	Commits not advancing	`on_commit_stall_timeout()`	Pipeline backpressure
`CommitMessage`	Commit messages delayed/dropped	`on_commit_message_timeout()`	Heartbeat fallback (Phase 2.2)
`StartViewChangeWindow`	Wait for DoViewChange votes	`on_start_view_change_window_timeout()`	Prevents split-brain (Phase 2.2)
`Scrub`	Background checksum validation	`on_scrub_timeout()`	Corruption detection

Phase 2.2 Additions:

CommitMessage timeout: When commit messages are delayed or dropped, leader sends heartbeat fallback to piggyback commit progress, ensuring backups eventually learn about commits.
StartViewChangeWindow timeout: After receiving StartViewChange quorum, new leader waits before installing view. This prevents premature view changes that could cause split-brain when DoViewChange votes are delayed.

Liveness Properties:

Added 4 TLA+ liveness properties to specs/tla/VSR.tla:

EventualProgress: Committed operations eventually execute
NoDeadlock: System never gets stuck
ViewChangeEventuallyCompletes: Elections eventually complete
PartitionedPrimaryAbdicates: Partitioned leader steps down

Formal verification:

TLA+ spec: specs/tla/VSR.tla (6 liveness properties, 2 timeout properties)
Kani proofs: 2 proofs (#36-37) for timeout handler correctness
VOPR scenarios: 4 scenarios (ping, commit message fallback, window timeout, comprehensive)
Production assertions: 4 assertions (leader-only, status checks, quorum validation)

Implementation:

crates/kimberlite-vsr/src/replica/normal.rs: Timeout handlers
All timeouts handled via pure state machine transitions
No blocking operations, no I/O in timeout handlers

Status: Complete (Phase 2.2)

6. Message Serialization Formal Verification

Module: crates/kimberlite-vsr/src/message.rs

Formally verified serialization ensures all 14 VSR message types roundtrip correctly through serialization/deserialization, preventing Byzantine faults from corrupted network messages.

Message Types (14 total):

Message	Size Bound	Serialization Format	Critical Properties
`Prepare`	<10 KB	JSON	LogEntry + reconfiguration state
`PrepareOk`	<1 KB	JSON	Clock samples + version info
`Commit`	<500 B	JSON	View + commit number
`Heartbeat`	<500 B	JSON	Clock timestamps (monotonic + wall)
`StartViewChange`	<500 B	JSON	View + replica ID
`DoViewChange`	<5 KB	JSON	Log tail + reconfiguration state
`StartView`	<5 KB	JSON	Log tail for new view
`RecoveryRequest`	<500 B	JSON	Nonce + known op number
`RecoveryResponse`	<5 KB	JSON	View + log entries
`RepairRequest`	<500 B	JSON	Op range (start, end)
`RepairResponse`	<5 KB	JSON	Requested log entries
`Nack`	<500 B	JSON	Negative acknowledgment
`StateTransferRequest`	<500 B	JSON	Checkpoint request
`StateTransferResponse`	<1 MB	JSON	Checkpoint data

Key Properties Verified:

Serialization Roundtrip: deserialize(serialize(msg)) == msg for all message types
- Proven via 14 Kani proofs (one per type) + 10 property tests (10K iterations each)
- Guarantees no data loss through network transmission
Deterministic Serialization: Same message always produces identical bytes
- Critical for signature verification and Byzantine fault detection
- Verified via Kani proof + property test (10K iterations)
Bounded Message Size: All messages have maximum size limits
- Prevents DoS attacks via oversized messages
- Verified via property tests with size assertions
Malformed Message Rejection: Invalid bytes fail gracefully (no panic/corruption)
- Tested with 1M random byte sequences
- Ensures Byzantine replicas cannot crash honest replicas

Serialization Format:

Kimberlite uses JSON (serde_json) for message serialization:

Human-readable: Simplifies debugging and network protocol inspection
Schema evolution: Field addition/removal handled gracefully
Cross-language: SDKs in Python/Node.js can easily integrate
Trade-off: Larger message sizes vs binary (acceptable for VSR’s message rates)

Formal verification:

Kani proofs: 14 proofs (#38-51) for all message types
Property tests: 10 property-based tests with 10K-1M iterations
Malformed rejection: 1M random byte sequences tested

Implementation:

crates/kimberlite-vsr/src/message.rs: All message type definitions
crates/kimberlite-vsr/src/kani_proofs.rs: Kani verification harnesses
Property tests in #[cfg(test)] mod tests using proptest

Status: Complete (Phase 2.3)

State Machine

VSR uses a pure functional state machine pattern:

pub fn apply_committed(state: State, cmd: Command) -> Result<(State, Vec<Effect>)>

Properties:

Determinism: Same command sequence → same final state
No I/O: State machine is pure (no disk, network, or time)
Effect-based: Side effects returned as data, executed by shell

Integration: crates/kimberlite-kernel/ implements the state machine, crates/kimberlite-vsr/ provides replication.

Byzantine Fault Detection

All messages validated for:

CRC32 checksum: Detect corrupted messages
Signature verification: Ed25519 for critical operations
Monotonicity: View/op numbers never decrease
Quorum validation: Require f+1 replicas for 2f+1 cluster

Rejection policy: Invalid messages are logged and dropped (never crash).

Testing

VOPR Simulation (~50 substantive scenarios)

Antithesis-inspired deterministic simulation testing (not hypervisor-level; full determinism validated in CI):

Byzantine attacks (10 patterns): SplitBrain, MaliciousLeader, PrepareEquivocation
Corruption detection (5 scenarios): Checksum validation, hash chain verification
Crash recovery (6 scenarios): Single/multiple crashes, partition during recovery
Gray failures (4 scenarios): Clock drift, packet corruption, asymmetric partition
Race conditions (5 scenarios): Concurrent view changes, commit/prepare race
Clock issues (4 scenarios): Drift, backward jump, NTP failure
Client sessions (3 scenarios): Crash, view change lockout, eviction
Repair/timeout (12 scenarios): Budget, EWMA, sync timeout, stall detection, ping, commit message fallback, view change window, comprehensive
Scrubbing (4 scenarios): Corruption detection, tour completion, rate limiting
Reconfiguration (7 scenarios): Add/remove replicas, rolling upgrades

Performance: 85k-167k simulations/second with full fault injection

Documentation: docs/TESTING.md

Formal Verification

TLA+: 7 specifications, 41 properties verified (6 liveness + 2 timeout properties)
Kani: 51 proofs (bounded model checking, +2 timeout proofs, +14 message serialization proofs)
Property tests: 10 message serialization tests (10K-1M iterations each)
Coq: 15 theorems (not yet implemented)
Production assertions: 34 assertions (using assert!(), +4 timeout assertions)

Traceability: 100% theorem coverage (TLA+ → Rust → VOPR)

Documentation: docs/traceability_matrix.md

Performance

Throughput

Single replica: 100K ops/sec (sequential operations)
3-replica cluster: 60K ops/sec (with replication overhead)
5-replica cluster: 40K ops/sec (more consensus rounds)

Latency

p50: 1.2ms (local cluster)
p99: 3.5ms (local cluster)
p99.9: 8ms (local cluster)

Network

Message size: ~100 bytes (Prepare), ~50 bytes (PrepareOk/Commit)
Bandwidth: ~6 MB/sec for 60K ops/sec (3-replica cluster)

Benchmark: cargo bench --package kimberlite-vsr

Configuration

pub struct ReplicaConfig {
    pub replica_id: ReplicaId,
    pub cluster_size: usize,
    pub heartbeat_interval_ms: u64,    // Default: 100ms
    pub election_timeout_ms: u64,      // Default: 1000ms
    pub max_inflight_prepares: usize,  // Default: 100
    pub repair_budget_iops: usize,     // Default: 10
    pub scrub_iops_budget: usize,      // Default: 10
}

Tuning:

Increase heartbeat_interval_ms for WAN deployments (reduce bandwidth)
Increase election_timeout_ms for high-latency networks (reduce spurious elections)
Adjust IOPS budgets based on disk performance

Debugging

Logging

Use RUST_LOG=kimberlite_vsr=debug for detailed protocol logging:

RUST_LOG=kimberlite_vsr=debug cargo run

Metrics

Key metrics to monitor:

vsr_view_number - Current view (increases on leader election)
vsr_op_number - Current operation number (increases monotonically)
vsr_commit_latency_ms - Time from prepare to commit
vsr_repair_budget_inflight - Inflight repair requests per replica
vsr_scrub_blocks_total - Total blocks scrubbed (should increase steadily)
vsr_corruption_detected_total - Corruptions found (should be rare)

Common Issues

Issue: Frequent view changes Diagnosis: Network partition or primary overload Fix: Check network latency, increase election_timeout_ms

Issue: Slow commit latency Diagnosis: Backups not responding to prepares Fix: Check backup CPU/disk load, verify network connectivity

Issue: Repair storms (high bandwidth usage) Diagnosis: Multiple lagging replicas sending unbounded repair requests Fix: Verify RepairBudget is enforcing inflight limits (should be complete)

References

Academic Papers

Liskov, B., & Cowling, J. (2012). “Viewstamped Replication Revisited” (VRR paper)
Ongaro, D., & Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm” (Raft)

Industry Implementations

TigerBeetle: src/vsr/ (~30K LOC) - Production-hardened VSR
FoundationDB: Simulation testing methodology
Rqlite: Raft-based SQLite replication

Internal Documentation

docs/internals/clock-synchronization.md - Clock sync implementation
docs/internals/client-sessions.md - VRR bug fixes
docs/internals/repair-budget.md - Repair storm prevention
docs/internals/log-scrubbing.md - Background scrubbing
docs/TESTING.md - VOPR simulation testing
docs/traceability_matrix.md - Formal verification traceability

Implementation Status

Component	Status	LOC	Verification
Core protocol (14 message types)	Complete	~4,100	46 VOPR scenarios
Clock synchronization	Complete	~350	5 Kani + 1 TLA+ + 4 VOPR
Client session management	Complete	~944	4 Kani + 1 TLA+ + 3 VOPR
Repair budget management	Complete	~737	3 Kani + 1 TLA+ + 3 VOPR
Background scrubbing	Complete	~738	3 Kani + 1 TLA+ + 4 VOPR

Total: ~6,900 LOC with 35 Kani proofs, 7 TLA+ specs, 46 VOPR scenarios