Consensus - Viewstamped Replication (VSR)

Kimberlite uses Viewstamped Replication (VSR) for consensus, the same protocol used by TigerBeetle.

What is Consensus?

Consensus ensures multiple replicas agree on the order of operations, even when some replicas fail or networks partition.

Without consensus:

  • Replica A thinks operation X happened first
  • Replica B thinks operation Y happened first
  • Divergent state → data corruption

With consensus:

  • All replicas agree: X, then Y
  • Same order → same state

Why VSR?

Kimberlite uses VSR instead of Raft or Paxos for several reasons:

Property           VSR             Raft               Paxos
Simplicity         Fewer states    More complex       Very complex
Determinism        Explicit view   Explicit term      Implicit rounds
Recovery           Strong repair   Log catch-up only  Weak
Production proven  TigerBeetle     Many systems       Google (Chubby)

Key advantage: VSR was designed for state machine replication—exactly Kimberlite’s use case.

Cluster Topology

A Kimberlite cluster consists of 2f + 1 replicas to tolerate f failures:

f=1 (3 replicas):  Can tolerate 1 failure
f=2 (5 replicas):  Can tolerate 2 failures (typical production)
f=3 (7 replicas):  Can tolerate 3 failures (high availability)

Example 3-replica cluster: one Primary (P) and two Backups (B) form the consensus group. The Primary coordinates all operations and assigns their order; Backups replicate and monitor.
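
The sizing arithmetic is easy to encode. A minimal sketch (the helper names are illustrative, not Kimberlite's API):

fn max_failures(replicas: u64) -> u64 {
    // An n-replica cluster tolerates f = (n - 1) / 2 failures.
    (replicas - 1) / 2
}

fn quorum(replicas: u64) -> u64 {
    // A quorum is any majority: n / 2 + 1 replicas.
    replicas / 2 + 1
}

fn main() {
    assert_eq!((max_failures(3), quorum(3)), (1, 2));
    assert_eq!((max_failures(5), quorum(5)), (2, 3));
    assert_eq!((max_failures(7), quorum(7)), (3, 4));
}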

Normal Operation (Happy Path)

When everything works, VSR is simple. Figure 1 summarizes the full message sequence for a single committed write:

Fig. 1 VSR normal operation — Request → Prepare → PrepareOK → Commit → Apply → Reply.

Steps:

  1. Client Request: Client sends command to primary
  2. Prepare: Primary assigns position in log, broadcasts Prepare to backups
  3. PrepareOK: Backups acknowledge with PrepareOK
  4. Commit: Primary receives quorum (f+1), broadcasts Commit
  5. Apply: All replicas apply the committed command
  6. Reply: Primary sends result to client

Key properties:

  • Quorum: Need f+1 PrepareOK messages (majority)
  • Order: Primary assigns sequential positions
  • Durability: Command must be on disk before PrepareOK
  • Determinism: All replicas apply commands in identical order
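
The quorum rule above is what makes commit safe. A minimal sketch of the primary's ack counting, using illustrative types rather than Kimberlite's actual internals:

use std::collections::HashSet;

/// One in-flight operation awaiting a quorum (illustrative).
struct Inflight {
    position: u64,
    acks: HashSet<u8>, // IDs of replicas whose copy is on disk (the primary counts itself)
}

impl Inflight {
    /// Record a PrepareOK; returns true once f + 1 replicas hold the
    /// operation durably, at which point commit is safe.
    fn on_prepare_ok(&mut self, replica_id: u8, f: usize) -> bool {
        self.acks.insert(replica_id);
        self.acks.len() >= f + 1
    }
}

Once on_prepare_ok returns true, the primary broadcasts Commit (step 4) and replies to the client (step 6).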

View Changes (Failure Handling)

When the primary fails, backups elect a new primary:

View 0:                View 1:
P, B1, B2              B1 (new P), B2, (old P offline)

View change protocol:

  1. Timeout: Backup detects primary failure (no heartbeat)
  2. Start View Change: Backup broadcasts StartViewChange
  3. Do View Change: Replicas send their state to the new primary
  4. Start View: New primary synchronizes replicas, begins accepting requests

Critical invariant: View changes preserve all committed operations.
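
A minimal sketch of step 1, the failure detector that kicks off a view change (field and method names are assumptions for illustration):

struct Backup {
    view: u64,
    last_heartbeat_ms: u64,
}

impl Backup {
    /// Called on a timer. If the primary has been silent past the timeout,
    /// advance to the next view; the caller then broadcasts StartViewChange.
    fn check_primary(&mut self, now_ms: u64, timeout_ms: u64) -> Option<u64> {
        if now_ms.saturating_sub(self.last_heartbeat_ms) > timeout_ms {
            self.view += 1;
            Some(self.view)
        } else {
            None
        }
    }
}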

Why Views Matter

Views provide explicit epochs:

struct Operation {
    view: View,      // Which primary assigned this operation
    position: u64,   // Position within that view
}

This allows replicas to detect stale messages:

  • Message from view=5 arrives when cluster is in view=7 → ignored
  • Prevents split-brain scenarios
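
The check itself is a single comparison against the replica's current view. A sketch with illustrative names:

use std::cmp::Ordering;

enum Verdict {
    Process, // same view: handle normally
    Stale,   // older view: ignore (prevents split-brain)
    Behind,  // newer view: this replica missed a view change and must catch up
}

fn classify(message_view: u64, current_view: u64) -> Verdict {
    match message_view.cmp(&current_view) {
        Ordering::Less => Verdict::Stale,
        Ordering::Equal => Verdict::Process,
        Ordering::Greater => Verdict::Behind,
    }
}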

Repair Mechanisms

VSR includes mechanisms to repair replicas that have diverged:

1. Log Repair (Small Gaps)

Scenario: Backup missed a few messages.

Primary:  [Op 1] [Op 2] [Op 3] [Op 4]
Backup:   [Op 1] [Op 2] [  ?  ] [  ?  ]
                          ↑
                      RequestRepair

Solution: Backup requests missing operations, primary sends them.

2. State Transfer (Large Gaps)

Scenario: Backup is far behind (e.g., after extended downtime).

Primary:  [Op 1...10000]
Backup:   [Op 1...100]  (9900 operations behind)

Solution: Primary sends a snapshot + recent log tail.

3. Nack Protocol (Message Loss)

Scenario: Backup detects gap in sequence numbers.

Backup receives: Op 5, Op 7 (where's Op 6?)
Backup sends: Nack(6)
Primary resends: Op 6
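
A backup can choose between these mechanisms from two observations: whether it saw a sequence gap, and how far it lags the primary's head. A sketch; the threshold constant is an assumption, not a Kimberlite parameter:

enum Repair {
    Nack { missing: u64 },   // ask the primary to resend one operation
    LogRepair { from: u64 }, // small gap: request the missing operations
    StateTransfer,           // large gap: snapshot + recent log tail
}

fn choose_repair(backup_head: u64, primary_head: u64, gap: Option<u64>) -> Repair {
    const LOG_REPAIR_LIMIT: u64 = 1_000; // hypothetical cutover to state transfer
    if let Some(missing) = gap {
        return Repair::Nack { missing };
    }
    let behind = primary_head.saturating_sub(backup_head);
    if behind <= LOG_REPAIR_LIMIT {
        Repair::LogRepair { from: backup_head + 1 }
    } else {
        Repair::StateTransfer
    }
}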

Single-Node Mode (Development)

For development and testing, Kimberlite supports single-node operation:

// In single-node mode, VSR degenerates to:
// 1. Append to local log
// 2. Apply immediately
// 3. Return result

impl SingleNodeReplicator {
    fn propose(&mut self, command: Command) -> Result<Position> {
        // Append to the local log and return the assigned position.
        // No Prepare/PrepareOK round trip: a single node is its own quorum.
        let position = self.log.append(command)?;
        Ok(position)
    }
}

No network, no consensus overhead. Perfect for local development.

Safety Guarantees

VSR provides strong safety guarantees:

1. Agreement

Guarantee: If two replicas commit an operation at position P, they commit the same operation.

Replica A at P=100: INSERT patient_id=123
Replica B at P=100: INSERT patient_id=123 (must be identical)

2. Prefix Property

Guarantee: If replica A has committed operations up to position P and replica B up to position P’, then either:

  • A’s log is a prefix of B’s log, or
  • B’s log is a prefix of A’s log

No divergent histories.
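
The property is mechanically checkable. A sketch over two committed logs, with each entry reduced to an opaque hash (illustrative):

/// True when the shorter log is a prefix of the longer one.
fn prefix_consistent(a: &[u64], b: &[u64]) -> bool {
    let n = a.len().min(b.len());
    a[..n] == b[..n]
}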

3. Total Order

Guarantee: All replicas process operations in the same order.

4. Durability

Guarantee: Committed operations survive f failures.

If the cluster has 5 replicas (f=2), any 2 replicas can fail without data loss.

Liveness Guarantees

VSR also provides liveness (progress) guarantees:

1. Eventual Progress

Guarantee: If at most f replicas fail (in a 2f + 1 cluster), the cluster eventually makes progress.

2. View Change Completion

Guarantee: View changes complete within bounded time, assuming partial synchrony (message delays are eventually bounded; no consensus protocol can guarantee progress in a fully asynchronous network).

3. Repair Completion

Guarantee: Healthy replicas can repair lagging replicas.

Fault Tolerance

Cluster Size  f (tolerated)  Quorum  Explanation
1 replica     0              1       No fault tolerance (dev only)
3 replicas    1              2       Tolerate 1 failure
5 replicas    2              3       Tolerate 2 failures (typical)
7 replicas    3              4       Tolerate 3 failures (high availability)

Why odd numbers?

Even-sized clusters waste capacity:

  • 4 replicas (f=1, quorum=3) → same fault tolerance as 3 replicas
  • 6 replicas (f=2, quorum=4) → same fault tolerance as 5 replicas
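
Reusing the sizing helper from Cluster Topology (illustrative) makes the waste explicit:

fn max_failures(replicas: u64) -> u64 {
    (replicas - 1) / 2
}

fn main() {
    assert_eq!(max_failures(4), max_failures(3)); // the 4th replica buys nothing
    assert_eq!(max_failures(6), max_failures(5)); // neither does the 6th
}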

Byzantine Fault Tolerance

Kimberlite does NOT provide BFT. VSR assumes crash faults: replicas fail by stopping, not by behaving maliciously.

However, VOPR testing includes Byzantine scenarios to detect implementation bugs.

Performance Characteristics

Latency:

  • Single-node: ~1ms (log append only)
  • 3-replica cluster (same datacenter): ~2-3ms
  • 3-replica cluster (cross-AZ): ~5-10ms
  • 5-replica cluster (cross-region): ~20-50ms

Throughput:

  • Bottleneck: Primary’s disk write bandwidth
  • Typical: 10k-50k ops/sec per cluster (depends on operation size)

Scalability:

  • VSR does not scale horizontally: every replica stores and applies every operation (unlike a partitioned system such as Cassandra)
  • For higher throughput: Shard across multiple VSR groups
  • See Multi-tenancy for sharding strategies

Comparison with Other Protocols

VSR vs Raft

Feature          VSR                      Raft
Leader election  View-based               Term-based
Log repair       Built-in state transfer  Log catch-up only
Complexity       Simpler (fewer states)   More complex
Production use   TigerBeetle, MemSQL      etcd, CockroachDB, many others

Bottom line: VSR and Raft provide similar guarantees. VSR is slightly simpler; Raft has a larger ecosystem and more tooling.

VSR vs Paxos

Feature            VSR                        Paxos
Model              State machine replication  Consensus on single values
Determinism        Explicit views             Implicit rounds
Understandability  Easier                     Notoriously difficult

Bottom line: Paxos is more general but harder to implement correctly. VSR is purpose-built for replicating state machines.

Further Reading

Academic References

  • Original VSR paper: Oki & Liskov (1988) - “Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems”
  • VRR (Revisited): Liskov & Cowling (2012) - “Viewstamped Replication Revisited”
  • TigerBeetle’s experience: TigerBeetle blog

Key Takeaway: VSR ensures all replicas agree on operation order, even when failures occur. It’s simpler than Raft, proven in production, and purpose-built for state machine replication—exactly what Kimberlite needs.