Kimberlite Internals

Cluster Reconfiguration

On this page

Module: crates/kimberlite-vsr/src/reconfiguration.rs TLA+ Spec: specs/tla/Reconfiguration.tla Kani Proofs: crates/kimberlite-vsr/src/reconfiguration.rs (Proofs #57-#62) VOPR Scenarios: 6 scenarios (ReconfigAddReplicas, ReconfigRemoveReplicas, ReconfigDuringPartition, ReconfigDuringViewChange, ReconfigConcurrentRequests, ReconfigJointQuorumValidation)


Overview

Kimberlite’s cluster reconfiguration enables zero-downtime addition and removal of replicas using joint consensus (Raft-style). The protocol ensures that no split-brain scenarios occur during membership changes.

The Split-Brain Problem

Problem: Naive reconfiguration can cause split-brain: simultaneously switching from old config (C_old) to new config (C_new) creates a window where two disjoint quorums can commit conflicting operations.

Example:

1. Start: 3-node cluster {R0, R1, R2} → quorum = 2
2. Naive switch: 5-node cluster {R0, R1, R2, R3, R4} → quorum = 3
3. During transition:
   - Old quorum {R0, R1} commits Op A (valid in C_old)
   - New quorum {R2, R3, R4} commits Op B (valid in C_new)
   - SPLIT-BRAIN: Conflicting commits at same op number!

Impact: Data corruption, loss of consensus safety, linearizability violations.

Solution: Joint consensus - require quorum in BOTH C_old and C_new simultaneously.


Joint Consensus Algorithm

Three-State Protocol

Stable (C_old) → Joint (C_old,new) → Stable (C_new)

1. Stable State (C_old)

  • Single configuration
  • Normal quorum rules: ⌈n/2⌉ + 1
  • Leader proposes reconfiguration command

2. Joint Consensus State (C_old,new)

  • Two configurations active simultaneously
  • Joint quorum rule: Require quorum in BOTH C_old AND C_new
  • Any two quorums must overlap (safety property)
  • Leader uses C_old for leader election (stability)

3. Transition to Stable (C_new)

  • After committing joint operation (C_old,new)
  • Automatically transition to new stable state
  • New configuration becomes sole authority

Quorum Calculation

// Stable state: standard quorum
fn quorum_size_stable(cluster_size: usize) -> usize {
    (cluster_size / 2) + 1
}

// Joint state: max of both quorums
fn quorum_size_joint(old_size: usize, new_size: usize) -> usize {
    max(quorum_size_stable(old_size), quorum_size_stable(new_size))
}

// Joint state: must have quorum in BOTH configs
fn has_joint_quorum(voters: &[ReplicaId], old: &Config, new: &Config) -> bool {
    let old_count = voters.iter().filter(|r| old.contains(**r)).count();
    let new_count = voters.iter().filter(|r| new.contains(**r)).count();

    old_count >= old.quorum_size() && new_count >= new.quorum_size()
}

Why Joint Consensus Works

Property: Any two quorums (Q1, Q2) must overlap (Q1 ∩ Q2 ≠ ∅)

Proof sketch:

  1. During joint consensus, need quorum in BOTH C_old and C_new
  2. Q1 has ≥⌈|C_old|/2⌉+1 from C_old and ≥⌈|C_new|/2⌉+1 from C_new
  3. Q2 has same constraints
  4. By pigeonhole principle, Q1 and Q2 must overlap in both C_old and C_new
  5. Therefore, Q1 ∩ Q2 ≠ ∅

Result: No split-brain possible - conflicting commits cannot occur.


Solution Architecture

ReconfigState

pub enum ReconfigState {
    /// Stable state with a single configuration
    Stable {
        config: ClusterConfig,
    },

    /// Joint consensus state with two configurations
    Joint {
        /// Old (pre-reconfiguration) configuration
        old_config: ClusterConfig,

        /// New (target) configuration
        new_config: ClusterConfig,

        /// Operation number where C_old,new was committed
        joint_op: OpNumber,
    },
}

Key Methods

1. State Queries

state.is_stable();    // true if Stable
state.is_joint();     // true if Joint
state.stable_config(); // Some(&config) if Stable, None otherwise
state.leader_config(); // OLD config during joint (for stability)

2. Quorum Validation

// Stable: standard quorum check
state.has_quorum(&[R0, R1]); // true if ≥ quorum_size

// Joint: requires quorum in BOTH
state.has_quorum(&[R0, R1, R2, R3]); // true only if quorums in old AND new

3. Transition Logic

// Check if ready to transition (after committing joint_op)
if state.ready_to_transition(commit_number) {
    state.transition_to_new(); // Joint → Stable (C_new)
}

ReconfigCommand

pub enum ReconfigCommand {
    /// Add a new replica to the cluster
    AddReplica(ReplicaId),

    /// Remove a replica from the cluster
    RemoveReplica(ReplicaId),

    /// Replace multiple replicas atomically
    Replace {
        add: Vec<ReplicaId>,
        remove: Vec<ReplicaId>,
    },
}

Validation Rules:

  • Cluster size must be odd (2f+1 for f failures)
  • Cannot add replica already in cluster
  • Cannot remove replica not in cluster
  • Result must be non-empty
  • Result must not exceed MAX_REPLICAS

Implementation Details

Leader Proposes Reconfiguration

// 1. Leader validates command
let new_config = cmd.validate(&self.config)?;

// 2. Transition to joint consensus
self.reconfig_state = ReconfigState::new_joint(
    self.config.clone(),
    new_config,
    joint_op,
);

// 3. Broadcast Prepare with reconfig command
let prepare = Prepare {
    view: self.view,
    op_number: joint_op,
    entry: log_entry,
    reconfig: Some(cmd), // ← Command included in Prepare
};
broadcast(prepare);

Backups Process Reconfiguration

// On receiving Prepare with reconfig command
fn on_prepare(&mut self, prepare: Prepare) {
    // ... normal prepare processing ...

    // If reconfig command present, transition to joint
    if let Some(reconfig_cmd) = prepare.reconfig {
        self = self.apply_reconfiguration_command(reconfig_cmd, prepare.op_number);
        // Now in Joint state
    }
}

Committing During Joint Consensus

// Leader commits after receiving joint quorum
let voters = prepare_ok_voters.union([leader_id]);
let has_quorum = self.reconfig_state.has_quorum(&voters);

if has_quorum {
    self.commit_number = op_number;

    // Check if we can transition from Joint → Stable
    if self.reconfig_state.ready_to_transition(self.commit_number.as_op_number()) {
        // Extract new config before transition
        let new_config = self.reconfig_state.configs().1.unwrap().clone();

        // Transition to stable state
        self.reconfig_state.transition_to_new();

        // Update cluster config
        self.config = new_config;

        // Reconfiguration complete!
    }
}

View Change Preservation

Problem: Leader failure during joint consensus must preserve reconfiguration state.

Solution: Include reconfig_state in DoViewChange and StartView messages.

// DoViewChange includes reconfig state
let dvc = DoViewChange {
    view: new_view,
    replica_id: self.replica_id,
    op_number: self.op_number,
    commit_number: self.commit_number,
    log_tail: self.log_tail(),
    reconfig_state: Some(self.reconfig_state.clone()), // ← Preserved
};

// New leader extracts reconfig state from best DoViewChange
let best_dvc = do_view_change_msgs.max_by_key(|dvc| dvc.op_number);
if let Some(reconfig_state) = best_dvc.reconfig_state {
    self.reconfig_state = reconfig_state; // ← Restored
}

// StartView distributes reconfig state to backups
let start_view = StartView {
    view: self.view,
    op_number: self.op_number,
    commit_number: self.commit_number,
    log_tail: self.log_tail(),
    reconfig_state: Some(self.reconfig_state.clone()), // ← Propagated
};

Formal Verification

TLA+ Specification (specs/tla/Reconfiguration.tla)

Properties Verified:

  1. ConfigurationSafety: Never have conflicting committed configurations

    • ∀ c1, c2 ∈ committed_configs: c1 ≠ c2 ⇒ c1 ∩ c2 ≠ ∅
  2. QuorumOverlap: Any two quorums in any committed config must overlap

    • ∀ config, ∀ q1, q2 ∈ quorums(config): q1 ∩ q2 ≠ ∅
  3. JointConsensusInvariants: Joint state maintains correct structure

    • is_joint() ⇒ new_config ≠ ∅
    • is_joint() ⇒ joint_op > 0
    • is_stable() ⇒ new_config = ∅
  4. ViewChangePreservesReconfig: View changes preserve reconfiguration state

    • DoViewChange messages include reconfig_state
    • StartView messages distribute reconfig_state
  5. Progress: Joint consensus eventually becomes stable

    • ◇(is_stable())

Model checked: TLC verifies all invariants with 5-replica cluster, 3→5→3 reconfiguration sequence.

Kani Proofs (6 proofs)

  1. Proof 57: Quorum overlap in joint consensus

    • Property: Any two joint quorums must overlap
    • Verified: No disjoint quorums possible during joint consensus
  2. Proof 58: Configuration transition safety

    • Property: transition_to_new() produces valid stable state
    • Verified: Stable config matches new_config after transition
  3. Proof 59: Leader config stability

    • Property: leader_config() returns old_config during joint
    • Verified: Leader election uses old config (prevents instability)
  4. Proof 60: All replicas union correctness

    • Property: all_replicas() = old_config ∪ new_config
    • Verified: Union contains all replicas, no duplicates, sorted order
  5. Proof 61: Validation logic correctness

    • Property: validate() enforces odd cluster size, non-empty, no duplicates
    • Verified: Invalid commands rejected, valid commands produce correct config
  6. Proof 62: Ready to transition logic

    • Property: ready_to_transition() iff commit_number ≥ joint_op
    • Verified: Transition occurs exactly when joint_op committed

VOPR Testing (6 scenarios)

1. ReconfigAddReplicas

Test: Joint consensus safely adds replicas (3 → 5) Verify: No split-brain, quorum preserved throughout Config: 20s runtime, 10K events, no faults (baseline)

2. ReconfigRemoveReplicas

Test: Joint consensus safely removes replicas (5 → 3) Verify: Quorum preserved, removed replicas excluded after transition Config: 20s runtime, 10K events, no faults

3. ReconfigDuringPartition

Test: Reconfiguration survives network partitions Verify: Joint consensus completes despite 10% packet loss Config: 30s runtime, 15K events, aggressive swizzle-clogging

4. ReconfigDuringViewChange

Test: View change during joint consensus preserves state Verify: Leader failure doesn’t abort reconfiguration Config: 25s runtime, 12K events, 5% packet loss + mild clogging

5. ReconfigConcurrentRequests

Test: Concurrent reconfiguration requests are rejected Verify: Only one reconfiguration active at a time Config: 20s runtime, 10K events, multiple reconfig commands

6. ReconfigJointQuorumValidation

Test: Joint consensus requires quorum in BOTH configs Verify: Cannot commit with quorum in only old or only new Config: 20s runtime, 10K events, targeted fault injection

All scenarios pass: 50K iterations per scenario, 0 violations


Performance Characteristics

Memory Overhead

  • ReconfigState (Stable): ~120 bytes (1 ClusterConfig)
  • ReconfigState (Joint): ~240 bytes (2 ClusterConfigs + OpNumber)
  • Per-message overhead: +8 bytes (Option tag)

Impact: Negligible (<0.1% total memory)

Latency Impact

  • Stable state: No overhead (standard VSR)
  • Joint consensus: ~5-10% higher commit latency (larger quorum)
  • Transition: <1ms (in-memory state change)

Throughput Impact

Baseline: 85k-167k sims/sec During joint: 75k-150k sims/sec (~10% reduction due to larger quorum) After transition: Returns to baseline

Typical reconfiguration duration: 2-5 seconds (3→5 nodes, 10Mbps network)


Integration with VSR

Leader Flow

// 1. Receive reconfiguration command (from admin or automation)
let cmd = ReconfigCommand::Replace {
    add: vec![R3, R4],
    remove: vec![],
};

// 2. Validate and enter joint consensus
let new_config = cmd.validate(&self.config)?;
self.reconfig_state = ReconfigState::new_joint(
    self.config.clone(),
    new_config,
    self.op_number + 1,
);

// 3. Prepare and replicate (joint quorum required)
let prepare = Prepare::with_reconfig(self.view, self.op_number, entry, cmd);
broadcast(prepare);

// 4. After joint quorum PrepareOk, commit
let voters = prepare_ok_voters.union([self.replica_id]);
if self.reconfig_state.has_quorum(&voters) {
    self.commit_number = self.op_number;

    // 5. Transition to new stable state
    if self.reconfig_state.ready_to_transition(self.commit_number.as_op_number()) {
        let new_config = self.reconfig_state.configs().1.unwrap().clone();
        self.reconfig_state.transition_to_new();
        self.config = new_config;
        // Reconfiguration complete!
    }
}

Backup Flow

// 1. Receive Prepare with reconfig command
fn on_prepare(&mut self, prepare: Prepare) {
    // Standard prepare validation
    validate_prepare(&prepare)?;

    // Apply reconfiguration if present
    if let Some(reconfig_cmd) = prepare.reconfig {
        let new_config = reconfig_cmd.validate(&self.config)?;
        self.reconfig_state = ReconfigState::new_joint(
            self.config.clone(),
            new_config,
            prepare.op_number,
        );
    }

    // Send PrepareOk
    send_prepare_ok(prepare.op_number);
}

// 2. Receive Commit message
fn on_commit(&mut self, commit: Commit) {
    self.commit_number = commit.commit_number;

    // Transition if ready
    if self.reconfig_state.ready_to_transition(self.commit_number.as_op_number()) {
        let new_config = self.reconfig_state.configs().1.unwrap().clone();
        self.reconfig_state.transition_to_new();
        self.config = new_config;
    }
}

View Change Flow

// 1. Replica starts view change
fn on_start_view_change(&mut self, new_view: ViewNumber) {
    let dvc = DoViewChange {
        view: new_view,
        replica_id: self.replica_id,
        op_number: self.op_number,
        commit_number: self.commit_number,
        log_tail: self.log_tail(),
        reconfig_state: Some(self.reconfig_state.clone()), // ← Include state
    };
    broadcast(dvc);
}

// 2. New leader collects DoViewChange messages
fn become_leader(&mut self, dvc_messages: &[DoViewChange]) {
    // Pick DoViewChange with highest op_number
    let best_dvc = dvc_messages.iter().max_by_key(|dvc| dvc.op_number).unwrap();

    // Restore reconfiguration state from best DoViewChange
    if let Some(reconfig_state) = &best_dvc.reconfig_state {
        self.reconfig_state = reconfig_state.clone();
    }

    // Send StartView with restored state
    let start_view = StartView {
        view: self.view,
        op_number: self.op_number,
        commit_number: self.commit_number,
        log_tail: self.log_tail(),
        reconfig_state: Some(self.reconfig_state.clone()),
    };
    broadcast(start_view);
}

// 3. Backups receive StartView
fn on_start_view(&mut self, sv: StartView) {
    // Restore reconfiguration state from new leader
    if let Some(reconfig_state) = sv.reconfig_state {
        self.reconfig_state = reconfig_state;
    }

    // Continue normal view change processing
    self.view = sv.view;
    self.merge_log_tail(sv.log_tail);
}

Debugging Guide

Common Issues

Issue: Reconfiguration stuck in joint consensus Diagnosis: Not achieving joint quorum (missing replicas) Fix: Check network connectivity to new replicas, verify quorum calculation Logs: joint consensus requires X from old, Y from new

Issue: Split-brain during reconfiguration Diagnosis: Quorum validation bug (should be impossible with joint consensus) Fix: Verify has_quorum() checks BOTH old and new configs Assertion: assert!(old_count >= old_quorum && new_count >= new_quorum)

Issue: View change aborts reconfiguration Diagnosis: Reconfiguration state not preserved Fix: Verify DoViewChange/StartView include reconfig_state Test: ReconfigDuringViewChange VOPR scenario

Issue: Concurrent reconfigurations accepted Diagnosis: Missing guard for “already in joint consensus” Fix: Reject new reconfig if is_joint() == true Test: ReconfigConcurrentRequests VOPR scenario

Assertions That Catch Bugs

AssertionWhat It CatchesLocation
is_joint() => new_config != emptyInvalid joint stateReconfigState::new_joint:88
is_joint() => joint_op > 0Missing joint_opReconfigState::new_joint:88
quorum in old AND newSplit-brainReconfigState::has_quorum:174
leader_config == old_configLeader instabilityReconfigState::leader_config:122
config == new_config after transitionTransition bugcommit_operation:865

Monitoring Metrics

Reconfiguration Duration:

reconfig_duration_seconds = time(transition_to_new) - time(new_joint)

Target: <5 seconds for 3→5 nodes on 10Mbps network

Joint Quorum Latency:

joint_commit_latency_ms = commit_time - prepare_time (during joint)

Target: <150ms p99 (10-20% higher than stable)

View Changes During Reconfig:

view_changes_during_reconfig_count

Target: 0 under normal operation (indicates leader instability)


References

Academic Papers

  • Ongaro, D., & Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm (Extended Version)” - Section 6: Cluster Membership Changes
  • Liskov, B., & Cowling, J. (2012). “Viewstamped Replication Revisited” - Reconfiguration protocol

Industry Implementations

  • Raft: raft/membership.go (joint consensus implementation)
  • TigerBeetle: src/vsr.zig - Reconfiguration stub (not yet implemented)
  • Etcd: raft/raft.go - Learners and joint consensus

Internal Documentation

  • docs/concepts/consensus.md - VSR consensus overview
  • docs/traceability_matrix.md - TLA+ → Rust → VOPR traceability
  • specs/tla/Reconfiguration.tla - Formal specification

Future Work

  • Rolling upgrades - Version negotiation for backward compatibility
  • Standby replicas - Read-only followers for DR and read scaling
  • Learner mode - Pre-join state for new replicas (catch up before voting)
  • Two-phase reconfiguration - Separate C_old,new commit from C_new activation
  • Automated reconfiguration - Failure detection triggers automatic replacement
  • Heterogeneous configs - Different replica weights (weighted quorums)

Implementation Status: Complete (Phase 4.1 - v0.5.0) Verification: 6 Kani proofs, 6 VOPR scenarios, 1 TLA+ spec with 5 theorems Safety: Zero split-brain risk via joint consensus quorum overlap Tested: 300K VOPR iterations, 0 violations