Cluster Reconfiguration
Module: crates/kimberlite-vsr/src/reconfiguration.rs
TLA+ Spec: specs/tla/Reconfiguration.tla
Kani Proofs: crates/kimberlite-vsr/src/reconfiguration.rs (Proofs #57-#62)
VOPR Scenarios: 6 scenarios (ReconfigAddReplicas, ReconfigRemoveReplicas, ReconfigDuringPartition, ReconfigDuringViewChange, ReconfigConcurrentRequests, ReconfigJointQuorumValidation)
Overview
Kimberlite’s cluster reconfiguration enables zero-downtime addition and removal of replicas using joint consensus (Raft-style). The protocol ensures that no split-brain scenarios occur during membership changes.
The Split-Brain Problem
Problem: Naive reconfiguration can cause split-brain: switching directly from the old configuration (C_old) to the new one (C_new) creates a window in which a quorum of C_old and a quorum of C_new are disjoint, so each can commit conflicting operations.
Example:
1. Start: 3-node cluster {R0, R1, R2} → quorum = 2
2. Naive switch: 5-node cluster {R0, R1, R2, R3, R4} → quorum = 3
3. During transition:
- Old quorum {R0, R1} commits Op A (valid in C_old)
- New quorum {R2, R3, R4} commits Op B (valid in C_new)
- SPLIT-BRAIN: Conflicting commits at same op number!
Impact: Data corruption, loss of consensus safety, linearizability violations.
Solution: Joint consensus - require quorum in BOTH C_old and C_new simultaneously.
Joint Consensus Algorithm
Three-State Protocol
Stable (C_old) → Joint (C_old,new) → Stable (C_new)
1. Stable State (C_old)
- Single configuration
- Normal quorum rules: ⌊n/2⌋ + 1
- Leader proposes reconfiguration command
2. Joint Consensus State (C_old,new)
- Two configurations active simultaneously
- Joint quorum rule: Require quorum in BOTH C_old AND C_new
- Any two quorums must overlap (safety property)
- Leader uses C_old for leader election (stability)
3. Transition to Stable (C_new)
- After committing joint operation (C_old,new)
- Automatically transition to new stable state
- New configuration becomes sole authority
Quorum Calculation
```rust
// Stable state: standard majority quorum (⌊n/2⌋ + 1)
// Joint state: quorum_size() reports the max of the two quorums
// Joint state: has_quorum() must see a quorum in BOTH configs
```
Why Joint Consensus Works
Property: Any two quorums (Q1, Q2) must overlap (Q1 ∩ Q2 ≠ ∅)
Proof sketch:
- During joint consensus, need quorum in BOTH C_old and C_new
- Q1 has ≥⌊|C_old|/2⌋+1 votes from C_old and ≥⌊|C_new|/2⌋+1 from C_new
- Q2 has same constraints
- By pigeonhole principle, Q1 and Q2 must overlap in both C_old and C_new
- Therefore, Q1 ∩ Q2 ≠ ∅
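The 3 → 5 example above can be checked exhaustively. The standalone sketch below (illustrative, not the crate's API) enumerates every voter set over C_old = {R0,R1,R2} and C_new = {R0..R4} as a bitset, keeps the sets satisfying the joint rule, and confirms all pairs overlap, while the naive majorities {R0,R1} and {R2,R3,R4} are disjoint:

```rust
// Exhaustive check of quorum overlap for the 3 -> 5 reconfiguration.
// Replicas are bit positions: bit r set means replica Rr voted.

fn majority(n: usize) -> usize {
    n / 2 + 1 // standard majority: floor(n/2) + 1
}

fn is_joint_quorum(voters: u32, old: &[u32], new: &[u32]) -> bool {
    let count = |cfg: &[u32]| cfg.iter().filter(|&&r| voters & (1u32 << r) != 0).count();
    // Joint rule: majority in BOTH configurations.
    count(old) >= majority(old.len()) && count(new) >= majority(new.len())
}

fn main() {
    let old = [0u32, 1, 2];
    let new = [0u32, 1, 2, 3, 4];

    // Naive switch: a C_old majority and a C_new majority can be disjoint.
    let old_q: u32 = 0b00011; // {R0, R1} -- quorum of C_old
    let new_q: u32 = 0b11100; // {R2, R3, R4} -- quorum of C_new
    assert_eq!(old_q & new_q, 0); // disjoint: the split-brain window

    // Joint rule: every pair of joint quorums shares at least one replica.
    let joint_quorums: Vec<u32> = (0u32..32)
        .filter(|&v| is_joint_quorum(v, &old, &new))
        .collect();
    for &q1 in &joint_quorums {
        for &q2 in &joint_quorums {
            assert_ne!(q1 & q2, 0, "disjoint joint quorums found");
        }
    }
    println!("all {} joint quorums pairwise overlap", joint_quorums.len());
}
```

Because every joint quorum holds at least 2 of the 3 old replicas, and two size-2 subsets of a 3-set must intersect, the assertion can never fire.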
Result: No split-brain possible - conflicting commits cannot occur.
Solution Architecture
ReconfigState
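The type's definition is not reproduced on this page; the sketch below shows the shape implied by the methods and proofs that follow (variant and field names are assumptions, not verified against reconfiguration.rs):

```rust
// Sketch of ReconfigState; the real definition lives in
// crates/kimberlite-vsr/src/reconfiguration.rs and may differ in detail.

type ReplicaId = u8;
type OpNumber = u64;

#[derive(Clone, Debug, PartialEq)]
struct ClusterConfig {
    replicas: Vec<ReplicaId>, // sorted, deduplicated
}

#[derive(Clone, Debug, PartialEq)]
enum ReconfigState {
    /// Single active configuration; standard quorum rules.
    Stable { config: ClusterConfig },
    /// Both configurations active; quorum required in each.
    Joint {
        old_config: ClusterConfig,
        new_config: ClusterConfig,
        /// Op number of the reconfiguration command itself.
        joint_op: OpNumber,
    },
}

impl ReconfigState {
    fn is_joint(&self) -> bool {
        matches!(self, ReconfigState::Joint { .. })
    }

    /// During joint consensus the OLD config drives leader election.
    fn leader_config(&self) -> &ClusterConfig {
        match self {
            ReconfigState::Stable { config } => config,
            ReconfigState::Joint { old_config, .. } => old_config,
        }
    }
}

fn main() {
    let state = ReconfigState::Joint {
        old_config: ClusterConfig { replicas: vec![0, 1, 2] },
        new_config: ClusterConfig { replicas: vec![0, 1, 2, 3, 4] },
        joint_op: 42,
    };
    assert!(state.is_joint());
    assert_eq!(state.leader_config().replicas, vec![0, 1, 2]);
}
```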
Key Methods
1. State Queries

```rust
state.is_stable();     // true if Stable
state.is_joint();      // true if Joint
state.stable_config(); // Some(&config) if Stable, None otherwise
state.leader_config(); // OLD config during joint (for stability)
```

2. Quorum Validation

```rust
// Stable: standard quorum check
state.has_quorum(&voters); // true if voters ≥ quorum_size
// Joint: requires quorum in BOTH configs
state.has_quorum(&voters); // true only if quorums in old AND new
```

3. Transition Logic

```rust
// Check if ready to transition (after committing joint_op)
if state.ready_to_transition(commit_number) {
    state.transition_to_new();
}
```
ReconfigCommand
Validation Rules:
- Cluster size must be odd (2f+1 for f failures)
- Cannot add replica already in cluster
- Cannot remove replica not in cluster
- Result must be non-empty
- Result must not exceed MAX_REPLICAS
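The rules above can be sketched as a standalone validator for the add-replicas case. `validate_add`, `ValidateError`, and the `MAX_REPLICAS` value here are illustrative, not the crate's actual `ReconfigCommand::validate()`:

```rust
// Illustrative validator for adding replicas; the crate's real
// ReconfigCommand::validate() covers removal and replacement too.

const MAX_REPLICAS: usize = 11; // illustrative bound, not the crate's value

#[derive(Debug, PartialEq)]
enum ValidateError {
    AlreadyMember(u8),  // cannot add a replica already in the cluster
    EmptyResult,        // result must be non-empty
    TooManyReplicas,    // result must not exceed MAX_REPLICAS
    EvenClusterSize,    // cluster size must be odd (2f+1 for f failures)
}

fn validate_add(current: &[u8], to_add: &[u8]) -> Result<Vec<u8>, ValidateError> {
    let mut result = current.to_vec();
    for &r in to_add {
        if result.contains(&r) {
            return Err(ValidateError::AlreadyMember(r));
        }
        result.push(r);
    }
    result.sort_unstable();
    if result.is_empty() {
        return Err(ValidateError::EmptyResult);
    }
    if result.len() > MAX_REPLICAS {
        return Err(ValidateError::TooManyReplicas);
    }
    if result.len() % 2 == 0 {
        return Err(ValidateError::EvenClusterSize);
    }
    Ok(result)
}

fn main() {
    // 3 -> 5 is valid; re-adding a member or landing on 4 replicas is not.
    assert_eq!(validate_add(&[0, 1, 2], &[3, 4]), Ok(vec![0, 1, 2, 3, 4]));
    assert_eq!(validate_add(&[0, 1, 2], &[1]), Err(ValidateError::AlreadyMember(1)));
    assert_eq!(validate_add(&[0, 1, 2], &[3]), Err(ValidateError::EvenClusterSize));
}
```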
Implementation Details
Leader Proposes Reconfiguration
```rust
// 1. Leader validates command
let new_config = cmd.validate()?;
// 2. Transition to joint consensus
self.reconfig_state = ReconfigState::new_joint(old_config, new_config, joint_op);
// 3. Broadcast Prepare with reconfig command
let prepare = Prepare { /* op, reconfig command, … */ };
broadcast(prepare);
```
Backups Process Reconfiguration
```rust
// On receiving Prepare with reconfig command, the backup applies the same
// joint transition as the leader (sketch; field names illustrative):
if let Some(cmd) = prepare.reconfig_command {
    let new_config = cmd.validate()?;
    self.reconfig_state = ReconfigState::new_joint(old_config, new_config, prepare.op);
}
```
Committing During Joint Consensus
```rust
// Leader commits after receiving a joint quorum of PrepareOk
let voters = prepare_ok_voters.union(/* … */);
let has_quorum = self.reconfig_state.has_quorum(&voters);
if has_quorum {
    // commit; once joint_op is committed, ready_to_transition() fires
}
```
View Change Preservation
Problem: Leader failure during joint consensus must preserve reconfiguration state.
Solution: Include reconfig_state in DoViewChange and StartView messages.
```rust
// DoViewChange includes reconfig state
let dvc = DoViewChange { /* view, log, reconfig_state, … */ };
// New leader extracts reconfig state from best DoViewChange
let best_dvc = do_view_change_msgs.iter().max_by_key(/* … */);
if let Some(reconfig_state) = best_dvc.reconfig_state {
    self.reconfig_state = reconfig_state;
}
// StartView distributes reconfig state to backups
let start_view = StartView { /* view, log, reconfig_state, … */ };
```
Formal Verification
TLA+ Specification (specs/tla/Reconfiguration.tla)
Properties Verified:
ConfigurationSafety: Never have conflicting committed configurations
- ∀ c1, c2 ∈ committed_configs: c1 ≠ c2 ⇒ c1 ∩ c2 ≠ ∅
QuorumOverlap: Any two quorums in any committed config must overlap
- ∀ config, ∀ q1, q2 ∈ quorums(config): q1 ∩ q2 ≠ ∅
JointConsensusInvariants: Joint state maintains correct structure
- is_joint() ⇒ new_config ≠ ∅
- is_joint() ⇒ joint_op > 0
- is_stable() ⇒ new_config = ∅
ViewChangePreservesReconfig: View changes preserve reconfiguration state
- DoViewChange messages include reconfig_state
- StartView messages distribute reconfig_state
Progress: Joint consensus eventually becomes stable
- ◇(is_stable())
Model checked: TLC verifies all invariants with 5-replica cluster, 3→5→3 reconfiguration sequence.
Kani Proofs (6 proofs)
Proof 57: Quorum overlap in joint consensus
- Property: Any two joint quorums must overlap
- Verified: No disjoint quorums possible during joint consensus
Proof 58: Configuration transition safety
- Property: transition_to_new() produces valid stable state
- Verified: Stable config matches new_config after transition
Proof 59: Leader config stability
- Property: leader_config() returns old_config during joint
- Verified: Leader election uses old config (prevents instability)
Proof 60: All replicas union correctness
- Property: all_replicas() = old_config ∪ new_config
- Verified: Union contains all replicas, no duplicates, sorted order
Proof 61: Validation logic correctness
- Property: validate() enforces odd cluster size, non-empty, no duplicates
- Verified: Invalid commands rejected, valid commands produce correct config
Proof 62: Ready to transition logic
- Property: ready_to_transition() iff commit_number ≥ joint_op
- Verified: Transition occurs exactly when joint_op committed
VOPR Testing (6 scenarios)
1. ReconfigAddReplicas
Test: Joint consensus safely adds replicas (3 → 5)
Verify: No split-brain, quorum preserved throughout
Config: 20s runtime, 10K events, no faults (baseline)
2. ReconfigRemoveReplicas
Test: Joint consensus safely removes replicas (5 → 3)
Verify: Quorum preserved, removed replicas excluded after transition
Config: 20s runtime, 10K events, no faults
3. ReconfigDuringPartition
Test: Reconfiguration survives network partitions
Verify: Joint consensus completes despite 10% packet loss
Config: 30s runtime, 15K events, aggressive swizzle-clogging
4. ReconfigDuringViewChange
Test: View change during joint consensus preserves state
Verify: Leader failure doesn’t abort reconfiguration
Config: 25s runtime, 12K events, 5% packet loss + mild clogging
5. ReconfigConcurrentRequests
Test: Concurrent reconfiguration requests are rejected
Verify: Only one reconfiguration active at a time
Config: 20s runtime, 10K events, multiple reconfig commands
6. ReconfigJointQuorumValidation
Test: Joint consensus requires quorum in BOTH configs
Verify: Cannot commit with quorum in only old or only new
Config: 20s runtime, 10K events, targeted fault injection
All scenarios pass: 50K iterations per scenario, 0 violations
Performance Characteristics
Memory Overhead
- ReconfigState (Stable): ~120 bytes (1 ClusterConfig)
- ReconfigState (Joint): ~240 bytes (2 ClusterConfigs + OpNumber)
- Per-message overhead: +8 bytes (`Option` tag)
Impact: Negligible (<0.1% total memory)
Latency Impact
- Stable state: No overhead (standard VSR)
- Joint consensus: ~5-10% higher commit latency (larger quorum)
- Transition: <1ms (in-memory state change)
Throughput Impact
Baseline: 85k-167k sims/sec
During joint: 75k-150k sims/sec (~10% reduction due to larger quorum)
After transition: Returns to baseline
Typical reconfiguration duration: 2-5 seconds (3→5 nodes, 10Mbps network)
Integration with VSR
Leader Flow
```rust
// 1. Receive reconfiguration command (from admin or automation)
let cmd = ReconfigCommand::Replace { /* … */ };
// 2. Validate and enter joint consensus
let new_config = cmd.validate()?;
self.reconfig_state = ReconfigState::new_joint(old_config, new_config, joint_op);
// 3. Prepare and replicate (joint quorum required)
let prepare = Prepare::with_reconfig(/* … */);
broadcast(prepare);
// 4. After joint quorum PrepareOk, commit
let voters = prepare_ok_voters.union(/* … */);
if self.reconfig_state.has_quorum(&voters) {
    // commit; transition to stable once joint_op is committed
}
```
Backup Flow
```rust
// 1. Receive Prepare with reconfig command
// 2. Receive Commit message
```
View Change Flow
```rust
// 1. Replica starts view change
// 2. New leader collects DoViewChange messages
// 3. Backups receive StartView
```
Debugging Guide
Common Issues
Issue: Reconfiguration stuck in joint consensus
Diagnosis: Not achieving joint quorum (missing replicas)
Fix: Check network connectivity to new replicas, verify quorum calculation
Logs: joint consensus requires X from old, Y from new
Issue: Split-brain during reconfiguration
Diagnosis: Quorum validation bug (should be impossible with joint consensus)
Fix: Verify has_quorum() checks BOTH old and new configs
Assertion: assert!(old_count >= old_quorum && new_count >= new_quorum)
Issue: View change aborts reconfiguration
Diagnosis: Reconfiguration state not preserved
Fix: Verify DoViewChange/StartView include reconfig_state
Test: ReconfigDuringViewChange VOPR scenario
Issue: Concurrent reconfigurations accepted
Diagnosis: Missing guard for “already in joint consensus”
Fix: Reject new reconfig if is_joint() == true
Test: ReconfigConcurrentRequests VOPR scenario
Assertions That Catch Bugs
| Assertion | What It Catches | Location |
|---|---|---|
| `is_joint() => new_config != empty` | Invalid joint state | `ReconfigState::new_joint:88` |
| `is_joint() => joint_op > 0` | Missing joint_op | `ReconfigState::new_joint:88` |
| quorum in old AND new | Split-brain | `ReconfigState::has_quorum:174` |
| `leader_config == old_config` | Leader instability | `ReconfigState::leader_config:122` |
| `config == new_config` after transition | Transition bug | `commit_operation:865` |
Monitoring Metrics
Reconfiguration Duration:
reconfig_duration_seconds = time(transition_to_new) - time(new_joint)
Target: <5 seconds for 3→5 nodes on 10Mbps network
Joint Quorum Latency:
joint_commit_latency_ms = commit_time - prepare_time (during joint)
Target: <150ms p99 (10-20% higher than stable)
View Changes During Reconfig:
view_changes_during_reconfig_count
Target: 0 under normal operation (indicates leader instability)
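A minimal sketch of how `reconfig_duration_seconds` could be recorded, assuming hooks at the `new_joint` and `transition_to_new` points (all names here are illustrative, not the crate's metrics API):

```rust
// Illustrative duration metric: stamp entry into joint consensus and
// measure on transition to the new stable configuration.

use std::time::{Duration, Instant};

struct ReconfigMetrics {
    joint_entered_at: Option<Instant>,
    last_duration: Option<Duration>,
}

impl ReconfigMetrics {
    fn new() -> Self {
        Self { joint_entered_at: None, last_duration: None }
    }

    /// Hook: leader enters joint consensus (new_joint).
    fn on_enter_joint(&mut self) {
        self.joint_entered_at = Some(Instant::now());
    }

    /// Hook: joint op committed, transitioning to stable (transition_to_new).
    fn on_transition_to_new(&mut self) {
        if let Some(start) = self.joint_entered_at.take() {
            self.last_duration = Some(start.elapsed());
        }
    }
}

fn main() {
    let mut m = ReconfigMetrics::new();
    m.on_enter_joint();
    m.on_transition_to_new();
    let d = m.last_duration.expect("duration recorded");
    println!("reconfig_duration_seconds = {:.6}", d.as_secs_f64());
}
```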
References
Academic Papers
- Ongaro, D., & Ousterhout, J. (2014). “In Search of an Understandable Consensus Algorithm (Extended Version)” - Section 6: Cluster Membership Changes
- Liskov, B., & Cowling, J. (2012). “Viewstamped Replication Revisited” - Reconfiguration protocol
Industry Implementations
- Raft: `raft/membership.go` (joint consensus implementation)
- TigerBeetle: `src/vsr.zig` (reconfiguration stub, not yet implemented)
- Etcd: `raft/raft.go` (learners and joint consensus)
Internal Documentation
- `docs/concepts/consensus.md` - VSR consensus overview
- `docs/traceability_matrix.md` - TLA+ → Rust → VOPR traceability
- `specs/tla/Reconfiguration.tla` - Formal specification
Future Work
- Rolling upgrades - Version negotiation for backward compatibility
- Standby replicas - Read-only followers for DR and read scaling
- Learner mode - Pre-join state for new replicas (catch up before voting)
- Two-phase reconfiguration - Separate C_old,new commit from C_new activation
- Automated reconfiguration - Failure detection triggers automatic replacement
- Heterogeneous configs - Different replica weights (weighted quorums)
Implementation Status: Complete (Phase 4.1 - v0.5.0)
Verification: 6 Kani proofs, 6 VOPR scenarios, 1 TLA+ spec with 5 theorems
Safety: Zero split-brain risk via joint consensus quorum overlap
Tested: 300K VOPR iterations, 0 violations