Cluster Reconfiguration Architecture Design
Date: 2026-02-05
Status: Design Phase
Target: Phase 4 - Cluster Operations
Overview
This document specifies the architecture for VSR cluster reconfiguration, enabling zero-downtime addition and removal of replicas. The design is based on Raft’s joint consensus algorithm, adapted for Viewstamped Replication.
Goals
- Zero-Downtime Reconfiguration - Add/remove replicas without service interruption
- Safety Preservation - Never violate VSR safety guarantees during transitions
- Progress Guarantee - Reconfigurations always complete or safely abort
- Simple API - Easy-to-use reconfiguration commands
Non-Goals (Future Work)
- Simultaneous multiple reconfigurations (only one at a time)
- Automatic cluster scaling based on load
- Cross-datacenter replication topology
- Dynamic leader rebalancing
Background: Joint Consensus
Why Joint Consensus?
The naive approach (a direct, one-step switch from C_old to C_new) is unsafe:
- During transition, two disjoint quorums can form
- Split-brain scenario: both can commit conflicting operations
- Example: 3-node cluster → 5-node cluster
- Old quorum: 2 of {A,B,C}
- New quorum: 3 of {A,B,C,D,E}
- If only A,B,C are online, they form old quorum
- If D,E come online before A,B,C update, they can’t form new quorum
- But if some nodes use old config and others use new, split-brain!
Raft’s Joint Consensus Solution
Three-state transition:
- C_old (Stable) - All nodes use old configuration
- C_old,new (Joint) - Nodes use BOTH configurations, require quorum in BOTH
- C_new (Stable) - All nodes use new configuration
Key Invariant: During joint consensus, operations require quorum in BOTH old and new configurations. This prevents split-brain because no disjoint quorums can form.
Transition Protocol:
C_old --(propose C_old,new)--> C_old,new --(commit C_old,new)--> C_new
                                   ▲
                                   │
                         Quorum in BOTH configs
- Leader in C_old proposes C_old,new
- C_old,new gets committed (requires quorum in C_old AND C_new)
- Once C_old,new committed, automatically transition to C_new
- C_new becomes the new stable configuration
Kimberlite VSR Adaptation
State Machine
States:
- Stable: Normal operation, single configuration
- Joint: Temporary transition state, dual configurations
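The two states above can be sketched as a small Rust enum. This is an illustrative sketch only; the type names (Configuration, ReconfigState) are assumptions, not the actual Kimberlite definitions in reconfiguration.rs:

```rust
/// A cluster configuration: the set of replica IDs that vote.
/// (Hypothetical type; the real reconfiguration.rs may differ.)
#[derive(Debug, Clone, PartialEq)]
pub struct Configuration {
    pub replicas: Vec<u8>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ReconfigState {
    /// Normal operation: a single active configuration.
    Stable(Configuration),
    /// Joint consensus: both configurations are active and every
    /// commit needs a quorum in each (Invariant 1 below).
    Joint {
        old: Configuration,
        new: Configuration,
    },
}

impl ReconfigState {
    /// Only a Stable replica may accept a new reconfiguration command.
    pub fn can_start_reconfig(&self) -> bool {
        matches!(self, ReconfigState::Stable(_))
    }
}
```

Keeping Joint as a distinct variant (rather than a boolean flag) makes it impossible to forget which configuration pair is in flight.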
Reconfiguration Commands
Command Processing:
- Validation - Check command is safe (no duplicates, odd cluster size, etc.)
- Propose C_old,new - Create joint configuration, propose as Prepare
- Joint Consensus - Wait for commit with quorum in BOTH configs
- Automatic Transition - Once joint op committed, switch to C_new
- Stable - Resume normal operation with new configuration
Quorum Calculation
Key Insight: During joint consensus, the same set of acknowledgements must contain a quorum of C_old AND a quorum of C_new; the number of acks required is therefore at least MAX(Q_old, Q_new).
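As a sketch (assuming simple majority quorums; not the actual Kimberlite API), the joint-consensus check might look like:

```rust
/// Majority quorum for a configuration of `n` replicas: floor(n/2) + 1.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// During joint consensus, an operation commits only when it has been
/// acknowledged by a majority of C_old AND a majority of C_new.
/// (Sketch; `acks` is the set of replica IDs that sent PrepareOk.)
fn joint_quorum_reached(acks: &[u8], c_old: &[u8], c_new: &[u8]) -> bool {
    let in_old = acks.iter().filter(|&&r| c_old.contains(&r)).count();
    let in_new = acks.iter().filter(|&&r| c_new.contains(&r)).count();
    in_old >= quorum(c_old.len()) && in_new >= quorum(c_new.len())
}
```

For a 3 → 5 transition, acks from all three old replicas satisfy both quorums (3 ≥ 2 in C_old, 3 ≥ 3 in C_new), while acks from only two do not.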
Message Protocol Extensions
Reconfiguration Message
New message type for reconfiguration proposals:
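The original listing is not reproduced here; one possible shape, consistent with the command names used in the scenarios below (AddReplica, Replace), is:

```rust
/// Administrative reconfiguration commands sent to the leader.
/// (Sketch based on the command names used in this document;
/// the payload types are assumptions.)
#[derive(Debug, Clone, PartialEq)]
pub enum ReconfigCommand {
    /// Add a single replica to the cluster.
    AddReplica(u8),
    /// Remove a single replica from the cluster.
    RemoveReplica(u8),
    /// Atomically add and remove replicas in one transition.
    Replace { add: Vec<u8>, remove: Vec<u8> },
}
```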
Prepare Message Extension
Existing Prepare messages carry reconfiguration data:
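A sketch of the extension (field names are assumptions; the real message.rs will differ):

```rust
/// Sketch of a Prepare message extended with an optional
/// reconfiguration payload. All field names are assumptions.
#[derive(Debug, Clone)]
pub struct Prepare {
    pub view: u64,
    pub op_number: u64,
    pub commit_number: u64,
    /// Present only when this op proposes C_old,new.
    /// `(old, new)` are the replica sets of the two configurations.
    pub reconfig: Option<(Vec<u8>, Vec<u8>)>,
}
```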
Why extend Prepare? Reconfiguration is a special operation that goes through the normal Prepare → PrepareOK → Commit flow, ensuring it’s durably replicated before taking effect.
Integration with ReplicaState
State Extension
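The lost listing can be approximated as follows (a sketch; the real fields live in replica/state.rs and may be named differently):

```rust
/// Sketch: reconfiguration fields added to ReplicaState.
/// Field names are assumptions.
pub struct ReplicaState {
    pub view: u64,
    pub op_number: u64,
    pub commit_number: u64,
    /// None during Stable operation; Some((op, c_new)) while the
    /// joint configuration proposed at `op` awaits commit.
    pub reconfig: Option<(u64, Vec<u8>)>,
}
```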
Event Handling
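The event handling can be sketched as a commit-side check: when the committed op is the joint proposal, the replica switches to C_new (assumed names, not the actual Kimberlite handler):

```rust
/// Minimal sketch of the commit-side transition: once the joint
/// proposal commits, the replica adopts C_new. Names are assumptions.
pub struct Replica {
    pub replicas: Vec<u8>,               // current (old) configuration
    pub pending: Option<(u64, Vec<u8>)>, // (joint op number, C_new)
}

impl Replica {
    /// Called when `op` becomes committed.
    pub fn on_commit(&mut self, op: u64) {
        if let Some((joint_op, new)) = self.pending.take() {
            if op >= joint_op {
                // Once C_old,new commits, transition to C_new.
                self.replicas = new;
            } else {
                // Joint proposal not yet committed; keep waiting.
                self.pending = Some((joint_op, new));
            }
        }
    }
}
```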
View Change Integration
Critical Question: What happens if a view change occurs during reconfiguration?
Answer: Joint consensus persists across view changes.
Key Points:
- Joint consensus state is included in DoViewChange messages
- New leader inherits the reconfiguration state
- If C_old,new was proposed but not committed, new leader continues
- If C_old,new was committed, new leader completes transition to C_new
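The points above can be sketched as a DoViewChange message that carries the joint state along (field names are assumptions):

```rust
/// Sketch: DoViewChange extended to carry reconfiguration state,
/// so the new leader inherits an in-flight joint configuration.
/// All field names are assumptions.
#[derive(Debug, Clone)]
pub struct DoViewChange {
    pub view: u64,
    pub last_op: u64,
    pub commit_number: u64,
    /// Some((joint_op, c_new)) if this replica saw C_old,new proposed.
    pub reconfig_state: Option<(u64, Vec<u8>)>,
}
```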
Leader Election with Reconfiguration
Question: Which configuration determines the leader during joint consensus?
Answer: Use the OLD configuration for leader election during joint consensus.
Rationale:
- Ensures leader election remains stable during transition
- Avoids flip-flopping leadership if new replicas aren’t ready
- Once C_new is stable, leadership can rotate to include new replicas
Safety Invariants
Invariant 1: Single Reconfiguration
Property: At most one reconfiguration is in progress at any time.
Enforcement:
- on_reconfig_command() rejects new reconfigurations if not in Stable state
- Joint consensus must complete before a new reconfiguration can start
Invariant 2: Quorum Intersection
Property: Any two quorums (old, new, or joint) always intersect.
Proof:
- In Stable: standard majority quorum (⌊n/2⌋ + 1); any two majorities of the same set intersect
- In Joint: every quorum must contain a majority of C_old AND a majority of C_new
- Q_old ≥ ⌊|C_old|/2⌋ + 1
- Q_new ≥ ⌊|C_new|/2⌋ + 1
- A joint quorum therefore intersects every C_old quorum (within C_old) and every C_new quorum (within C_new), so no two quorums can commit conflicting operations
Invariant 3: Configuration Validity
Property: All configurations maintain odd cluster size and no duplicates.
Enforcement:
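The enforcement listing was not captured here; a sketch of the validation (odd cluster size, no duplicates; the function name and error type are assumptions):

```rust
use std::collections::HashSet;

/// Sketch of configuration validation enforcing Invariant 3:
/// odd cluster size and no duplicate replica IDs.
/// (Function name and error representation are assumptions.)
fn validate_config(replicas: &[u8]) -> Result<(), &'static str> {
    if replicas.is_empty() {
        return Err("configuration must not be empty");
    }
    if replicas.len() % 2 == 0 {
        return Err("cluster size must be odd");
    }
    let unique: HashSet<_> = replicas.iter().collect();
    if unique.len() != replicas.len() {
        return Err("duplicate replica id");
    }
    Ok(())
}
```

This is the check that rejects the even-sized {A, B, C, D} in Scenario 1 below.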
Invariant 4: Monotonic Progress
Property: Once C_old,new is committed, the system always transitions to C_new.
Enforcement:
- Commit handler automatically transitions to C_new when joint op committed
- Recovery and view change preserve joint state until transition completes
Timeout Handling
Reconfiguration Timeout
New timeout type for detecting stuck reconfigurations:
Behavior:
- If joint consensus doesn’t complete within timeout, abort and revert to C_old
- Leader retries reconfiguration proposal
- After max retries, give up and require manual intervention
Configuration:
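As a sketch, the configuration might expose settings like these (the names and default values are illustrative assumptions, not Kimberlite's actual defaults):

```rust
use std::time::Duration;

/// Sketch of reconfiguration timeout settings.
/// Names and default values are illustrative assumptions.
pub struct ReconfigTimeoutConfig {
    /// How long the leader waits for joint consensus to commit
    /// before aborting and reverting to C_old.
    pub joint_commit_timeout: Duration,
    /// How many times the leader retries the proposal before
    /// giving up and requiring manual intervention.
    pub max_retries: u32,
}

impl Default for ReconfigTimeoutConfig {
    fn default() -> Self {
        Self {
            joint_commit_timeout: Duration::from_secs(30),
            max_retries: 3,
        }
    }
}
```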
Example Scenarios
Scenario 1: Add Replica (3 → 5 nodes)
Initial: C_old = {A, B, C} (quorum = 2)
Steps:
- Admin sends ReconfigCommand::AddReplica(D) to leader A
- A validates: C_new = {A, B, C, D} would be even → REJECT
- Admin sends ReconfigCommand::Replace { add: [D, E], remove: [] }
- A validates: C_new = {A, B, C, D, E} (quorum = 3) → OK
- A proposes C_old,new at op 100
- Joint consensus: Need 2 acks from {A,B,C} AND 3 acks from {A,B,C,D,E}
- Effectively need 3 acks from {A,B,C} since D,E might not be ready
- Once op 100 committed, automatically transition to C_new
- New stable state: C_new = {A, B, C, D, E}
Scenario 2: Remove Replica (5 → 3 nodes)
Initial: C_old = {A, B, C, D, E} (quorum = 3)
Steps:
- Admin sends ReconfigCommand::Replace { add: [], remove: [D, E] }
- Leader proposes C_old,new = {A,B,C,D,E} → {A,B,C}
- Joint consensus: Need 3 acks from {A,B,C,D,E} AND 2 acks from {A,B,C}
- Once committed, transition to C_new = {A,B,C}
- D and E can be decommissioned safely
Scenario 3: View Change During Reconfiguration
Initial: C_old = {A, B, C}, leader A proposes C_old,new to add D,E
Failure: Leader A crashes before C_old,new commits
Recovery:
- B starts view change (view 1)
- B collects DoViewChange messages from C_old (need quorum of 2)
- DoViewChange messages include reconfig_state = Joint
- B becomes new leader in view 1, inherits Joint state
- B re-proposes C_old,new at next op
- Joint consensus continues with B as leader
- Once committed, transition to C_new
Key: Reconfiguration survives view changes.
Implementation Plan
Phase 4.1: Core Reconfiguration (~600 LOC)
- reconfiguration.rs - State machine, validation, quorum calculation
- message.rs - Extend Prepare with reconfig field
- replica/state.rs - Add reconfig_state field, integrate quorum calculation
Phase 4.2: Command Processing (~400 LOC)
- replica/normal.rs - Implement on_reconfig_command()
- replica/view_change.rs - Extend DoViewChange with reconfig state
- config.rs - Add reconfig validation helpers
Phase 4.3: Testing (~500 LOC)
- Unit tests - State transitions, quorum calculation, validation
- Integration tests - Add/remove replica scenarios
- VOPR scenarios - Reconfiguration under faults
Open Questions
Q1: Should we support removing the current leader?
Answer: YES, but with automatic leader transfer.
Approach:
- If RemoveReplica(leader), leader initiates view change before proposing
- New leader (not being removed) completes reconfiguration
Q2: How to handle new replicas that are far behind?
Answer: Two-phase approach:
- Phase 1: Catch-up - New replica added as “standby” (read-only)
- Phase 2: Promotion - Once caught up, promote to full voting member
Extension: Add StandbyReplica state (implemented separately in standby.rs)
Q3: Can we abort a reconfiguration mid-flight?
Answer: Only if C_old,new hasn’t been committed yet.
Approach:
- Before commit: Leader can abort, revert to C_old
- After commit: No abort, must complete transition to C_new
Implementation: Add ReconfigCommand::Abort (future work)
Comparison with TigerBeetle
According to the project plan, TigerBeetle has “partial support” for reconfiguration. Kimberlite’s full implementation will provide:
- Complete joint consensus - Full Raft-style safety
- Integration with VSR - View changes preserve reconfiguration
- Comprehensive testing - VOPR scenarios for all edge cases
References
- Raft Paper - “In Search of an Understandable Consensus Algorithm” (Section 6: Cluster membership changes)
- VRR Paper - “Viewstamped Replication Revisited” (Liskov, Cowling)
- TigerBeetle Documentation - Cluster reconfiguration notes
- Kimberlite VSR Implementation - Existing view change protocol
Status: Design Complete
Next Step: Implementation (Task #2 - Implement reconfiguration state machine)