Cluster Reconfiguration Architecture Design | Kimberlite Docs

On this page

Overview
Goals
Non-Goals (Future Work)
Background: Joint Consensus
Why Joint Consensus?
Raft’s Joint Consensus Solution
Kimberlite VSR Adaptation
State Machine
Reconfiguration Commands
Quorum Calculation
Message Protocol Extensions
Integration with ReplicaState
View Change Integration
Leader Election with Reconfiguration
Safety Invariants
Invariant 1: Single Reconfiguration
Invariant 2: Quorum Intersection
Invariant 3: Configuration Validity
Invariant 4: Monotonic Progress
Timeout Handling
Reconfiguration Timeout
Example Scenarios
Scenario 1: Add Replica (3 → 5 nodes)
Scenario 2: Remove Replica (5 → 3 nodes)
Scenario 3: View Change During Reconfiguration
Implementation Plan
Phase 4.1: Core Reconfiguration (~600 LOC)
Phase 4.2: Command Processing (~400 LOC)
Phase 4.3: Testing (~500 LOC)
Open Questions
Q1: Should we support removing the current leader?
Q2: How to handle new replicas that are far behind?
Q3: Can we abort a reconfiguration mid-flight?
Comparison with TigerBeetle
References

Date: 2026-02-05 Status: Design Phase Target: Phase 4 - Cluster Operations

Overview

This document specifies the architecture for VSR cluster reconfiguration, enabling zero-downtime addition and removal of replicas. The design is based on Raft’s joint consensus algorithm, adapted for Viewstamped Replication.

Goals

Zero-Downtime Reconfiguration - Add/remove replicas without service interruption
Safety Preservation - Never violate VSR safety guarantees during transitions
Progress Guarantee - Reconfigurations always complete or safely abort
Simple API - Easy-to-use reconfiguration commands

Non-Goals (Future Work)

Simultaneous multiple reconfigurations (only one at a time)
Automatic cluster scaling based on load
Cross-datacenter replication topology
Dynamic leader rebalancing

Background: Joint Consensus

Why Joint Consensus?

Naive approach (direct switch from Config_old to Config_new) is unsafe:

During transition, two disjoint quorums can form
Split-brain scenario: both can commit conflicting operations
Example: 3-node cluster → 5-node cluster
- Old quorum: 2 of {A,B,C}
- New quorum: 3 of {A,B,C,D,E}
- If only A,B,C are online, they form old quorum
- If D,E come online before A,B,C update, they can’t form new quorum
- But if some nodes use old config and others use new, split-brain!

Raft’s Joint Consensus Solution

Three-state transition:

C_old (Stable) - All nodes use old configuration
C_old,new (Joint) - Nodes use BOTH configurations, require quorum in BOTH
C_new (Stable) - All nodes use new configuration

Key Invariant: During joint consensus, operations require quorum in BOTH old and new configurations. This prevents split-brain because no disjoint quorums can form.

Transition Protocol:

C_old  --(propose C_old,new)-->  C_old,new  --(commit C_old,new)-->  C_new
                                    ▲
                                    │
                            Quorum in BOTH configs

Leader in C_old proposes C_old,new
C_old,new gets committed (requires quorum in C_old AND C_new)
Once C_old,new committed, automatically transition to C_new
C_new becomes the new stable configuration

Kimberlite VSR Adaptation

State Machine

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ReconfigState {
    /// Stable state with single configuration
    Stable {
        config: ClusterConfig,
    },

    /// Joint consensus state with two configurations
    Joint {
        old_config: ClusterConfig,
        new_config: ClusterConfig,
        /// OpNumber where C_old,new was committed
        joint_op: OpNumber,
    },
}

States:

Stable: Normal operation, single configuration
Joint: Temporary transition state, dual configurations

Reconfiguration Commands

#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
pub enum ReconfigCommand {
    /// Add a new replica to the cluster
    AddReplica(ReplicaId),

    /// Remove a replica from the cluster
    RemoveReplica(ReplicaId),

    /// Replace multiple replicas (add + remove atomically)
    Replace {
        add: Vec<ReplicaId>,
        remove: Vec<ReplicaId>,
    },
}

Command Processing:

Validation - Check command is safe (no duplicates, odd cluster size, etc.)
Propose C_old,new - Create joint configuration, propose as Prepare
Joint Consensus - Wait for commit with quorum in BOTH configs
Automatic Transition - Once joint op committed, switch to C_new
Stable - Resume normal operation with new configuration

Quorum Calculation

impl ReconfigState {
    /// Calculates quorum for the current state
    pub fn quorum_size(&self) -> usize {
        match self {
            ReconfigState::Stable { config } => config.quorum_size(),
            ReconfigState::Joint { old_config, new_config, .. } => {
                // Joint consensus: quorum in BOTH configs
                std::cmp::max(old_config.quorum_size(), new_config.quorum_size())
            }
        }
    }

    /// Checks if a set of replicas forms a quorum
    pub fn has_quorum(&self, replicas: &[ReplicaId]) -> bool {
        match self {
            ReconfigState::Stable { config } => {
                let count = replicas.iter().filter(|r| config.contains(**r)).count();
                count >= config.quorum_size()
            }
            ReconfigState::Joint { old_config, new_config, .. } => {
                let old_count = replicas.iter().filter(|r| old_config.contains(**r)).count();
                let new_count = replicas.iter().filter(|r| new_config.contains(**r)).count();

                old_count >= old_config.quorum_size() &&
                new_count >= new_config.quorum_size()
            }
        }
    }
}

Key Insight: Joint consensus quorum = MAX(Q_old, Q_new), and BOTH must be satisfied.

Message Protocol Extensions

Reconfiguration Message

New message type for reconfiguration proposals:

pub enum MessageKind {
    // ... existing messages ...

    /// Reconfiguration proposal (only from leader)
    Reconfiguration,
}

pub struct ReconfigurationMessage {
    pub view: ViewNumber,
    pub op: OpNumber,
    pub command: ReconfigCommand,
    pub old_config: ClusterConfig,
    pub new_config: ClusterConfig,
}

Prepare Message Extension

Existing Prepare messages carry reconfiguration data:

pub struct PrepareMessage {
    pub view: ViewNumber,
    pub op: OpNumber,
    pub command: Command,

    // NEW: Optional reconfiguration
    pub reconfig: Option<ReconfigCommand>,
}

Why extend Prepare? Reconfiguration is a special operation that goes through the normal Prepare → PrepareOK → Commit flow, ensuring it’s durably replicated before taking effect.

Integration with ReplicaState

State Extension

pub struct ReplicaState {
    // ... existing fields ...

    /// Current reconfiguration state
    pub(crate) reconfig_state: ReconfigState,

    /// Pending reconfiguration (if any)
    pub(crate) pending_reconfig: Option<ReconfigCommand>,
}

Event Handling

pub enum ReplicaEvent {
    // ... existing events ...

    /// Reconfiguration command from client/admin
    ReconfigCommand(ReconfigCommand),
}

impl ReplicaState {
    fn on_reconfig_command(mut self, cmd: ReconfigCommand) -> (Self, ReplicaOutput) {
        // Only leader can process reconfigurations
        if !self.is_leader() {
            return (self, ReplicaOutput::error("not leader"));
        }

        // Only one reconfiguration at a time
        if !matches!(self.reconfig_state, ReconfigState::Stable { .. }) {
            return (self, ReplicaOutput::error("reconfiguration in progress"));
        }

        // Validate command
        let new_config = match self.validate_reconfig(&cmd) {
            Ok(cfg) => cfg,
            Err(e) => return (self, ReplicaOutput::error(e)),
        };

        // Transition to joint consensus
        let old_config = self.config.clone();
        self.reconfig_state = ReconfigState::Joint {
            old_config: old_config.clone(),
            new_config: new_config.clone(),
            joint_op: self.op_number.next(),
        };

        // Propose C_old,new as Prepare
        self.propose_with_reconfig(cmd)
    }
}

View Change Integration

Critical Question: What happens if a view change occurs during reconfiguration?

Answer: Joint consensus persists across view changes.

impl ReplicaState {
    fn on_do_view_change(&mut self, msg: DoViewChangeMessage) {
        // ... existing view change logic ...

        // Preserve reconfiguration state across view changes
        // The new leader inherits the joint consensus state
        if let Some(reconfig) = msg.reconfig_state {
            self.reconfig_state = reconfig;
        }
    }
}

Key Points:

Joint consensus state is included in DoViewChange messages
New leader inherits the reconfiguration state
If C_old,new was proposed but not committed, new leader continues
If C_old,new was committed, new leader completes transition to C_new

Leader Election with Reconfiguration

Question: Which configuration determines the leader during joint consensus?

Answer: Use the OLD configuration for leader election during joint consensus.

impl ReconfigState {
    /// Returns the configuration to use for leader election
    pub fn leader_config(&self) -> &ClusterConfig {
        match self {
            ReconfigState::Stable { config } => config,
            ReconfigState::Joint { old_config, .. } => old_config,
        }
    }
}

Rationale:

Ensures leader election remains stable during transition
Avoids flip-flopping leadership if new replicas aren’t ready
Once C_new is stable, leadership can rotate to include new replicas

Safety Invariants

Invariant 1: Single Reconfiguration

Property: At most one reconfiguration is in progress at any time.

Enforcement:

on_reconfig_command() rejects new reconfigurations if not in Stable state
Joint consensus must complete before new reconfiguration can start

Invariant 2: Quorum Intersection

Property: Any two quorums (old, new, or joint) always intersect.

Proof:

In Stable: Standard quorum majority (n/2 + 1)
In Joint: Requires quorum in BOTH old and new
- Q_old ≥ |C_old|/2 + 1
- Q_new ≥ |C_new|/2 + 1
- Any Q_old and Q_new must intersect within C_old ∩ C_new

Invariant 3: Configuration Validity

Property: All configurations maintain odd cluster size and no duplicates.

Enforcement:

fn validate_reconfig(&self, cmd: &ReconfigCommand) -> Result<ClusterConfig> {
    let new_replicas = match cmd {
        ReconfigCommand::AddReplica(id) => {
            let mut replicas: Vec<_> = self.config.replicas().collect();
            if replicas.contains(id) {
                return Err("replica already in cluster");
            }
            replicas.push(*id);
            replicas
        }
        ReconfigCommand::RemoveReplica(id) => {
            let mut replicas: Vec<_> = self.config.replicas().collect();
            if !replicas.contains(id) {
                return Err("replica not in cluster");
            }
            replicas.retain(|r| r != id);
            replicas
        }
        // ... Replace case ...
    };

    // Validate odd size
    if new_replicas.len() % 2 == 0 {
        return Err("cluster size must be odd");
    }

    Ok(ClusterConfig::new(new_replicas))
}

Invariant 4: Monotonic Progress

Property: Once C_old,new is committed, the system always transitions to C_new.

Enforcement:

Commit handler automatically transitions to C_new when joint op committed
Recovery and view change preserve joint state until transition completes

Timeout Handling

Reconfiguration Timeout

New timeout type for detecting stuck reconfigurations:

pub enum TimeoutKind {
    // ... existing timeouts ...

    /// Reconfiguration timeout (detects stuck reconfigurations)
    ReconfigTimeout,
}

Behavior:

If joint consensus doesn’t complete within timeout, abort and revert to C_old
Leader retries reconfiguration proposal
After max retries, give up and require manual intervention

Configuration:

pub struct TimeoutConfig {
    // ... existing timeouts ...

    /// Time to wait for reconfiguration to complete
    pub reconfig_timeout: Duration,  // Default: 30 seconds
}

Example Scenarios

Scenario 1: Add Replica (3 → 5 nodes)

Initial: C_old = {A, B, C} (quorum = 2)

Steps:

Admin sends ReconfigCommand::AddReplica(D) to leader A
A validates: C_new = {A, B, C, D, E} would be even → REJECT
Admin sends ReconfigCommand::Replace { add: [D, E], remove: [] }
A validates: C_new = {A, B, C, D, E} (quorum = 3) → OK
A proposes C_old,new at op 100
Joint consensus: Need 2 acks from {A,B,C} AND 3 acks from {A,B,C,D,E}
- Effectively need 3 acks from {A,B,C} since D,E might not be ready
Once op 100 committed, automatically transition to C_new
New stable state: C_new = {A, B, C, D, E}

Scenario 2: Remove Replica (5 → 3 nodes)

Initial: C_old = {A, B, C, D, E} (quorum = 3)

Steps:

Admin sends ReconfigCommand::Replace { add: [], remove: [D, E] }
Leader proposes C_old,new = {A,B,C,D,E} → {A,B,C}
Joint consensus: Need 3 acks from {A,B,C,D,E} AND 2 acks from {A,B,C}
Once committed, transition to C_new = {A,B,C}
D and E can be decommissioned safely

Scenario 3: View Change During Reconfiguration

Initial: C_old = {A, B, C}, leader A proposes C_old,new to add D,E

Failure: Leader A crashes before C_old,new commits

Recovery:

B starts view change (view 1)
B collects DoViewChange messages from C_old (need quorum of 2)
DoViewChange messages include reconfig_state = Joint
B becomes new leader in view 1, inherits Joint state
B re-proposes C_old,new at next op
Joint consensus continues with B as leader
Once committed, transition to C_new

Key: Reconfiguration survives view changes.

Implementation Plan

Phase 4.1: Core Reconfiguration (~600 LOC)

reconfiguration.rs - State machine, validation, quorum calculation
message.rs - Extend Prepare with reconfig field
replica/state.rs - Add reconfig_state field, integrate quorum calculation

Phase 4.2: Command Processing (~400 LOC)

replica/normal.rs - Implement on_reconfig_command()
replica/view_change.rs - Extend DoViewChange with reconfig state
config.rs - Add reconfig validation helpers

Phase 4.3: Testing (~500 LOC)

Unit tests - State transitions, quorum calculation, validation
Integration tests - Add/remove replica scenarios
VOPR scenarios - Reconfiguration under faults

Open Questions

Q1: Should we support removing the current leader?

Answer: YES, but with automatic leader transfer.

Approach:

If RemoveReplica(leader), leader initiates view change before proposing
New leader (not being removed) completes reconfiguration

Q2: How to handle new replicas that are far behind?

Answer: Two-phase approach:

Phase 1: Catch-up - New replica added as “standby” (read-only)
Phase 2: Promotion - Once caught up, promote to full voting member

Extension: Add StandbyReplica state (implemented separately in standby.rs)

Q3: Can we abort a reconfiguration mid-flight?

Answer: Only if C_old,new hasn’t been committed yet.

Approach:

Before commit: Leader can abort, revert to C_old
After commit: No abort, must complete transition to C_new

Implementation: Add ReconfigCommand::Abort (future work)

Comparison with TigerBeetle

According to the plan, TigerBeetle has “ Partial support” for reconfiguration. Kimberlite’s full implementation will provide:

Complete joint consensus - Full Raft-style safety
Integration with VSR - View changes preserve reconfiguration
Comprehensive testing - VOPR scenarios for all edge cases

References

Raft Paper - “In Search of an Understandable Consensus Algorithm” (Section 6: Cluster membership changes)
VRR Paper - “Viewstamped Replication Revisited” (Lamport, Schneider)
TigerBeetle Documentation - Cluster reconfiguration notes
Kimberlite VSR Implementation - Existing view change protocol

Status: Design Complete Next Step: Implementation (Task #2 - Implement reconfiguration state machine)