Standby Replicas (Phase 4.3)
Status: Implemented (v0.4.0)
Formal Verification: 2 Kani proofs, 3 VOPR scenarios
Purpose: Disaster recovery and read scaling without affecting quorum
Table of Contents
- Overview
- Architecture
- Use Cases
- Promotion Procedures
- Configuration
- Monitoring
- Formal Verification
- VOPR Scenarios
Overview
Standby replicas are read-only followers that receive log updates from active replicas but do NOT participate in quorum decisions. This provides:
- Geographic redundancy (disaster recovery) without affecting quorum size
- Read scaling (offload queries) without adding active replicas
- Testing/staging environments that mirror production safely
Key Properties
- Quorum independence: Standby replicas are NOT counted in quorum calculations
- Eventually consistent reads: Standby replicas may lag behind committed operations
- Safe promotion: Standby replicas can be promoted to active status via cluster reconfiguration
- Zero overhead: Active cluster performance is unaffected by standby replicas
Formal Guarantees
- Kani Proof #68: Standby replicas NEVER send `PrepareOK` messages (no quorum participation)
- Kani Proof #69: Promotion to active status preserves log consistency (no divergence)
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Active Cluster (3 replicas) │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Primary │ │ Backup₁ │ │ Backup₂ │ │
│ │ (R0) │ │ (R1) │ │ (R2) │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ │◄─── Quorum ────►│◄─── Quorum ────►│ │
│ │ (2/3) │ (2/3) │ │
└───────┼─────────────────┼─────────────────┼──────────────────────┘
│ │ │
│ │ │
│ Prepare │ Prepare │ Prepare
│ Commit │ Commit │ Commit
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Standby │ │ Standby │ │ Standby │
│ (S0) │ │ (S1) │ │ (S2) │
│ DR │ │ Reads │ │ Reads │
└─────────┘ └─────────┘ └─────────┘
│ │ │
│ │ │
▼ ▼ ▼
Read-only Read-only Read-only
Queries Queries Queries
(eventual (eventual (eventual
consistency) consistency) consistency)
NOT counted in quorum (Kani Proof #68)
Message Flow
- Active Cluster: Primary replicates operations to backups
- Prepare Messages: Standby replicas receive `Prepare` messages and append to log
- NO PrepareOK: Standby replicas do NOT send `PrepareOK` (Kani Proof #68)
- Commit Messages: Standby replicas receive `Commit` messages and update commit number
- Read Queries: Standby replicas serve eventually consistent reads from their log
Use Cases
1. Disaster Recovery (Geographic Redundancy)
Problem: Need geographic redundancy without requiring cross-datacenter quorum (high latency).
Solution: Deploy standby replicas in remote datacenter.
Primary Datacenter (US East) Disaster Recovery (US West)
┌────────────────────────┐ ┌────────────────────────┐
│ Active Cluster (3x) │ │ Standby (1-2x) │
│ - Quorum = 2/3 │────────►│ - NOT in quorum │
│ - Low latency │ Async │ - High availability │
└────────────────────────┘ └────────────────────────┘
If US East fails:
1. Promote US West standby to active (manual or automatic)
2. Cluster reconfiguration (joint consensus)
3. US West becomes new active cluster
Benefits:
- Geographic redundancy without quorum latency
- Promotion time: ~seconds (reconfiguration only)
- Data loss: minimal (standby lags by ~milliseconds in normal operation)
2. Read Scaling (Offload Queries)
Problem: High read workload affecting active cluster performance.
Solution: Route read queries to standby replicas.
┌────────────┐
│ Primary │ (Writes)
│ (Active) │
└──────┬─────┘
│
┌──────┴──────┐
│ │
┌───▼───┐ ┌────▼───┐
│Backup1│ │Backup2 │ (Writes)
│Active │ │Active │
└───────┘ └────────┘
│
┌──────┴──────┬────────────────┬────────────────┐
│ │ │ │
┌───▼───┐ ┌────▼───┐ ┌─────▼───┐ ┌────▼────┐
│Standby│ │Standby │ │ Standby │ │ Standby │
│ (S0) │ │ (S1) │ │ (S2) │ │ (S3) │
└───┬───┘ └───┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
▼ ▼ ▼ ▼
Read Read Read Read
Queries Queries Queries Queries
Load balancer distributes read traffic across standby replicas
Benefits:
- Horizontal read scaling (add standby replicas as needed)
- Active cluster unaffected (no quorum overhead)
- Eventually consistent reads (acceptable for analytics, dashboards)
3. Testing/Staging Environments
Problem: Need production-like data for testing without affecting production cluster.
Solution: Standby replica mirrors production data safely.
Production Cluster (Active) Staging Environment (Standby)
┌────────────────────────┐ ┌────────────────────────┐
│ Live customer traffic │────────►│ Testing/QA workload │
│ - Strict SLAs │ Async │ - No production impact │
│ - Compliance │ │ - Safe experimentation │
└────────────────────────┘ └────────────────────────┘
Benefits:
- Realistic testing data (mirrors production)
- Zero production impact (reads from standby)
- Safe for destructive tests (standby can be reset)
Promotion Procedures
Standby replicas can be promoted to active status when needed (e.g., disaster recovery, scaling active cluster).
Requirements for Promotion
- Log consistency: Standby log must match active primary (no divergence)
- Promotion eligibility: Standby must be marked `promotion_eligible` (no gaps detected)
- Cluster reconfiguration: Joint consensus to add standby to active config
Automatic Promotion (Disaster Recovery)
When an active replica fails and standby is up-to-date:
// 1. Detect active replica failure
if active_replica_down && standby.promotion_eligible {
    // 2. Promote the standby via cluster reconfiguration (joint consensus);
    //    the exact promotion call is elided here.
}
Manual Promotion (Operator-Initiated)
For planned operations (e.g., scaling active cluster):
# 1. Verify standby is up-to-date
# 2. Initiate promotion
# 3. Wait for joint consensus to complete
# 4. Verify promotion succeeded
# Expected: status=Normal (was Standby)
Promotion Safety (Kani Proof #69)
Property: Promotion preserves log consistency.
Verification:
- Standby log must be ⊆ active primary log (no divergence)
- Promotion only succeeds if `promotion_eligible = true`
- Gaps in log automatically mark standby as ineligible
Configuration
Creating a Standby Replica
// Create standby replica (not part of active cluster config).
// NOTE: the import path and some type names were elided in the source;
// `ReplicaId`, `Replica`, and `cluster_config` are reconstructed from context.
let replica_id = ReplicaId::from_raw(100); // Standby IDs typically 100+
let mut standby = Replica::new_standby(replica_id, cluster_config);

// Verify status
assert!(standby.status == ReplicaStatus::Standby);
assert!(standby.standby_state.is_some());
Cluster Configuration
# kimberlite.toml
[cluster]
# Active replicas (participate in quorum)
replicas = ["R0", "R1", "R2"]
[standby]
# Standby replicas (read-only, NOT in quorum)
replicas = ["S0", "S1", "S2"]
# Promotion settings
auto_promote = true # Automatic promotion on active replica failure
max_lag_ms = 100 # Max acceptable lag before marking ineligible
Network Configuration
Standby replicas receive messages from active cluster:
firewall rules:
- Allow: Active → Standby (Prepare, Commit, Heartbeat)
- Deny: Standby → Active (PrepareOK) [never sent anyway]
- Allow: Clients → Standby (Read queries)
Monitoring
Key Metrics
Replication Lag:
// How far behind standby is from active cluster
let lag_ops = active_cluster.op_number - standby.op_number;
let lag_commits = active_cluster.commit_number - standby.commit_number;
Promotion Eligibility:
// Is standby safe to promote?
let eligible = standby.standby_state
    .map(|state| state.promotion_eligible) // closure reconstructed from context
    .unwrap_or(false);
Read Query Latency:
// p50, p99, p999 latencies for standby reads
// (eventual consistency may show higher variance)
Prometheus Metrics
# Replication lag (operations)
kimberlite_standby_lag_ops{replica_id="S0"} 5
# Replication lag (milliseconds)
kimberlite_standby_lag_ms{replica_id="S0"} 12
# Promotion eligibility
kimberlite_standby_promotion_eligible{replica_id="S0"} 1
# Read queries served
kimberlite_standby_queries_total{replica_id="S0"} 1234567
# Read query latency (p99)
kimberlite_standby_query_latency_p99{replica_id="S0"} 0.003
Alerting Rules
# Alert if standby lag exceeds threshold
- alert: StandbyLagHigh
expr: kimberlite_standby_lag_ms > 1000
for: 5m
annotations:
summary: "Standby {{$labels.replica_id}} lag exceeds 1s"
# Alert if standby becomes ineligible for promotion
- alert: StandbyPromotionIneligible
expr: kimberlite_standby_promotion_eligible == 0
annotations:
summary: "Standby {{$labels.replica_id}} cannot be promoted (log diverged)"
Formal Verification
Kani Proof #68: Standby Never Participates in Quorum
Property: Standby replicas NEVER send PrepareOK messages.
Why Critical: If standbys sent PrepareOK, they could affect quorum decisions, violating the safety property that only active replicas participate in consensus.
Verification:
Result: Verified (no counterexamples found)
Kani Proof #69: Promotion Preserves Log Consistency
Property: Promoting a standby to active preserves log consistency.
Why Critical: Promotion must not introduce divergent log entries that could violate consensus safety.
Verification:
Result: Verified (promotion safety guaranteed)
VOPR Scenarios
Scenario 1: Standby Follows Log
File: crates/kimberlite-sim/src/scenarios.rs → ScenarioType::StandbyFollowsLog
Tests:
- Standby receives `Prepare` messages from active replicas
- Standby appends entries to log but does NOT send `PrepareOK`
- Standby tracks `commit_number` from `Commit` messages
- Standby never affects quorum decisions
Configuration:
- Network: 2% packet loss (normal conditions)
- Fault injection: None (baseline behavior)
- Duration: 25 seconds, 12K events
Expected Invariants:
- Standby log ⊆ active primary log (no divergence)
- Standby sends ZERO `PrepareOK` messages
- Active cluster commits succeed without standby participation
Scenario 2: Standby Promotion
File: crates/kimberlite-sim/src/scenarios.rs → ScenarioType::StandbyPromotion
Tests:
- Standby is up-to-date (log matches active primary)
- Promotion via cluster reconfiguration (joint consensus)
- Promoted replica begins participating in quorum
- Log consistency preserved (Kani Proof #69)
Configuration:
- Network: 0% packet loss (clean promotion)
- Fault injection: None (planned operation)
- Duration: 30 seconds, 15K events
Expected Invariants:
- Promotion succeeds ONLY if `promotion_eligible = true`
- After promotion: status = Normal, standby_state = None
- Post-promotion commits include promoted replica in quorum
Scenario 3: Read Scaling
File: crates/kimberlite-sim/src/scenarios.rs → ScenarioType::StandbyReadScaling
Tests:
- Multiple standby replicas serve read queries
- Reads are eventually consistent (may lag)
- No impact on active cluster performance
- Load distributed across standbys
Configuration:
- Network: 3% packet loss (cross-datacenter)
- Fault injection: Gray failures (slow reads)
- Tenants: 3 (multi-tenant read workload)
- Duration: 40 seconds, 20K events
Expected Invariants:
- Read query throughput scales with standby count
- Active cluster commit latency unaffected (< 2% overhead)
- Standby reads serve stale data (eventual consistency)
Troubleshooting
Problem: Standby Lag Increasing
Symptoms:
- `kimberlite_standby_lag_ms` > 1000ms
- Standby falls behind active cluster
Causes:
- Network congestion: Slow link between active and standby
- Disk I/O: Standby storage too slow to keep up
- CPU contention: Standby overloaded with read queries
Resolution:
# 1. Check network latency
# 2. Check disk I/O
# 3. Reduce read query load (route to other standbys)
# 4. If persistent, promote to active (if eligible)
Problem: Promotion Ineligible
Symptoms:
- `kimberlite_standby_promotion_eligible` = 0
- Promotion fails with "log diverged" error
Causes:
- Gap detected: Standby missed `Prepare` messages (packet loss)
- Checksum mismatch: Corruption detected in standby log
Resolution:
# 1. Check standby log integrity
# 2. Repair log (fetch missing entries)
# 3. Verify promotion eligibility restored
# Expected: promotion_eligible=true
Best Practices
- Deploy standbys in pairs (2+ per region) for redundancy
- Monitor replication lag continuously (< 100ms ideal)
- Test promotion regularly (quarterly DR drills)
- Route read queries via load balancer (distribute load)
- Isolate standby I/O (dedicated storage, separate from active cluster)
References
- Architecture: `crates/kimberlite-vsr/src/replica/standby.rs` (250 LOC)
- Kani Proofs: `crates/kimberlite-vsr/src/replica/standby.rs` (Proof #68, #69)
- VOPR Scenarios: `crates/kimberlite-sim/src/scenarios.rs` (StandbyFollowsLog, StandbyPromotion, StandbyReadScaling)
- Type Definitions: `crates/kimberlite-vsr/src/types.rs` (ReplicaStatus::Standby)
- State Management: `crates/kimberlite-vsr/src/replica/state.rs` (standby_state field)
See also:
- Cluster Reconfiguration - How promotion uses joint consensus
- Rolling Upgrades - Version compatibility for standbys
- VSR Protocol - Core replication protocol