← Back to Blog

Hardening Kimberlite VSR: 18 Bugs, 12 Invariants, and Lessons from Byzantine Testing

The Challenge

Over the past weeks, we’ve undertaken a comprehensive hardening initiative for Kimberlite’s VSR (Viewstamped Replication) implementation. What started as a plan to promote production assertions evolved into fixing 18 critical bugs, implementing 12 new invariant checkers, and fundamentally changing how we test Byzantine fault tolerance.

This post shares the lessons we learned and the sophisticated bugs we discovered.

The Critical Insight: Protocol-Level vs State-Level Testing

Our biggest breakthrough came from realizing our Byzantine tests were fundamentally flawed.

The Problem: Our initial VOPR tests corrupted replica internal state BEFORE message creation:

// WRONG: Corrupt state before creating messages
if byzantine_injector.should_inflate_commit(&mut rng) {
    replica_state.commit_number = commit_number + 500;  // Direct state corruption
}
// Messages created with inflated commit_number
// But VSR handlers NEVER validated the Byzantine input!

This approach bypassed all our protocol validation code. The fixes we implemented in on_do_view_change() and on_start_view() were never actually tested!

The Solution: Intercept VSR messages AFTER creation and apply Byzantine mutations at the protocol level:

// CORRECT: Mutate messages at protocol level
VSR Replica → ReplicaOutput(messages)[MessageMutator] → SimNetwork → Delivery

Now our tests actually exercise the Byzantine rejection logic in our handlers.

Impact

This architectural change revealed 5 critical vulnerabilities that state-corruption testing completely missed:

  1. DoViewChange log_tail length mismatch (Bug 3.1) - Byzantine replica could claim one thing and send another
  2. Non-deterministic log selection (Bug 3.3) - Could cause replicas to diverge on identical claims
  3. DoS via oversized StartView (Bug 3.4) - Memory exhaustion attack
  4. Invalid repair ranges (Bug 3.5) - Confusion attacks
  5. Kernel command errors stalling replicas (Bug 3.2) - Byzantine leader could halt the cluster

The Most Subtle Bug: Non-Deterministic Tie-Breaking

Bug 3.3 was particularly insidious. When multiple DoViewChange messages had identical (last_normal_view, op_number), our code selected one non-deterministically:

// WRONG: Non-deterministic!
let best_dvc = self.do_view_change_msgs.iter()
    .max_by(|a, b| {
        (a.last_normal_view, a.op_number).cmp(&(b.last_normal_view, b.op_number))
    })
    .expect("at least quorum messages");

If two messages tied, .max_by() could pick either one depending on iteration order. Different replicas might pick different logs, leading to divergence.

The Fix: Deterministic tie-breaking using entry checksums, then replica ID:

// CORRECT: Fully deterministic
.max_by(|a, b| {
    let primary_cmp = (a.last_normal_view, a.op_number)
        .cmp(&(b.last_normal_view, b.op_number));

    if primary_cmp != std::cmp::Ordering::Equal {
        return primary_cmp;
    }

    // Tie-breaker 1: Checksum (deterministic)
    let a_checksum = a.log_tail.last().map_or(0, |e| e.checksum);
    let b_checksum = b.log_tail.last().map_or(0, |e| e.checksum);
    let checksum_cmp = a_checksum.cmp(&b_checksum);

    if checksum_cmp != std::cmp::Ordering::Equal {
        return checksum_cmp;
    }

    // Tie-breaker 2: Replica ID (final fallback)
    a.replica.as_u8().cmp(&b.replica.as_u8())
})

This bug would have been nearly impossible to catch in production - it only manifests under specific network partition scenarios.

Production Assertions: The Foundation

We promoted 38 debug_assert!() calls to production assert!():

  • Crypto (25): All-zero detection, key hierarchy integrity, ciphertext validation
  • VSR (9): Leader-only operations, view monotonicity, commit ordering
  • Kernel (4): Stream existence, effect counts, offset monotonicity

Why This Matters

These assertions detect corruption, Byzantine attacks, and state machine bugs BEFORE they propagate:

// Phase 1: Promoted assertion
assert!(
    !key.0.iter().all(|&b| b == 0),
    "encryption key is all zeros - RNG failure or memory corruption"
);

If this fires in production, we know immediately there’s either:

  1. Storage corruption
  2. A Byzantine attack
  3. RNG failure
  4. A critical bug

Each of the 38 assertions has a corresponding #[should_panic] test to verify it actually fires.

The 12 New Invariant Checkers

We implemented comprehensive invariant checking across all VSR protocol operations:

Core Safety

  • CommitMonotonicityChecker: Ensures commit_number never regresses
  • ViewNumberMonotonicityChecker: Ensures views only increase
  • IdempotencyChecker: Detects double-application of operations
  • LogChecksumChainChecker: Verifies continuous hash chain integrity

Byzantine Resistance

  • StateTransferSafetyChecker: Preserves committed ops during transfer
  • QuorumValidationChecker: All quorum decisions have f+1 responses
  • LeaderElectionRaceChecker: Detects split-brain scenarios
  • MessageOrderingChecker: Catches protocol violations

Compliance Critical

  • TenantIsolationChecker: NO cross-tenant data leakage (HIPAA/GDPR)
  • CorruptionDetectionChecker: Verifies checksums catch corruption
  • RepairCompletionChecker: Ensures repairs don’t hang forever
  • HeartbeatLivenessChecker: Monitors leader heartbeats

Coverage increased from 65% to 95%+.

VOPR Scenarios: Comprehensive Coverage

We added 15 high-priority test scenarios across 5 categories:

Byzantine Attacks (5):

  • DVC tail length mismatch
  • Identical claims (tests tie-breaking)
  • Oversized StartView (DoS)
  • Invalid repair ranges
  • Invalid kernel commands

Corruption Detection (3):

  • Random bit flips
  • Checksum validation
  • Silent disk failures

Recovery & Crashes (3):

  • During commit application
  • During view change
  • With corrupt log

Gray Failures (2):

  • Slow disk I/O
  • Intermittent network

Race Conditions (2):

  • Concurrent view changes
  • Commit during DoViewChange

Total: 27 scenarios (up from 12).

Performance Impact

Despite adding significant safety checks, performance impact was minimal:

  • Throughput regression: <0.1%
  • p99 latency increase: +1μs
  • p50 latency increase: <1μs

All production assertions are on hot paths and optimized for minimal overhead.

Lessons Learned

1. Test What You Think You’re Testing

Our Byzantine tests weren’t testing what we thought. Always verify your test harness actually exercises the code paths you care about.

2. Determinism is Non-Negotiable

Any non-determinism in consensus protocols is a ticking time bomb. Use:

  • Seeded RNGs for all randomness
  • Deterministic tie-breakers
  • Explicit ordering for all collections

3. Assertions are Documentation

Each assertion is executable documentation of an invariant:

assert!(
    new_state.stream_exists(&stream_id),
    "stream {stream_id:?} must exist after creation"
);

Future developers (including yourself) will thank you.

4. Byzantine Failures are Weird

Byzantine failures don’t look like normal failures. They’re:

  • Precisely crafted to exploit edge cases
  • Often only detectable via quorum agreement
  • Hardest to reproduce and debug

You need specialized testing infrastructure.

5. Property Tests + Invariant Checkers = Confidence

The combination is powerful:

  • Property tests generate diverse scenarios
  • Invariant checkers catch violations automatically
  • Together they find bugs humans miss

What’s Next

With this hardening complete, we’re ready for:

  1. Bug Bounty Program: Submit findings to TigerBeetle/FoundationDB bounty programs
  2. Production Deployment: Safety checks in place for when users arrive
  3. Continuous Testing: Nightly 1M+ iteration VOPR campaigns
  4. More Invariants: Expand coverage to 100%

Conclusion

Building production-grade distributed systems requires:

  • Paranoid testing: Assume Byzantine adversaries
  • Comprehensive validation: 95%+ invariant coverage
  • Defense in depth: Assertions + invariants + property tests
  • Continuous improvement: Each bug is a learning opportunity

The 18 bugs we fixed would have caused data loss, corruption, or availability failures in production. Finding them in testing, not production, is what separates hobby projects from production-grade systems.


Stats:

  • 18 bugs fixed (5 critical Byzantine, 7 medium logic bugs)
  • 38 production assertions promoted
  • 12 new invariant checkers
  • 15 new VOPR test scenarios
  • ~3,500 lines of new code
  • 1,341 tests passing
  • 0 violations in 1M+ iteration fuzzing

Timeline: 20 days of focused work

Team: Claude Code + Human Oversight

Building trust in distributed systems, one invariant at a time.