Testing Strategy
Kimberlite is a compliance-critical system. Our testing strategy prioritizes finding bugs that could compromise data integrity, consensus correctness, or audit trail reliability. This document describes our approach, inspired by TigerBeetle’s deterministic simulation testing.
Table of Contents
- Philosophy
- Testing Pyramid
- Deterministic Simulation Testing (DST)
- VOPR Architecture
- Assertion Strategy
- Property-Based Testing
- Integration Testing
- Running Tests
- Debugging Failures
Philosophy
Test the Implementation, Not a Model
We test the actual production code, not a simplified model of it:
- TLA+ is for design: Formal specifications help us think, but they don’t find implementation bugs
- Simulation tests real code: Our simulator runs the actual consensus and storage code
- No mocks in the core: The kernel and consensus layers use real implementations, not test doubles
Simulation > Formal Proofs for Bug Finding
TigerBeetle’s experience shows that deterministic simulation testing finds more bugs than formal methods alone:
- Formal proofs verify the algorithm is correct
- Simulation finds the bugs in the implementation of that algorithm
- Most bugs are in edge cases: recovery, network partitions, disk failures
Assertions Are Safety Nets
Assertions catch bugs early, but they’re not a substitute for understanding:
// Good: the assertion documents the invariant it checks (names illustrative)
assert!(
    entry.offset == tail + 1,
    "log offsets must be sequential"
);

// Bad: the assertion checks something without recording why it must hold
assert!(x > 0);
Testing Pyramid
Our testing strategy uses multiple layers:
┌───────────────┐
│ Simulation │ VOPR: Full cluster under faults
│ (DST) │ Hours of simulated time
└───────┬───────┘
│
┌───────┴───────┐
│ Property │ Proptest: Randomized invariant checking
│ Tests │ Hundreds of cases per test
└───────┬───────┘
│
┌───────────────┴───────────────┐
│ Integration Tests │ Multi-component, real I/O
│ │ Happy paths + edge cases
└───────────────┬───────────────┘
│
┌───────────────────────┴───────────────────────┐
│ Unit Tests │ Single functions
│ │ Fast, deterministic
└───────────────────────────────────────────────┘
Time Investment
| Layer | % of Tests | Run Time | When to Run |
|---|---|---|---|
| Unit | 60% | Milliseconds | Every save |
| Integration | 20% | Seconds | Pre-commit |
| Property | 15% | Minutes | CI |
| Simulation | 5% | Hours | Nightly/Weekly |
Deterministic Simulation Testing (DST)
DST is our primary tool for testing consensus and replication. It allows us to:
- Run thousands of nodes in a single process
- Inject faults precisely and reproducibly
- Control time to test timeouts and leader election
- Reproduce failures with seeds
Why Deterministic?
A test is deterministic if, given the same inputs, it produces the same outputs. For simulation testing, this means:
- Same seed → Same execution: Every message, fault, and timeout happens identically
- Reproducible bugs: A failing seed always fails the same way
- Debuggable: Step through the exact sequence that caused failure
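The seed-determinism property can be illustrated with a minimal sketch, where SplitMix64 stands in for whatever PRNG VOPR actually uses: every source of randomness in a run derives from one u64 seed, so replaying the seed replays the run.

```rust
/// Minimal SplitMix64 PRNG: all of a simulation's randomness
/// (message ordering, fault timing) derives from one u64 seed.
struct Rng(u64);

impl Rng {
    fn new(seed: u64) -> Self {
        Rng(seed)
    }
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// A "run" is fully determined by its seed: replaying the seed
/// replays the exact event sequence.
fn run(seed: u64, steps: usize) -> Vec<u64> {
    let mut rng = Rng::new(seed);
    (0..steps).map(|_| rng.next()).collect()
}

fn main() {
    // Same seed, identical execution; different seed, different execution.
    assert_eq!(run(0x1234, 100), run(0x1234, 100));
    assert_ne!(run(0x1234, 100), run(0x5678, 100));
}
```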
How It Works
The simulator replaces all sources of non-determinism:
- Production code uses traits for external dependencies (clock, network, storage)
- The simulator provides deterministic implementations of those traits
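A minimal sketch of this trait-based substitution, with illustrative names (Kimberlite's actual trait surface will differ):

```rust
use std::time::Duration;

/// Production code depends on a Clock trait, never on the OS clock
/// directly. (Trait and type names are illustrative.)
trait Clock {
    fn now_ms(&self) -> u64;
    fn advance(&mut self, by: Duration);
}

/// The simulator's clock only moves when the supervisor advances it,
/// so timeouts fire at exactly reproducible points.
struct SimClock {
    now_ms: u64,
}

impl Clock for SimClock {
    fn now_ms(&self) -> u64 {
        self.now_ms
    }
    fn advance(&mut self, by: Duration) {
        self.now_ms += by.as_millis() as u64;
    }
}

/// Code written against the trait runs unchanged in production
/// (real clock) and under the simulator (controlled clock).
fn election_timeout_expired(clock: &dyn Clock, started_ms: u64, timeout_ms: u64) -> bool {
    clock.now_ms().saturating_sub(started_ms) >= timeout_ms
}

fn main() {
    let mut clock = SimClock { now_ms: 0 };
    assert!(!election_timeout_expired(&clock, 0, 150));
    clock.advance(Duration::from_millis(200)); // time moves only when we say so
    assert!(election_timeout_expired(&clock, 0, 150));
}
```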
VOPR Architecture
VOPR (Kimberlite OPerations Randomizer) is our deterministic simulator, inspired by TigerBeetle’s VOPR.
Components
┌─────────────────────────────────────────────────────────────────┐
│ VOPR │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Supervisor │ │
│ │ - Drives simulation clock │ │
│ │ - Schedules faults │ │
│ │ - Runs checkers │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Simulated Node │ │ Simulated │ │ Simulated │ │
│ │ 0 │ │ Node 1 │ │ Node 2 │ │
│ │ │ │ │ │ │ │
│ │ ┌────────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Runtime │ │ │ │ Runtime │ │ │ │ Runtime │ │ │
│ │ └────────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ ┌────────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Kernel │ │ │ │ Kernel │ │ │ │ Kernel │ │ │
│ │ └────────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ │ ┌────────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Storage │ │ │ │ Storage │ │ │ │ Storage │ │ │
│ │ └────────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │
│ └──────────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ ▼ ▼ │
│ ┌──────────────────────┐ ┌──────────────────────────────┐ │
│ │ Simulated Network │ │ Simulated Time │ │
│ │ - Message queue │ │ - Discrete events │ │
│ │ - Partition faults │ │ - Timeout scheduling │ │
│ │ - Delay injection │ │ - Deterministic ordering │ │
│ └──────────────────────┘ └──────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Fault Injector │ │
│ │ - Node crashes - Message corruption │ │
│ │ - Network partitions - Bit flips in storage │ │
│ │ - Message reordering - Slow disks │ │
│ │ - Message drops - Full disks │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Invariant Checkers │ │
│ │ - Offset monotonicity - Hash chain integrity │ │
│ │ - Log consistency - MVCC correctness │ │
│ │ - Replica convergence - Projection consistency │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Fault Types
VOPR can inject various fault types, including advanced patterns inspired by FoundationDB and TigerBeetle.
Network Faults: partitions, message drops, message reordering, delays, and message corruption.
Storage Faults: bit flips, slow disks, and full disks.
Node Faults: crashes and restarts.
Gray Failures (TigerBeetle-inspired):
Gray failures are partial failures that are harder to detect than complete crashes: slow disk I/O, intermittent network partitions, and degraded nodes that still respond.
Gray failures are particularly dangerous because:
- Nodes appear healthy (respond to heartbeats)
- Timeouts may not trigger (responses arrive, just slowly)
- State can diverge subtly over time
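A sketch of why a gray-failure mode evades naive health checks (types and numbers illustrative, not VOPR's actual fault model):

```rust
/// Sketch of a gray-failure fault: a disk that still completes every
/// request, just 100x slower, so heartbeats keep passing while the
/// node silently falls behind.
#[derive(Clone, Copy)]
enum DiskMode {
    Healthy,
    Gray { slowdown: u32 }, // partial failure: slow, not dead
}

fn io_latency_us(mode: DiskMode, base_us: u32) -> u32 {
    match mode {
        DiskMode::Healthy => base_us,
        DiskMode::Gray { slowdown } => base_us * slowdown,
    }
}

/// A naive health check that only asks "did the request complete?"
/// cannot distinguish the two modes, which is what makes gray
/// failures dangerous.
fn responds(_mode: DiskMode) -> bool {
    true // both modes eventually complete every request
}

fn main() {
    let healthy = DiskMode::Healthy;
    let gray = DiskMode::Gray { slowdown: 100 };
    assert!(responds(healthy) && responds(gray)); // both look "up"
    assert_eq!(io_latency_us(healthy, 50), 50);
    assert_eq!(io_latency_us(gray, 50), 5_000); // but one is 100x slower
}
```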
Invariant Checkers
After each step, VOPR runs invariant checks:
- Log consistency: all committed entries must be identical across replicas
- Projection consistency: projections must match log contents
- Hash chain integrity: the hash chain must be valid
- Byte-for-byte replica comparison (TigerBeetle-inspired): all caught-up replicas must have identical storage
Swizzle-Clogging Tests
Swizzle-clogging (from FoundationDB) randomly clogs and unclogs network connections to find partition edge cases:
What swizzle-clogging finds:
- Race conditions during partition healing
- View change edge cases when leader becomes reachable
- Message ordering bugs when clogged messages arrive in bursts
- Timeout tuning issues
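A minimal swizzle-clogger sketch, with illustrative toggle probability and a toy xorshift PRNG so the clogging schedule stays deterministic:

```rust
/// Minimal swizzle-clogger: each step, flip each node's network on or
/// off with some probability, driven by a deterministic seed.
struct Swizzler {
    clogged: Vec<bool>,
    rng_state: u64,
}

impl Swizzler {
    fn new(nodes: usize, seed: u64) -> Self {
        Swizzler { clogged: vec![false; nodes], rng_state: seed }
    }

    /// xorshift64: deterministic, so a seed replays the same schedule.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    /// With probability ~1/8 (illustrative), toggle each node's
    /// clogged state: partitions appear and heal at random points.
    fn step(&mut self) {
        for i in 0..self.clogged.len() {
            if self.next_u64() % 8 == 0 {
                self.clogged[i] = !self.clogged[i];
            }
        }
    }

    /// A message is delivered only if neither endpoint is clogged.
    fn deliverable(&self, from: usize, to: usize) -> bool {
        !self.clogged[from] && !self.clogged[to]
    }
}

fn main() {
    let mut s = Swizzler::new(3, 0xDEAD_BEEF);
    assert!(s.deliverable(0, 1)); // everything starts unclogged
    for _ in 0..1_000 {
        s.step(); // partitions randomly appear and heal
    }
}
```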
Enhanced Fault Categories
VOPR distinguishes between different types of storage faults for Protocol-Aware Recovery (PAR):
PAR Truncation Rule: A prepare can only be truncated if 4+ of 6 replicas report NotSeen. This prevents truncating prepares that might have been committed (if a replica has Seen or Corrupt, the prepare might be committed).
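The truncation rule can be stated directly as code; the enum below is a sketch of the three-way distinction the text describes, not Kimberlite's actual type:

```rust
/// Prepare status categories for Protocol-Aware Recovery (sketch).
#[derive(PartialEq)]
enum PrepareStatus {
    NotSeen, // replica never received this prepare
    Seen,    // replica has an intact copy
    Corrupt, // replica received it but the copy failed its checksum
}

/// The PAR truncation rule: truncate only when at least 4 of 6
/// replicas report NotSeen. A Seen or Corrupt report means the
/// prepare may have been committed, so truncating could lose
/// committed data.
fn can_truncate(reports: &[PrepareStatus; 6]) -> bool {
    reports.iter().filter(|s| **s == PrepareStatus::NotSeen).count() >= 4
}

fn main() {
    use PrepareStatus::*;
    // 5 of 6 never saw the prepare: safe to truncate.
    assert!(can_truncate(&[NotSeen, NotSeen, NotSeen, NotSeen, NotSeen, Seen]));
    // Only 3 of 6 report NotSeen, and a Corrupt copy exists, so the
    // prepare might be committed: truncation is forbidden.
    assert!(!can_truncate(&[NotSeen, NotSeen, NotSeen, Seen, Seen, Corrupt]));
}
```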
Time Compression
VOPR uses simulated time with compression ratios of 10:1 or higher.
Time compression allows testing hours of simulated operation in minutes of wall-clock time.
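Time compression falls out of discrete-event scheduling: the simulator jumps the clock straight to the next scheduled event instead of sleeping. A minimal sketch:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Discrete-event time: instead of sleeping, jump the clock to the
/// next scheduled event, so an hour of simulated timeouts costs
/// microseconds of wall-clock time.
struct EventQueue {
    now_ms: u64,
    // Min-heap of (fire_time_ms, event_id).
    pending: BinaryHeap<Reverse<(u64, u32)>>,
}

impl EventQueue {
    fn new() -> Self {
        EventQueue { now_ms: 0, pending: BinaryHeap::new() }
    }

    fn schedule(&mut self, delay_ms: u64, event_id: u32) {
        self.pending.push(Reverse((self.now_ms + delay_ms, event_id)));
    }

    /// Advance the clock directly to the next event: no real waiting.
    fn pop_next(&mut self) -> Option<u32> {
        self.pending.pop().map(|Reverse((t, id))| {
            self.now_ms = t;
            id
        })
    }
}

fn main() {
    let mut q = EventQueue::new();
    q.schedule(3_600_000, 1); // an election timeout an hour away
    q.schedule(500, 2);       // a heartbeat in 500 ms
    assert_eq!(q.pop_next(), Some(2)); // heartbeat fires first
    assert_eq!(q.pop_next(), Some(1)); // then the clock jumps an hour
    assert_eq!(q.now_ms, 3_600_000);   // simulated, instantly
}
```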
Model Verification
VOPR’s model verification ensures that the simulated database state matches expected values even under extreme fault injection. This validates read-your-writes semantics and durability guarantees.
How It Works:
- Maintains an in-memory model (`KimberliteModel`) tracking both pending (unfsynced) and durable (fsynced) writes
- After each operation, compares actual storage state against model expectations
- Verifies that reads match what was written, even with write reordering, fsync failures, and crashes
Key Features:
- Read-your-writes guarantee: Maintained even with write reordering enabled
- Fsync failure handling: Pending writes cleared from model when fsync fails, matching storage behavior
- Checkpoint recovery: Full crash/recovery cycle testing with state synchronization
- Strict verification: Zero tolerance for inconsistencies, aligning with Kimberlite’s compliance focus
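A minimal sketch of the pending/durable bookkeeping (the real `KimberliteModel` is richer; the names here beyond that distinction are illustrative):

```rust
use std::collections::HashMap;

/// Sketch of the pending/durable split the model tracks.
struct Model {
    pending: HashMap<u64, Vec<u8>>, // written but not yet fsynced
    durable: HashMap<u64, Vec<u8>>, // fsynced: must survive crashes
}

impl Model {
    fn new() -> Self {
        Model { pending: HashMap::new(), durable: HashMap::new() }
    }

    fn write(&mut self, offset: u64, data: Vec<u8>) {
        self.pending.insert(offset, data);
    }

    /// On successful fsync, pending writes become durable.
    fn fsync_ok(&mut self) {
        self.durable.extend(self.pending.drain());
    }

    /// On fsync failure, pending writes are cleared from the model,
    /// matching the storage behavior described above.
    fn fsync_failed(&mut self) {
        self.pending.clear();
    }

    /// A crash loses everything that was not yet durable.
    fn crash(&mut self) {
        self.pending.clear();
    }

    /// Read-your-writes: a pending write shadows the durable one.
    fn expected_read(&self, offset: u64) -> Option<&Vec<u8>> {
        self.pending.get(&offset).or_else(|| self.durable.get(&offset))
    }
}

fn main() {
    let mut m = Model::new();
    m.write(1, b"a".to_vec());
    m.fsync_ok();
    m.write(2, b"b".to_vec());
    m.crash(); // unfsynced write 2 is lost; durable write 1 survives
    assert_eq!(m.expected_read(1), Some(&b"a".to_vec()));
    assert_eq!(m.expected_read(2), None);
}
```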
Alignment with Formal Specs:
The model verification directly tests assumptions made in specs/tla/Recovery.tla:
- Committed entries persist through crashes (Recovery.tla:108)
- Uncommitted entries may be lost on fsync failure (Recovery.tla:112)
- Recovery protocol restores committed state from quorum (Recovery.tla:118-199)
Example Verification:
// After a write succeeds, record it in the model (method names illustrative)
model.apply_pending_write(offset, &payload);

// Later, when reading, verify against the model's expectation:
let actual = storage.read(offset)?;
assert!(model.verify_read(offset, &actual));
Performance: Model verification adds <0.1% overhead while catching critical bugs that would otherwise manifest as data loss or corruption in production.
Running VOPR
# Run simulation with random seed
cargo run --bin vopr --release

# Run with specific seed (for reproduction)
cargo run --bin vopr --release -- --seed 0x1234567890abcdef

# Run for longer (default: 1000 operations)
cargo run --bin vopr --release -- --operations 100000

# Additional flags control fault intensity and continuous runs with
# statistics reporting
VOPR Predefined Scenarios
VOPR includes 27 predefined test scenarios across six categories:
# List all available scenarios (flag names illustrative)
cargo run --bin vopr --release -- --list-scenarios

# Run a specific scenario
cargo run --bin vopr --release -- --scenario byzantine_dvc_tail_length_mismatch
Scenario Categories:
| Category | Count | Description |
|---|---|---|
| Byzantine Attacks | 5 | Protocol-level Byzantine mutations testing VSR handler validation |
| Corruption Detection | 3 | Bit flips, checksum validation, silent disk failures |
| Recovery & Crashes | 3 | Crash during commit/view change, recovery with corrupt log |
| Gray Failures | 2 | Slow disk I/O, intermittent network partitions |
| Race Conditions | 2 | Concurrent view changes, commit during DoViewChange |
| Network & General | 12 | Original scenarios (baseline, swizzle-clogging, multi-tenant, etc.) |
High-Priority Byzantine Attack Scenarios (added in v0.2.0):
| Scenario | Bug Tested | Expected Behavior |
|---|---|---|
| `byzantine_dvc_tail_length_mismatch` | Bug 3.1 | Reject DoViewChange with log_tail length ≠ claimed ops |
| `byzantine_dvc_identical_claims` | Bug 3.3 | Deterministic tie-breaking via checksum → replica ID |
| `byzantine_oversized_start_view` | Bug 3.4 | Reject StartView with >10k log entries (DoS protection) |
| `byzantine_invalid_repair_range` | Bug 3.5 | Reject RepairRequest with invalid ranges |
| `byzantine_invalid_kernel_command` | Bug 3.2 | Gracefully handle Byzantine commands during commit |
Running Comprehensive Validation:
# Byzantine attack scenarios (10k iterations each; --scenario flag illustrative)
for scenario in byzantine_dvc_tail_length_mismatch byzantine_dvc_identical_claims \
                byzantine_oversized_start_view byzantine_invalid_repair_range \
                byzantine_invalid_kernel_command; do
    cargo run --bin vopr --release -- --scenario "$scenario" --operations 10000
done

# Corruption detection scenarios run the same way (5k iterations each)

# Long-running fuzzing campaign (1M iterations)
cargo run --bin vopr --release -- --operations 1000000
Validation Results (v0.2.0):
- Total scenarios: 27 (up from 12)
- Iterations tested: 1M+ across all scenarios
- Invariant violations: 0
- Byzantine rejections: Working correctly (instrumented and verified)
See docs-internal/vopr/scenarios.md for detailed configuration and usage examples for all scenarios.
Assertion Strategy
Assertions are our first line of defense against bugs.
Assertion Density Goal
Every function should have at least 2 assertions: one precondition and one postcondition.
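A function meeting the density goal (the function and names are illustrative):

```rust
/// One precondition assertion, one postcondition assertion.
fn append_offset(tail: u64, count: u64) -> u64 {
    // Precondition: appending zero entries is a caller bug.
    assert!(count > 0, "append of zero entries");

    let new_tail = tail.checked_add(count).expect("offset overflow");

    // Postcondition: the tail strictly advances.
    assert!(new_tail > tail, "tail did not advance");
    new_tail
}

fn main() {
    assert_eq!(append_offset(10, 5), 15);
}
```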
Paired Assertions
Write assertions in pairs—one at the write site, one at the read site:
// Write site (names illustrative): check the invariant as data is produced
let offset = log.append(&entry);
assert!(offset == prev_tail + 1);

// Read site: check the same invariant as data is consumed
let entry = log.read(offset);
assert!(entry.offset == offset);
Compound Assertions
Split compound conditions for better error messages:
// Bad: a compound assertion cannot tell you which condition broke
// (names illustrative)
assert!(offset > committed && offset <= tail);

// Good: split assertions pinpoint the failing condition
assert!(offset > committed);
assert!(offset <= tail);
Debug vs Release
- `assert!()`: Critical invariants, always checked
- `debug_assert!()`: Expensive checks, debug builds only
// Always check: corruption would be catastrophic (condition illustrative)
assert!(header.checksum == compute_checksum(&body));

// Debug only: O(n) validation too expensive for production
debug_assert!(entries.windows(2).all(|w| w[0].offset < w[1].offset));
Production Assertions (38 Promoted)
As part of our VSR hardening initiative, we promoted 38 critical debug_assert!() calls to production assert!() for runtime safety enforcement.
Categories:
- Cryptography (25): All-zero detection, key hierarchy integrity, ciphertext validation
- Consensus (9): Leader-only operations, view/commit monotonicity, quorum validation
- State Machine (4): Stream existence, effect counts, offset monotonicity
Why Production Assertions:
- Detect corruption BEFORE it propagates
- Catch Byzantine attacks in real-time
- Provide forensic evidence of failure mode
- Negligible performance impact (<0.1% throughput regression)
Testing: Every assertion has a corresponding #[should_panic] test in crates/kimberlite-crypto/src/tests_assertions.rs.
Performance Impact:
- Throughput: <0.1% regression
- p99 latency: +1μs
- p50 latency: <1μs
See docs/ASSERTIONS.md for complete guide on production assertion strategy.
Example Test:
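A sketch in the style of those tests, with an illustrative assertion. The real tests use `#[should_panic]`; `catch_unwind` keeps this sketch standalone:

```rust
/// Production assertion (illustrative): an all-zero key means key
/// derivation failed upstream.
fn validate_key(key: &[u8; 32]) {
    assert!(key.iter().any(|&b| b != 0), "all-zero key detected");
}

fn main() {
    // A valid key passes.
    let mut key = [0u8; 32];
    key[0] = 1;
    validate_key(&key);

    // An all-zero key must trip the assertion.
    let result = std::panic::catch_unwind(|| validate_key(&[0u8; 32]));
    assert!(result.is_err(), "assertion should have fired");
}
```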
Property-Based Testing
We use proptest for randomized invariant checking.
Approach
Property tests generate random inputs and verify that invariants hold:
use proptest::prelude::*;

proptest! {
    // Property (illustrative): wire-format round-trips are lossless.
    #[test]
    fn roundtrip_is_lossless(payload in prop::collection::vec(any::<u8>(), 0..1024)) {
        let encoded = encode(&payload);
        prop_assert_eq!(decode(&encoded).unwrap(), payload);
    }
}
What to Property Test
| Component | Properties |
|---|---|
| Log | Hash chain integrity, sequential positions, CRC validity |
| B+Tree | Sorted order, balanced height, key uniqueness |
| MVCC | Version visibility, no phantom reads |
| Consensus | Agreement, validity, termination |
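For example, the hash-chain integrity property reduces to "recomputing the chain from the payloads reproduces the stored hashes". A std-only sketch (production uses BLAKE3, not `DefaultHasher`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Each entry's hash covers its payload and the previous hash, so
/// tampering anywhere breaks every later link.
fn link(prev: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// Build the chain of hashes for a sequence of payloads.
fn chain(entries: &[&[u8]]) -> Vec<u64> {
    let mut hashes = Vec::new();
    let mut prev = 0u64;
    for e in entries {
        prev = link(prev, e);
        hashes.push(prev);
    }
    hashes
}

/// The invariant the property test checks: recomputing the chain
/// must reproduce the stored hashes exactly.
fn chain_valid(entries: &[&[u8]], hashes: &[u64]) -> bool {
    chain(entries) == hashes
}

fn main() {
    let entries: Vec<&[u8]> = vec![&b"a"[..], &b"b"[..], &b"c"[..]];
    let hashes = chain(&entries);
    assert!(chain_valid(&entries, &hashes));

    // Tamper with the middle entry: validation fails.
    let tampered: Vec<&[u8]> = vec![&b"a"[..], &b"X"[..], &b"c"[..]];
    assert!(!chain_valid(&tampered, &hashes));
}
```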
Integration Testing
Integration tests verify multi-component behavior with real I/O.
Patterns
- Setup/teardown with `tempdir`: each test gets an isolated temporary directory that is removed on drop
- Async tests with `#[tokio::test]` for paths that exercise the async runtime
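The setup/teardown pattern can be sketched with std only; the real tests use the tempfile and tokio crates, and the `TempDir` guard below mimics tempfile's drop-based cleanup:

```rust
use std::fs;
use std::path::PathBuf;

/// Drop-based teardown: the directory is removed even if the test
/// body panics partway through.
struct TempDir(PathBuf);

impl TempDir {
    fn new(name: &str) -> std::io::Result<Self> {
        let path = std::env::temp_dir().join(name);
        fs::create_dir_all(&path)?;
        Ok(TempDir(path))
    }
}

impl Drop for TempDir {
    fn drop(&mut self) {
        // Teardown: best-effort cleanup of the whole directory.
        let _ = fs::remove_dir_all(&self.0);
    }
}

fn main() -> std::io::Result<()> {
    // Setup: the test gets a private directory for real I/O...
    let dir = TempDir::new("kimberlite-integration-sketch")?;
    fs::write(dir.0.join("log"), b"entry")?;
    assert_eq!(fs::read(dir.0.join("log"))?, b"entry");
    Ok(())
    // ...and Drop cleans it up here.
}
```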
Fuzzing
Fuzzing uses randomized inputs to find crashes, panics, and edge cases in parsing and cryptographic code.
Fuzz Targets
Kimberlite includes two fuzz targets:
- `fuzz_wire_deserialize`: Wire protocol parsing (Frame, Request, Response)
- `fuzz_crypto_encrypt`: AES-256-GCM encryption round-trips and error handling
Running Fuzz Tests
# Install cargo-fuzz (requires nightly Rust)
cargo install cargo-fuzz

# List available fuzz targets
cargo +nightly fuzz list

# Run a fuzz target (Ctrl+C to stop)
cargo +nightly fuzz run fuzz_wire_deserialize

# Run with specific iteration count
cargo +nightly fuzz run fuzz_wire_deserialize -- -runs=100000

# Run with specific seed for reproduction
cargo +nightly fuzz run fuzz_wire_deserialize -- -seed=12345

# Run in parallel (4 jobs)
cargo +nightly fuzz run fuzz_wire_deserialize -- -jobs=4
CI Smoke Testing
For fast CI validation, run a limited number of iterations:
# Smoke test (10K iterations, ~30 seconds)
cargo +nightly fuzz run fuzz_wire_deserialize -- -runs=10000 \
  && cargo +nightly fuzz run fuzz_crypto_encrypt -- -runs=10000
Corpus Management
Fuzzing automatically saves interesting inputs to fuzz/corpus/:
# View corpus files
ls fuzz/corpus/fuzz_wire_deserialize/

# Clear corpus to start fresh
rm -rf fuzz/corpus/fuzz_wire_deserialize/

# Run with custom seed corpus (extra corpus directories are merged in)
cargo +nightly fuzz run fuzz_wire_deserialize path/to/seed-corpus/
Reproducing Crashes
When fuzzing finds a crash, it saves the input to fuzz/artifacts/:
# Reproduce a crash
cargo +nightly fuzz run fuzz_wire_deserialize fuzz/artifacts/fuzz_wire_deserialize/crash-<hash>

# Debug with gdb/lldb: run the crashing artifact under the debugger
# (the fuzz binary is built under fuzz/target/<triple>/release/)
See fuzz/README.md for detailed documentation.
Performance Benchmarking
Kimberlite uses Criterion.rs for statistical performance benchmarking.
Benchmark Suites
| Suite | File | What It Tests |
|---|---|---|
| `crypto` | benches/crypto.rs | Hash, encryption, signing operations |
| `kernel` | benches/kernel.rs | State machine transitions |
| `storage` | benches/storage.rs | Append-only log operations |
| `wire` | benches/wire.rs | Protocol serialization |
| `end_to_end` | benches/end_to_end.rs | Full system throughput |
Running Benchmarks
# Run all benchmarks
cargo bench

# Run specific suite
cargo bench --bench crypto

# Quick mode (fewer samples, faster)
cargo bench -- --quick

# Run specific benchmark
cargo bench -- blake3_hash

# Save baseline for comparison
cargo bench -- --save-baseline main

# Compare against baseline
cargo bench -- --baseline main
Interpreting Results
blake3_hash/1024 time: [498.23 ns 501.45 ns 504.98 ns]
thrpt: [2.03 GB/s 2.04 GB/s 2.05 GB/s]
- time: 95% confidence interval (lower, estimate, upper)
- thrpt: Throughput calculated from input size
Regression detection:
change: [+15.234% +18.567% +21.823%] (p = 0.00 < 0.05)
Performance has regressed.
Performance Targets
| Operation | Target | Measured | Status |
|---|---|---|---|
| BLAKE3 1KB | < 1 µs | ~500 ns | 2x better |
| AES-GCM Encrypt 1KB | < 5 µs | ~2 µs | 2.5x better |
| Ed25519 Sign | < 100 µs | ~10-20 µs | 5-10x better |
| Storage Write 1KB | < 500 µs | ~380 µs | Met |
| Kernel AppendBatch | < 20 µs | ~1.5 µs | 13x better |
| E2E Write p99 | < 5 ms | ~190 µs | 26x better |
See crates/kimberlite-bench/README.md for detailed usage and CI integration.
Running Tests
Unit Tests
# Run all unit tests
cargo test --workspace

# Run tests for specific crate
cargo test -p kimberlite-crypto

# Run specific test
cargo test test_name

# Run with output
cargo test -- --nocapture
Property Tests
# Run property tests (more cases than default)
PROPTEST_CASES=1000 cargo test --workspace

# Re-run a persisted failure (proptest replays cases recorded in
# proptest-regressions/ automatically)
PROPTEST_CASES=1 cargo test test_name
Simulation
# Run VOPR simulator
cargo run --bin vopr --release

# Or use just (recipe name illustrative):
just vopr

# List available scenarios (scenario flag names illustrative)
cargo run --bin vopr --release -- --list-scenarios

# Run specific scenario
cargo run --bin vopr --release -- --scenario baseline

# Run all scenarios
cargo run --bin vopr --release -- --all-scenarios

# Run with specific seed
cargo run --bin vopr --release -- --seed 0x1234567890abcdef

# Run extended simulation
cargo run --bin vopr --release -- --operations 1000000
Fuzzing
# List fuzz targets
cargo +nightly fuzz list

# Run fuzzer (Ctrl+C to stop)
cargo +nightly fuzz run fuzz_wire_deserialize

# Run smoke test (10K iterations, for CI)
cargo +nightly fuzz run fuzz_wire_deserialize -- -runs=10000

# Run with specific iteration count
cargo +nightly fuzz run fuzz_wire_deserialize -- -runs=1000000

# Run all fuzz targets
for target in $(cargo +nightly fuzz list); do
    cargo +nightly fuzz run "$target" -- -runs=10000
done
Benchmarks
# Run all benchmarks
cargo bench

# Run in quick mode (faster, fewer samples)
cargo bench -- --quick

# Run specific suite
cargo bench --bench storage

# Save baseline
cargo bench -- --save-baseline main

# Compare against baseline
cargo bench -- --baseline main

# Run all suites and open the HTML report
cargo bench && open target/criterion/report/index.html
CI Pipeline
test:
# Fast: unit tests
- cargo test --workspace
# Medium: property tests with more cases
- PROPTEST_CASES=500 cargo test --workspace
# Slow: short simulation
- cargo run --bin vopr --release -- --operations 10000
nightly:
# Extended simulation
- cargo run --bin vopr --release -- --operations 10000000 --timeout 28800
Debugging Failures
Reproducing VOPR Failures
When VOPR finds a failure, it prints the seed:
VOPR: Invariant violation detected!
Seed: 0x1234567890abcdef
Operation: 4532
Violation: LogDivergence at position 1234
To reproduce:
cargo run --bin vopr -- --seed 0x1234567890abcdef
Running with the same seed replays the failure exactly.
Shrinking
VOPR attempts to find a minimal reproduction:
VOPR: Shrinking failure...
Original: 4532 operations
Shrunk: 23 operations
Minimal reproduction seed: 0x1234567890abcdef_shrunk_23
Debugging with Traces
Enable detailed tracing to understand what happened:
RUST_LOG=vopr=trace cargo run --bin vopr -- --seed 0x1234567890abcdef
Common Failure Patterns
| Symptom | Likely Cause |
|---|---|
| LogDivergence | Bug in consensus prepare/commit |
| HashChainBroken | Bug in hash computation or storage corruption handling |
| LinearizabilityViolation | Bug in read consistency implementation |
| ProjectionInconsistent | Bug in projection apply logic |
| Timeout | Liveness bug in leader election |
Summary
Kimberlite’s testing strategy is built on layers:
- Unit tests: Fast, run constantly, catch obvious bugs
- Property tests: Randomized, find edge cases
- Integration tests: Real I/O, verify component interactions
- Simulation tests: Find consensus and replication bugs under faults
Advanced patterns from FoundationDB and TigerBeetle enhance our simulation:
- Swizzle-clogging: Random network clog/unclog to find partition edge cases
- Gray failures: Partial failures (slow, intermittent) that evade simple detection
- Byte-identical replica checkers: Verify caught-up replicas match exactly
- PAR fault categories: Distinguish “not seen” vs “seen but corrupt”
- Time compression: 10x+ speedup for extended simulation runs
The goal is not 100% code coverage, but confidence that:
- The log is always consistent
- Committed data is never lost
- Hash chains are never broken
- Projections match the log
- Replicas are byte-identical when caught up
- Recovery never truncates committed data
- The system recovers from any fault combination
When in doubt, add an assertion. When that assertion fires in simulation, you’ve found a bug before it reached production.
See Also:
- VOPR Deep Dive (Internal) - Detailed VOPR implementation and debugging
- VOPR Scenarios (Internal) - All test scenarios
- Assertions Guide - Production assertion patterns
- Property Testing - Proptest strategies