VOPR Found Our First Bugs: A Linearizability Detective Story
We built VOPR (Viewstamped Operation Replication simulator) to stress-test Kimberlite’s consistency guarantees through deterministic simulation. On its first serious run, it immediately found bugs. Five of them. Not in the database itself—in our test infrastructure.
This is that detective story. A hunt for five increasingly subtle bugs, from “whoops, zero is a number” to “enhanced workloads with missing consistency tracking.”
The Smoking Gun
Running seed 96... FAILED at event 100
Invariant: linearizability
Message: Final history is not linearizable
======================================
Results:
Successes: 95
Failures: 5
Rate: 69697 sims/sec
Failed seeds (for reproduction):
vopr --seed 40 -v
vopr --seed 43 -v
vopr --seed 47 -v
vopr --seed 65 -v
vopr --seed 96 -v
5% of simulation runs were failing linearizability checks. Every failure was deterministic—the same seeds failed every time. This meant the bugs were real, not flaky.
Bug #1: Zero is a Valid Value
The Setup: Our linearizability checker tracks the history of read/write operations to verify they could have occurred in some sequential order. It maintained a current_value: u64 for each key.
The Bug: This code in invariant.rs:459-463:
let expected = if current_value == 0 else ;
The checker assumed that if current_value == 0, reads should return None (meaning “never written”). But 0 is a perfectly valid value to write!
The Scenario:
- Write value
0to key K - Storage returns
[0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00] - Read from key K returns
Some(0) - Linearizability checker expects
Nonebecausecurrent_value == 0 - Check fails: read saw
Some(0)but expectedNone
The Fix: Use Option<u64> instead of u64 to distinguish “never written” (None) from “written with value 0” (Some(0)):
Impact: 40% of failures eliminated.
Bug #2: Ignoring Storage Failures
The Setup: VOPR simulates fault injection—writes can fail, return partial data, or complete successfully. The simulation tracked operations for linearizability checking separately from storage state.
The Bug: This code in vopr.rs:290:
// Write to storage
let data = value.to_le_bytes.to_vec;
let _ = storage.write; // Result ignored!
// Schedule completion
sim.schedule_after;
We told the linearizability checker that writes succeeded even when storage rejected them!
The Scenario:
- Write(key=4, value=A) invoked, storage write fails due to fault injection
- We track: Write(4, A) succeeded
- Read(key=4) sees
None(write never persisted) - We track: Read(4, None)
- Write(key=4, value=B) invoked, storage write succeeds
- We track: Write(4, B) succeeded
- Read(key=4) sees
Some(A)(wait, what?!)
No, that’s wrong. Let me trace through again:
- Write(key=4, value=A) completes successfully, writes to storage
- Write(key=4, value=B) attempts but fails, doesn’t update storage
- We track: Write(4, B) succeeded ✗ (BUG!)
- Read(key=4) sees
Some(A)(from first write) - Linearizability checker expects
Some(B)(from tracked second write) - Check fails!
The Fix: Only track operations that actually succeeded in storage:
// Write to storage first
let data = value.to_le_bytes.to_vec;
let write_result = storage.write;
// Only track if write succeeded completely
let write_success = matches!;
if write_success
Impact: 80% of remaining failures eliminated.
Bug #3: Partial is Not Success
The Setup: Storage can perform partial writes—only writing some bytes before failing. Partial writes return WriteResult::Partial { bytes_written: 3 } instead of Success { bytes_written: 8 }.
The Bug: Our initial “fix” only checked for WriteResult::Success, but didn’t verify that ALL bytes were written:
let write_success = matches!;
A write returning Success { bytes_written: 3 } would be tracked as a successful write of a full 8-byte u64!
The Scenario:
- Write(key=5, value=42) writes only 3 bytes due to fault injection
- Returns
Success { bytes_written: 3 } - We track: Write(5, 42) succeeded
- Read(key=5) gets
Success { data: [3 bytes] } - We map 3 bytes to
None(not enough for a u64) - Linearizability checker sees: Write succeeded but read got None
- Check fails!
The Fix: Verify complete writes and skip partial reads:
// For writes: check bytes_written matches expected length
let write_success = matches!;
// For reads: only track complete reads
match result
Impact: 99.7% pass rate achieved (997/1000 seeds).
Bug #4: The Pending Operations Race
After the first three fixes, we still had ~0.1% failure rate (10 out of 10,000 runs). Time to dig deeper.
The Setup: VOPR simulates asynchronous operations. When a write is issued:
- We invoke the operation (start tracking it)
- Write to storage immediately
- Schedule a completion event for later
- When the completion event fires, we call respond() to mark the operation complete
The simulation runs until it hits max_events (default 100).
The Bug: Operations invoked near the end of the simulation might not complete before the simulation ends.
The Scenario (seed 66):
Event 14: Write(key=2, value=A) invoked
-> invoke() called, op_id=6 created
-> Storage updated with value A
-> Completion scheduled for later
Event 24: Completion event for op_id=6 fires
-> respond(op_id=6) called
-> Operation marked complete ✓
...
Event 99: Write(key=2, value=B) invoked
-> invoke() called, op_id=38 created
-> Storage updated with value B (overwrites A!)
-> Completion scheduled for later
Event 100: Read(key=2) returns B
-> Read sees value B in storage
-> Tracked as Op 39: Read(key=2, value=B)
Simulation ends (hit max_events=100)
-> Completion event for op_id=38 never fires
-> op_id=38 never gets respond() called
Linearizability check:
- Completed operations: Op 6 (Write A), Op 39 (Read B)
- Op 38 (Write B) is not in the completed list!
- Checker sees: Read B happened-after Write A, but B != A
- Violation!
The operation was invoked and modified storage state, but never marked complete, so the linearizability checker didn’t know about it.
The Fix: Complete all pending operations before running the final linearizability check:
// Complete all pending operations before final check
// In a real system, pending operations might time out, but for linearizability
// checking we need to account for all operations that modified storage state
for in &pending_ops
// Now do the linearizability check
let lin_result = linearizability_checker.check;
Impact: 100% pass rate achieved! 10,000 out of 10,000 runs passed.
Lessons Learned
1. Zero is Not Special (Unless You Make It Special)
When using sentinel values, be explicit:
-
let current_value: u64 = 0; // Is this "uninitialized" or "value 0"? -
let current_value: Option<u64> = None; // Explicit "not yet set"
2. Parse, Don’t Validate—Especially For I/O
Don’t just check if an operation “succeeded”:
// Bad: Binary success/failure
if write_succeeded
// Good: Validate the actual result
match storage.write
3. Test Infrastructure Has Bugs Too
We spent hours debugging the database before realizing the bugs were in the test harness. Good! Finding bugs in tests early means they won’t mask real bugs later.
VOPR is doing its job: making it impossible to ignore edge cases.
4. Asynchronous Operations Need Careful Boundaries
When operations can be in-flight (invoked but not completed), you must decide: what happens at the end of your test/simulation/time window?
- Ignore pending operations? (Missed bugs where storage state changed)
- Complete them all? (What we chose)
- Don’t start operations that can’t complete? (Conservative but limits test coverage)
There’s no universal answer. The key is being explicit about the choice and understanding its implications.
5. Deterministic Simulation is a Superpower
Every failure was reproducible with --seed N. No “works on my machine”, no “only happens in CI”. This is what makes deterministic simulation so powerful—bugs can’t hide.
6. Test the Tests When Adding Features
When extending your test harness with new operation types, audit ALL invariant checkers:
- Does the new operation affect linearizability? Track it.
- Does it change replica state? Update consistency checks.
- Does it modify the model? Keep it in sync with all checkers.
A mismatch between what storage sees and what invariant checkers track creates false positives that erode confidence in your test suite.
Bug #5: Enhanced Workloads Missing Linearizability Tracking
After fixing bugs #1-#4 and achieving 100% pass rate, we added enhanced workload patterns (Read-Modify-Write and Scan operations) to increase test coverage. Immediately, VOPR started failing again with a 9% failure rate.
The Setup: We added Read-Modify-Write (RMW) operations to simulate realistic workload patterns like incrementing counters.
The Bug: This code in vopr.rs:639-644:
if success
The RMW operation updated storage and the correctness model, but never tracked the write in the linearizability checker!
The Scenario:
- RMW operation runs on an empty key
- Sets value to
1(viaunwrap_or(1)) - Storage updated: key →
Some(1) - Model updated: key →
1 - But linearizability checker has no record of the write!
- Later read returns
Some(1) - Linearizability checker sees: Read(key) → Some(1) with no prior Write
- Check fails!
The Pattern: Every failure showed reads returning Some(1) for keys with no corresponding write in the history:
- Seed 43:
Read { key: 7, value: Some(1) }- no write to key 7 - Seed 96:
Read { key: 4, value: Some(1) }- no write to key 4 - Seed 102:
Read { key: 1, value: Some(1) }- no write to key 1
All these keys were written by untracked RMW operations.
The Fix: Track RMW operations just like regular writes:
if success
Impact: 100% pass rate restored across 1,000+ runs.
Lesson Learned: When adding new operation types to a test harness, ensure ALL invariant checkers are updated consistently. It’s easy to update storage and the correctness model while forgetting to update the linearizability checker—and the test will silently produce false positives.
The Numbers
Before fixes:
Successes: 95
Failures: 5
Failure rate: 5%
After bug #1-#3:
Successes: 997
Failures: 3
Failure rate: 0.3%
After bug #4:
Successes: 10,000
Failures: 0
Failure rate: 0%
Rate: 90,000+ sims/sec
After adding enhanced workloads (bug #5 introduced):
Successes: 91
Failures: 9
Failure rate: 9%
After bug #5 (final):
Successes: 1,000
Failures: 0
Failure rate: 0%
Rate: 157,000+ sims/sec
Five subtle bugs, each harder to find than the last. Each one a reminder that testing distributed systems requires thinking about timing, state, partial failures, and consistency across all system components in ways our intuition often misses.
What’s Next
Now that VOPR’s linearizability checker and enhanced workload patterns are solid, we can:
- Add more invariants: Hash chain integrity (implemented), replica consistency (implemented), log monotonicity, commit history tracking (implemented)
- Scale up: Run millions of simulations in CI, not just thousands
- Find real bugs: Start testing Kimberlite’s actual VSR replication code, not just the test infrastructure
- Add more workload patterns: Transactions, snapshots, secondary indexes
The fact that VOPR immediately found bugs (even if they were in the tests themselves) proves it works. Each time we enhanced the workloads, VOPR caught inconsistencies we might have missed. We’re ready to turn it loose on the real database.
Try It Yourself
# Clone Kimberlite
# Run VOPR
# Reproduce a failure
VOPR is open source and designed for reuse. If you’re building a distributed system and want deterministic simulation testing, steal our code. That’s what it’s for.
Found a bug in VOPR or Kimberlite? Open an issue or submit a PR. We’ll be launching a bug bounty program once Kimberlite reaches a stable release—rewarding security researchers and contributors who help us build a more robust system.
Want to learn more about linearizability? Read Strong consistency models by Aphyr or Testing Distributed Systems for Linearizability by Kingsbury.