TLC Caught Our First Consensus Bug: Why We Do Formal Verification
We’ve been formally verifying Kimberlite’s VSR consensus protocol with TLA+ and TLC model checking. Today, TLC found its first bug—a subtle violation of the Agreement invariant that could cause replicas to commit different values at the same position.
This is exactly the kind of bug that formal verification is meant to catch: subtle, hard to test, and catastrophic if it reaches production. Here’s the story of the bug, how TLC found it, and why we’re grateful we caught it at the specification level.
The Smoking Gun
TLC2 Version 2.19
Model checking in progress...
Error: Invariant Agreement is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
State 2: <LeaderPrepare>
State 3: <FollowerOnPrepare>
State 4: <LeaderOnPrepareOkQuorum> # r1 commits operation 1
State 5: <StartViewChange>
...
State 13: <LeaderOnDoViewChangeQuorum> # New leader chooses WRONG log
...
State 18: <LeaderOnPrepareOkQuorum> # r3 commits DIFFERENT operation 1
TLC found a trace where:
- Replica r1 committed operation 1 with
view=0 - View change happened to
view=1 - New leader r2 chose an empty log, discarding the committed operation
- Leader r2 prepared a new operation 1 with
view=1 - Replica r3 committed this different operation 1
Result: Two replicas have different values at position 1. Agreement violated! 🚨
What is the Agreement Invariant?
In consensus, the Agreement property is fundamental:
If two replicas commit a value at the same log position, they must commit the same value.
This is what makes distributed consensus useful. Without Agreement, you can’t trust that replicas have the same data. It’s the difference between “distributed database” and “distributed chaos.”
In TLA+, we express this as:
Agreement ==
\A r1, r2 \in Replicas, op \in OpNumber :
(op <= commitNumber[r1] /\ op <= commitNumber[r2] /\ op > 0) =>
(op <= Len(log[r1]) /\ op <= Len(log[r2]) =>
EntriesEqual(log[r1][op], log[r2][op]))
If both replicas have committed operation op, their log entries at position op must be equal (same view, same command, same checksum).
The Bug: Two Issues in View Change
TLC’s counterexample revealed two distinct bugs:
Bug #1: Incorrect Log Selection
The Code (lines 294-299 of VSR.tla):
\* Find log with highest op number
mostRecentLog == CHOOSE dvc \in doVCs :
\A other \in doVCs : dvc.opNum >= other.opNum
\* Find highest commit number
maxCommit == CHOOSE c \in {dvc.commitNum : dvc \in doVCs} :
\A other \in {dvc.commitNum : dvc \in doVCs} : c >= other
The Problem: The leader chose:
- The log with the highest opNum (most operations)
- The max commitNum from any replica
But these could come from different replicas!
Scenario that breaks:
- Replica A:
opNum=5, commitNum=0, log=[e1@v0, e2@v0, ..., e5@v1](has operations, didn’t process Commit) - Replica B:
opNum=1, commitNum=1, log=[e1@v0](fewer operations, saw Commit message) - Leader incorrectly chooses: Replica A’s log (highest opNum) + commitNum=1
- But Replica A’s entry at position 1 might be different from Replica B’s!
Bug #2: StartView Re-Processing
The Code (line 358):
FollowerOnStartView(r, msg) ==
/\ msg \in messages
/\ msg.type = "StartView"
/\ msg.view >= view[r] \* WRONG: allows re-processing
/\ ...
/\ log' = [log EXCEPT ![r] = msg.replicaLog]
The Problem: The condition msg.view >= view[r] allows a replica to process StartView for the same view it’s already in. This overwrites the log with potentially stale data!
Scenario:
- Replica r3 receives StartView for view=1, transitions to Normal
- Leader prepares new operations, r3 receives them
- r3 receives the old StartView message again (still in message set)
- r3 processes it and overwrites its log with the empty one from the StartView
- View change happens, r3 becomes leader with empty log
- Agreement violated!
The Fix: Learn from TigerBeetle
We looked at TigerBeetle’s VSR implementation to see how they handle view changes correctly.
Fix #1: Use log_view for Canonicalization
TigerBeetle uses log_view: the view in which a replica’s log was last updated. During view changes:
- Find the highest
log_viewacross all DoViewChange messages - Filter to “canonical” logs (those with
log_view == max) - Among canonical logs, choose the one with highest opNum
- Set
commitNumto the max across all replicas
The Corrected Code:
\* Helper: Get the log_view (view of highest entry, or 0 if empty)
LogView(dvc) == IF dvc.opNum > 0 /\ Len(dvc.replicaLog) > 0
THEN dvc.replicaLog[Len(dvc.replicaLog)].view
ELSE 0
\* Find maximum log_view (canonical view)
maxLogView == CHOOSE lv \in {LogView(dvc) : dvc \in allDvcs} :
\A other \in {LogView(dvc) : dvc \in allDvcs} : lv >= other
\* Filter to canonical DVCs (those with max log_view)
canonicalDvcs == {dvc \in allDvcs : LogView(dvc) = maxLogView}
\* Among canonical DVCs, choose the one with highest op number
mostRecentLog == CHOOSE dvc \in canonicalDvcs :
\A other \in canonicalDvcs : dvc.opNum >= other.opNum
\* Find maximum commit number across ALL replicas
maxCommit == CHOOSE c \in {dvc.commitNum : dvc \in allDvcs} :
\A other \in {dvc.commitNum : dvc \in allDvcs} : c >= other
Why this works:
log_viewtracks which view a log was last updated in- Logs with
log_view < maxhave been superseded by view changes - Only logs with
log_view = maxare “canonical” (current) - Among canonical logs, the one with highest opNum has the most prepared operations
- By quorum intersection, at least one canonical log contains all committed operations
Fix #2: Guard StartView Processing
The Fix:
FollowerOnStartView(r, msg) ==
/\ msg \in messages
/\ msg.type = "StartView"
\* Only process StartView if:
\* 1. It's for a newer view (msg.view > view[r]), OR
\* 2. It's for the current view AND we're in ViewChange status
/\ (msg.view > view[r] \/ (msg.view = view[r] /\ status[r] = "ViewChange"))
/\ ...
Why this works:
- Replicas in Normal status won’t re-process StartView for the same view
- Replicas in ViewChange status can still transition to Normal for the same view
- Prevents log overwrites from stale messages
Verification Results
After the fixes, TLC model checking passes with no violations:
Model checking completed. No error has been found.
45,102 states generated, 23,879 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 27.
All 6 safety invariants passed:
✅ TypeOK
✅ CommitNotExceedOp
✅ ViewMonotonic
✅ LeaderUniquePerView
✅ Agreement # The critical one!
✅ PrefixConsistency
Why This Matters
This bug demonstrates the value of formal verification:
1. Found Before Implementation
We caught this bug at the specification level, before writing any production code. Fixing a spec is a few lines of TLA+. Fixing production code after discovering data corruption would be:
- Emergency incident response
- Customer data analysis
- Patch deployment
- Reputation damage
2. Subtle Consensus Bug
This bug involves:
- Subtle interaction between view change and commit protocols
- Edge case requiring specific timing (committed op + view change + message reordering)
- Quorum intersection reasoning
- Agreement violation that would manifest rarely
Traditional testing would likely never catch this. Even VOPR (our deterministic simulator) might miss it due to the specific state space required.
3. Textbook Verification Win
This is a textbook example of why databases need formal verification:
- The protocol is complex (VSR view changes)
- The bug is subtle (requires understanding log_view semantics)
- The consequences are catastrophic (Agreement violation = data corruption)
- The fix is straightforward once identified
4. Builds Confidence
Knowing that TLC has exhaustively explored 45,000+ states and found no Agreement violations gives us confidence that the protocol is correct. Not just “probably works,” but proven correct for the state space explored.
Lessons Learned
1. Formal Verification Works
We invested time in:
- Writing TLA+ specifications
- Setting up TLC model checking
- Learning TLAPS for proofs
- Documenting invariants
That investment paid off by catching a critical bug before implementation.
2. Learn from Production Systems
When our spec failed, we looked at TigerBeetle’s implementation to understand the correct algorithm. They’ve run billions of transactions in production—learning from them is smart engineering.
3. State Space Matters
With the full configuration (MaxView=3, MaxOp=4), TLC was exploring 31M+ states and taking too long. We reduced parameters (MaxView=2, MaxOp=2) to get fast feedback (45K states in 1 second).
For thorough verification, we’ll run overnight with larger state spaces.
4. Multiple Verification Layers
We’re using multiple verification approaches:
- TLC: Bounded model checking (exhaustive for small state spaces)
- TLAPS: Unbounded mechanical proofs (covers all cases, but harder)
- VOPR: Deterministic simulation testing (tests implementation, not spec)
- Property testing: QuickCheck-style testing with random inputs
Each catches different classes of bugs.
What’s Next
1. Update Implementation
Our Rust VSR implementation needs to be reviewed against the corrected spec. We expect it to be correct (we followed TigerBeetle’s approach), but we’ll verify carefully.
2. Add Test Cases
We’ll add specific test cases to VOPR based on TLC’s counterexample:
- Committed operation + view change
- StartView message reordering
- View change with mismatched commitNum
3. Deeper Verification
Run TLC with larger parameters:
MaxView=4, MaxOp=6overnight (~1M states)- Full cluster (
Replicas=5) for production realism
4. TLAPS Proofs
We have TLAPS proofs in VSR_Proofs.tla that need updating for the corrected algorithm. The proofs will provide unbounded verification—true for all possible states, not just the ones TLC explored.
Conclusion
Formal verification found a critical consensus bug before we implemented it. The bug was subtle, the fix was clear, and we’re now confident in our protocol.
This is exactly why we invest in formal methods. Consensus is hard. Data corruption is catastrophic. Proving correctness upfront is worth every hour spent on TLA+.
If you’re building a distributed system and not using formal verification, you’re taking unnecessary risks. The bugs are out there, waiting. TLC will find them. Better it finds them in your spec than your customers find them in production.
Code: The fixes are in specs/tla/VSR.tla
Changelog: See CHANGELOG.md for full details
Learn More: Formal Verification Docs
Want to see more? Follow along as we build Kimberlite: kimberlite.io