When "aaaaaaa" > "b": Fixing Lexicographic Ordering in B+Tree Keys
We discovered a subtle but critical bug in Kimberlite’s B+Tree index encoding: text and bytes keys didn’t preserve lexicographic ordering. This meant range scans and index lookups could return incorrect results.
The bug? We were using length-prefix encoding.
The fix? Null-terminated strings with escape sequences (inspired by FoundationDB).
The bonus? We saved 3 bytes per key.
The Bug
In Kimberlite, every indexed column gets encoded into a sortable byte sequence for B+Tree storage. For integers, this is straightforward—sign-flip encoding ensures -1 sorts before 0, which sorts before 1.
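A minimal sketch of sign-flip encoding (illustrative only; `encode_i64` is a hypothetical helper, not Kimberlite's actual code):

```rust
// Sign-flip encoding: XOR-ing the sign bit maps i64::MIN..=i64::MAX onto
// 0..=u64::MAX in order, so the big-endian bytes compare correctly.
fn encode_i64(v: i64) -> [u8; 8] {
    ((v as u64) ^ (1u64 << 63)).to_be_bytes()
}

fn main() {
    // -1 sorts before 0, which sorts before 1, in plain byte order.
    assert!(encode_i64(-1) < encode_i64(0));
    assert!(encode_i64(0) < encode_i64(1));
}
```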
For text, we thought we had it figured out:
```rust
// Original (WRONG) implementation: length-prefix encoding.
// Sketch; the real match arm was elided, buffer handling is assumed.
Text(s) => {
    out.push(0x02);                                         // type tag
    out.extend_from_slice(&(s.len() as u32).to_be_bytes()); // 4-byte length
    out.extend_from_slice(s.as_bytes());                    // raw bytes
}
```
The problem? Length dominates comparison.
When you compare two encoded keys byte-by-byte (which is what B+Trees do), the length prefix is compared first:
"b" → [0x02][0x00 0x00 0x00 0x01]['b']
^^^^ ^^^^^^^^^^^^^^^^^^^
type length = 1
"aaaaaaa" → [0x02][0x00 0x00 0x00 0x07]['a' 'a' 'a' 'a' 'a' 'a' 'a']
^^^^ ^^^^^^^^^^^^^^^^^^^
type length = 7
Byte-by-byte comparison:
- Type tags match: `0x02 == 0x02`
- Length comparison: `0x00000001 < 0x00000007`
- Result: `"b" < "aaaaaaa"` (WRONG!)
In reality, "aaaaaaa" should come before "b" in lexicographic order.
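The failure is easy to reproduce with a sketch of the buggy scheme (`encode_length_prefixed` is a hypothetical stand-in for the original code):

```rust
// Buggy scheme: [type tag][u32 big-endian length][bytes].
fn encode_length_prefixed(s: &str) -> Vec<u8> {
    let mut out = vec![0x02]; // type tag for text
    out.extend_from_slice(&(s.len() as u32).to_be_bytes());
    out.extend_from_slice(s.as_bytes());
    out
}

fn main() {
    // Byte-wise, the length prefix is compared before any character,
    // so the shorter "b" incorrectly sorts before "aaaaaaa".
    assert!(encode_length_prefixed("b") < encode_length_prefixed("aaaaaaa"));
    assert!("aaaaaaa" < "b"); // the order we actually wanted
}
```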
The Impact
This breaks any query that relies on ordering:
```sql
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE INDEX idx_users_name ON users(name);  -- index name assumed
```
```sql
INSERT INTO users VALUES
    (1, 'alice'),
    (2, 'bob'),
    (3, 'aaaaaaa');

-- This should return: aaaaaaa, alice, bob
SELECT * FROM users ORDER BY name;

-- With length-prefix encoding, we get: bob, alice, aaaaaaa
-- Because the length prefix dominates:
-- len('bob')=3 < len('alice')=5 < len('aaaaaaa')=7
```

Range scans are similarly broken:

```sql
-- Find all names starting with 'a'
SELECT * FROM users WHERE name >= 'a' AND name < 'b';
-- Expected: aaaaaaa, alice
-- Actual: empty! Every multi-character name encodes with length > 1,
-- so it sorts above the upper bound 'b', whose encoded length is 1
```
This is a correctness bug, not a performance issue. Indexes return wrong results.
The Fix: Null-Terminated Encoding
The solution is to use null-terminated strings instead of length-prefixed strings. This is the approach used by FoundationDB and documented in their tuple layer spec.
Basic Idea
Instead of encoding length explicitly, we append a 0x00 byte to terminate the string:
```rust
// Sketch of the basic idea (buffer handling assumed):
Text(s) => {
    out.push(0x02);                      // type tag
    out.extend_from_slice(s.as_bytes()); // raw bytes
    out.push(0x00);                      // terminator
}
```
Now our strings look like:
"b" → [0x02]['b'][0x00]
"aaaaaaa" → [0x02]['a']['a']['a']['a']['a']['a']['a'][0x00]
Byte-by-byte comparison:
- Type tags match: `0x02 == 0x02`
- First character: `'a' (0x61) < 'b' (0x62)`
- Result: `"aaaaaaa" < "b"` (CORRECT!)
The Embedded Null Problem
But wait—what if the string itself contains a null byte?
"a\0b" → [0x02]['a'][0x00]['b'][0x00]
^^^^
Is this the terminator or part of the string?
We need an escape sequence.
Escape Sequence Magic
The trick: encode embedded nulls as 0x00 0xFF, and use a bare 0x00 as the terminator:
```rust
// Sketch: escape embedded nulls, terminate with a bare 0x00.
Text(s) => {
    out.push(0x02); // type tag
    for &byte in s.as_bytes() {
        if byte == 0x00 {
            out.extend_from_slice(&[0x00, 0xFF]); // escaped null
        } else {
            out.push(byte);
        }
    }
    out.push(0x00); // bare terminator
}
```
Now:
"a\0b" → [0x02]['a'][0x00 0xFF]['b'][0x00]
^^^^^^^ ^^^^
escaped null terminator
"a" → [0x02]['a'][0x00]
^^^^
terminator
Why does this preserve ordering? Because 0x00 0xFF sorts after 0x00:
- Terminator: `[0x00]`
- Escaped null: `[0x00][0xFF]`
When comparing byte-by-byte:
"a"→[0x02]['a'][0x00]"a\0b"→[0x02]['a'][0x00 0xFF]['b'][0x00]- At position 2:
0x00 < 0x00(equal) - At position 3: nothing vs
0xFF - Result:
"a" < "a\0b"CORRECT!
The Implementation
Encoding (lines 240-248 in key_encoder.rs)
```rust
// Sketch of the final encoder (the real code lives at the lines cited
// above; tag value and buffer handling are assumptions):
Text(s) => {
    out.push(0x02); // type tag for text
    for &byte in s.as_bytes() {
        match byte {
            0x00 => out.extend_from_slice(&[0x00, 0xFF]), // escape embedded null
            b => out.push(b),
        }
    }
    out.push(0x00); // terminator
}
```
Decoding (lines 325-348)
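The decoder reverses the escape. A sketch of the shape of that code (an assumed reconstruction, not the actual key_encoder.rs contents):

```rust
// Walk the bytes until a bare 0x00 terminator, turning each 0x00 0xFF
// pair back into an embedded null. Returns the string and bytes consumed.
fn decode_text(bytes: &[u8]) -> Option<(String, usize)> {
    let mut out = Vec::new();
    let mut i = 1; // skip the 0x02 type tag
    while i < bytes.len() {
        match bytes[i] {
            0x00 if bytes.get(i + 1) == Some(&0xFF) => {
                out.push(0x00); // unescape embedded null
                i += 2;
            }
            0x00 => return String::from_utf8(out).ok().map(|s| (s, i + 1)),
            b => {
                out.push(b);
                i += 1;
            }
        }
    }
    None // ran out of bytes without seeing a terminator
}

fn main() {
    let encoded = [0x02, b'a', 0x00, 0xFF, b'b', 0x00];
    assert_eq!(decode_text(&encoded), Some(("a\0b".to_string(), 6)));
}
```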
Verification: Property-Based Testing
We can’t just test a few examples. We need property-based tests that verify ordering holds for arbitrary strings.
Using proptest, we generate thousands of random string pairs and verify the invariant:
```rust
// Sketch; strategy and helper names assumed.
proptest! {
    #[test]
    fn text_encoding_preserves_order(a in ".*", b in ".*") {
        // Invariant: byte order of encodings matches string order.
        prop_assert_eq!(a.cmp(&b), encode_text(&a).cmp(&encode_text(&b)));
    }
}
```
We run this with 10,000 iterations by setting `PROPTEST_CASES=10000` before running `cargo test`.
10,000 random string pairs, all ordering preserved.
We do the same for bytes:
```rust
// Same invariant for raw byte sequences (sketch).
proptest! {
    #[test]
    fn bytes_encoding_preserves_order(a in any::<Vec<u8>>(), b in any::<Vec<u8>>()) {
        prop_assert_eq!(a.cmp(&b), encode_bytes(&a).cmp(&encode_bytes(&b)));
    }
}
```
10,000 random byte sequences, all ordering preserved.
Edge Cases
We added comprehensive unit tests for edge cases that property tests might miss:
- The original bug case (`"aaaaaaa"` vs `"b"`)
- Embedded nulls
- Empty strings
- High byte values
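Representative unit tests for these cases might look like the following (a sketch; the helper `encode_text` and the test names are assumptions, not the actual suite):

```rust
// Hypothetical helper implementing the escaped, null-terminated scheme.
fn encode_text(s: &str) -> Vec<u8> {
    let mut out = vec![0x02];
    for &b in s.as_bytes() {
        if b == 0x00 {
            out.extend_from_slice(&[0x00, 0xFF]);
        } else {
            out.push(b);
        }
    }
    out.push(0x00);
    out
}

#[test]
fn test_text_ordering_original_bug_case() {
    assert!(encode_text("aaaaaaa") < encode_text("b"));
}

#[test]
fn test_embedded_null_sorts_after_prefix() {
    assert!(encode_text("a") < encode_text("a\0b"));
}

#[test]
fn test_empty_string_sorts_first() {
    assert!(encode_text("") < encode_text("a"));
}

#[test]
fn test_high_byte_values() {
    // U+00FF encodes as 0xC3 0xBF in UTF-8, well above ASCII.
    assert!(encode_text("z") < encode_text("\u{00FF}"));
}
```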
All tests pass.
The Bonus: Space Savings
Length-prefix encoding uses 4 bytes for the length:
"hello" → [type:1][len:4][data:5] = 10 bytes
Null-termination uses 1 byte for the terminator:
"hello" → [type:1][data:5][term:1] = 7 bytes
We save 3 bytes per string in the common case (no embedded nulls).
Worst case (string is all nulls):
"\0\0\0" → [type:1][0x00 0xFF][0x00 0xFF][0x00 0xFF][term:1] = 8 bytes
That’s roughly 2× the raw string size (every null byte doubles), plus the tag and terminator. But in practice, strings with many embedded nulls are rare.
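The size arithmetic above can be double-checked with two tiny helpers (hypothetical names, matching the formats described in this section):

```rust
// New format: [tag][data with each null doubled][terminator].
fn null_term_len(data: &[u8]) -> usize {
    let nulls = data.iter().filter(|&&b| b == 0x00).count();
    1 + data.len() + nulls + 1
}

// Old format: [tag][4-byte length][data].
fn length_prefix_len(data: &[u8]) -> usize {
    1 + 4 + data.len()
}

fn main() {
    assert_eq!(length_prefix_len(b"hello"), 10);   // before
    assert_eq!(null_term_len(b"hello"), 7);        // after: 3 bytes saved
    assert_eq!(null_term_len(b"\x00\x00\x00"), 8); // worst case: all nulls
}
```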
For typical text data (names, descriptions, JSON keys), we get both:
- Correct ordering
- Smaller keys (−30% for 5-char strings, −20% for 10-char strings)
Breaking Change Notice
This fix changes the on-disk format for text and bytes keys. Existing indexes must be rebuilt.
For a compliance-first database in pre-release, this is acceptable—we’d rather fix correctness bugs now than after production deployments.
If we needed backward compatibility, we could:
- Add a version byte to the key format
- Support both encodings during a transition period
- Provide a migration utility
But for now, we’re prioritizing correctness over compatibility. Build it right first.
Lessons Learned
1. Length-Prefix ≠ Lexicographic
Length-prefix encoding is great for:
- Parsing (know how many bytes to read)
- Validation (detect truncation)
- Self-describing formats (protobufs, msgpack)
But it’s terrible for:
- Sortable keys (length dominates comparison)
- B+Trees (need byte-wise lexicographic ordering)
- Range scans (breaks ordering invariants)
Use the right encoding for the job.
2. Property Tests Catch Ordering Bugs
Unit tests might check "a" < "b", but miss "aaaaaaa" < "b". Property tests generate thousands of random cases, including edge cases like:
- Empty strings
- Single characters
- Very long strings
- Strings with embedded nulls
- Strings with high byte values
Property tests found subtle cases our hand-written tests missed.
3. Learn From Production Systems
We didn’t invent null-terminated encoding. We copied it from FoundationDB, which has been battle-tested in production for years.
When building a database, steal from the best:
- FoundationDB: tuple encoding, deterministic simulation
- SQLite: B+Tree design, query planner
- PostgreSQL: MVCC, WAL design
- RocksDB: LSM trees, bloom filters
Standing on the shoulders of giants means fewer bugs and faster progress.
4. Assertions Are Documentation
Our decode function has assertions like:
```rust
// Representative sketches; the exact asserted conditions were elided here.
debug_assert!(!bytes.is_empty(), "cannot decode an empty key");
debug_assert!(bytes[0] == 0x02, "expected text type tag");
```
These serve two purposes:
- Catch bugs early (in debug builds)
- Document invariants (what must be true at this point)
Following Pressurecraft’s “assertion density” principle, we aim for 2+ assertions per function. If you can’t assert it, you might not understand it.
5. Fix Bugs With Tests
We didn’t just fix the code—we added tests that would have caught the bug:
- `test_text_ordering_original_bug_case`: the exact bug that triggered this work
- Property tests with 10,000 iterations
- Edge case tests for embedded nulls, empty strings, etc.
Now if someone accidentally reverts the fix, the tests will fail immediately.
What’s Next
With text/bytes ordering fixed, we can:
- Add composite indexes: multi-column indexes now work correctly
- Enable range scans: `WHERE name >= 'A' AND name < 'B'` returns correct results
- Add LIKE optimization: `WHERE name LIKE 'A%'` can use indexes
- Test with real data: load realistic datasets and verify ordering
The fix also unblocked work on:
- Secondary indexes on text columns
- Full-text search (needs correct ordering for prefix scans)
- JSON path indexes (JSON keys are strings)
Try It Yourself
```sh
# Clone Kimberlite, then from the repo root:

# Run the tests (standard cargo invocation assumed)
cargo test

# Run property tests with 10k iterations
PROPTEST_CASES=10000 cargo test
```
Create a table with text indexes and verify ordering:
```rust
// Reconstructed sketch: the method names (open, create_tenant, execute,
// query) come from the original skeleton; arguments and the printed row
// format are assumptions.
use kimberlite::*;

let db = open("demo.kdb")?;
let tenant = db.create_tenant("demo")?;

tenant.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")?;
tenant.execute("INSERT INTO users VALUES (1, 'b')")?;
tenant.execute("INSERT INTO users VALUES (2, 'aaaaaaa')")?;
tenant.execute("INSERT INTO users VALUES (3, 'a')")?;

let rows = tenant.query("SELECT name FROM users ORDER BY name")?;
// Should return: a, aaaaaaa, b (correct lexicographic order)
for row in rows {
    println!("{row:?}");
}
```
Output:
a
aaaaaaa
b
Correct ordering.
The Numbers
Test Results
Unit tests (key_encoder): 13/13 passed
Property tests (text_ordering): 10,000/10,000 passed
Property tests (bytes_ordering): 10,000/10,000 passed
Total kimberlite-query tests: 276/277 passed (1 ignored)
Space Savings
| String Length | Before (length-prefix) | After (null-term) | Savings |
|---|---|---|---|
| 5 chars | 10 bytes | 7 bytes | −30% |
| 10 chars | 15 bytes | 12 bytes | −20% |
| 50 chars | 55 bytes | 52 bytes | −5% |
| Empty string | 5 bytes | 2 bytes | −60% |
For typical application data (names, emails, short descriptions), we save 20-30% on key size.
Performance
Encoding: O(n) → O(n) (unchanged)
Decoding: O(n) → O(n) (unchanged)
No regression. The escape sequence check is a simple if byte == 0x00, which compiles to a single comparison.
Found an issue? Open a PR. We’re building in public—bugs, fixes, and lessons learned.
Want more stories about building Kimberlite? Check out our post on how VOPR (our deterministic simulator) found five subtle bugs in our linearizability checker, and learn about our Pressurecraft coding philosophy.
Further Reading
- FoundationDB Tuple Layer Specification — The inspiration for this fix. Excellent documentation on null-terminated encoding with escape sequences.
- SQLite File Format: Record Format — SQLite’s variable-length encoding approach, which uses a different strategy (varints) to solve similar problems.
- LevelDB Implementation Notes — LevelDB uses length-prefixed keys for parsing efficiency, but relies on separate comparators for ordering.
- Testing Distributed Systems for Linearizability — Kyle Kingsbury (Aphyr) on linearizability checking, which we use in our VOPR simulator.