
When "aaaaaaa" > "b": Fixing Lexicographic Ordering in B+Tree Keys

We discovered a subtle but critical bug in Kimberlite’s B+Tree index encoding: text and bytes keys didn’t preserve lexicographic ordering. This meant range scans and index lookups could return incorrect results.

The bug? We were using length-prefix encoding.

The fix? Null-terminated strings with escape sequences (inspired by FoundationDB).

The bonus? We saved 3 bytes per key.

The Bug

In Kimberlite, every indexed column gets encoded into a sortable byte sequence for B+Tree storage. For integers, this is straightforward—sign-flip encoding ensures -1 sorts before 0, which sorts before 1.
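As a quick illustration, here is a minimal sketch of sign-flip encoding. The function name is mine, not Kimberlite's actual API:

fn encode_i64_sortable(v: i64) -> [u8; 8] {
    // Flipping the sign bit maps i64::MIN..=i64::MAX onto 0..=u64::MAX,
    // so the big-endian bytes compare in numeric order.
    ((v as u64) ^ (1u64 << 63)).to_be_bytes()
}

#[test]
fn sign_flip_orders_integers() {
    // -1 → 0x7FFF…, 0 → 0x8000…00, 1 → 0x8000…01
    assert!(encode_i64_sortable(-1) < encode_i64_sortable(0));
    assert!(encode_i64_sortable(0) < encode_i64_sortable(1));
}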

For text, we thought we had it figured out:

// Original (WRONG) implementation
Value::Text(s) => {
    buf.push(0x02);  // Type tag
    let bytes = s.as_bytes();
    buf.extend_from_slice(&(bytes.len() as u32).to_be_bytes());  // 4 bytes of length
    buf.extend_from_slice(bytes);  // The actual string
}

The problem? Length dominates comparison.

When you compare two encoded keys byte-by-byte (which is what B+Trees do), the length prefix is compared first:

"b"       → [0x02][0x00 0x00 0x00 0x01]['b']
              ^^^^  ^^^^^^^^^^^^^^^^^^^
              type  length = 1

"aaaaaaa" → [0x02][0x00 0x00 0x00 0x07]['a' 'a' 'a' 'a' 'a' 'a' 'a']
              ^^^^  ^^^^^^^^^^^^^^^^^^^
              type  length = 7

Byte-by-byte comparison:

  1. Type tags match: 0x02 == 0x02
  2. Length comparison: 0x00000001 < 0x00000007
  3. Result: "b" < "aaaaaaa" WRONG!

In reality, "aaaaaaa" should come before "b" in lexicographic order.
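You can reproduce the inversion in isolation. This sketch re-implements the buggy encoder inline rather than calling Kimberlite's real one:

fn encode_length_prefixed(s: &str) -> Vec<u8> {
    let mut buf = vec![0x02];                               // type tag
    buf.extend_from_slice(&(s.len() as u32).to_be_bytes()); // 4-byte length
    buf.extend_from_slice(s.as_bytes());                    // data
    buf
}

#[test]
fn length_prefix_breaks_ordering() {
    assert!("aaaaaaa" < "b"); // the true lexicographic order
    // But the encoded keys compare the other way: the length byte
    // 0x01 beats 0x07 before any character is examined.
    assert!(encode_length_prefixed("b") < encode_length_prefixed("aaaaaaa"));
}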

The Impact

This breaks any query that relies on ordering:

CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE INDEX idx_name ON users(name);

INSERT INTO users VALUES
    (1, 'alice'),
    (2, 'bob'),
    (3, 'aaaaaaa');

-- This should return: aaaaaaa, alice, bob
SELECT * FROM users ORDER BY name;

-- With length-prefix encoding, we get: bob, alice, aaaaaaa
-- Because the length prefix compares first:
-- len('bob')=3 < len('alice')=5 < len('aaaaaaa')=7

Range scans are similarly broken:

-- Find all names starting with 'a'
SELECT * FROM users WHERE name >= 'a' AND name < 'b';

-- Expected: aaaaaaa, alice
-- Actual: nothing! Both keys sort after 'b', because their
-- length prefixes (7 and 5) compare greater than len('b') = 1

This is a correctness bug, not a performance issue. Indexes return wrong results.

The Fix: Null-Terminated Encoding

The solution is to use null-terminated strings instead of length-prefixed strings. This is the approach used by FoundationDB and documented in their tuple layer spec.

Basic Idea

Instead of encoding length explicitly, we append a 0x00 byte to terminate the string:

Value::Text(s) => {
    buf.push(0x02);  // Type tag
    buf.extend_from_slice(s.as_bytes());  // UTF-8 bytes
    buf.push(0x00);  // Terminator
}

Now our strings look like:

"b"       → [0x02]['b'][0x00]
"aaaaaaa" → [0x02]['a']['a']['a']['a']['a']['a']['a'][0x00]

Byte-by-byte comparison:

  1. Type tags match: 0x02 == 0x02
  2. First character: 'a' (0x61) < 'b' (0x62)
  3. Result: "aaaaaaa" < "b" CORRECT!
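The same check as a self-contained test, using an inline copy of the terminated encoder (no escaping yet; that comes next):

fn encode_terminated(s: &str) -> Vec<u8> {
    let mut buf = vec![0x02];            // type tag
    buf.extend_from_slice(s.as_bytes()); // UTF-8 bytes
    buf.push(0x00);                      // terminator
    buf
}

#[test]
fn null_termination_preserves_ordering() {
    // Comparison now reaches the data bytes first: 'a' < 'b'.
    assert!(encode_terminated("aaaaaaa") < encode_terminated("b"));
    // The terminator also makes a prefix sort before its extensions.
    assert!(encode_terminated("a") < encode_terminated("ab"));
}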

The Embedded Null Problem

But wait—what if the string itself contains a null byte?

"a\0b" → [0x02]['a'][0x00]['b'][0x00]
                      ^^^^
                      Is this the terminator or part of the string?

We need an escape sequence.

Escape Sequence Magic

The trick: encode embedded nulls as 0x00 0xFF, and use a bare 0x00 as the terminator:

Value::Text(s) => {
    buf.push(0x02);
    for &byte in s.as_bytes() {
        if byte == 0x00 {
            buf.push(0x00);
            buf.push(0xFF);  // Escape embedded null
        } else {
            buf.push(byte);
        }
    }
    buf.push(0x00);  // Terminator
}

Now:

"a\0b" → [0x02]['a'][0x00 0xFF]['b'][0x00]
                      ^^^^^^^       ^^^^
                      escaped null  terminator

"a"    → [0x02]['a'][0x00]
                     ^^^^
                     terminator

Why does this preserve ordering? Because the escaped pair 0x00 0xFF sorts after a bare terminator 0x00 (at the end of a key, the shorter encoding is a strict prefix of the longer one, and prefixes sort first):

  • Terminator: [0x00]
  • Escaped null: [0x00][0xFF]

When comparing byte-by-byte:

  1. "a"[0x02]['a'][0x00]
  2. "a\0b"[0x02]['a'][0x00 0xFF]['b'][0x00]
  3. At position 2: 0x00 < 0x00 (equal)
  4. At position 3: nothing vs 0xFF
  5. Result: "a" < "a\0b" CORRECT!
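And the escaped version as a self-contained test, again with the encoder inlined rather than calling Kimberlite's real one:

fn encode_escaped(s: &str) -> Vec<u8> {
    let mut buf = vec![0x02];
    for &b in s.as_bytes() {
        if b == 0x00 {
            buf.extend_from_slice(&[0x00, 0xFF]); // escape embedded null
        } else {
            buf.push(b);
        }
    }
    buf.push(0x00); // terminator
    buf
}

#[test]
fn escaped_nulls_preserve_ordering() {
    assert!(encode_escaped("a") < encode_escaped("a\0b"));
    // The escaped pair starts with 0x00, which sorts below any
    // non-null data byte, so "a\0b" < "ab" holds too.
    assert!(encode_escaped("a\0b") < encode_escaped("ab"));
}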

The Implementation

Encoding (lines 240-248 in key_encoder.rs)

Value::Text(s) => {
    buf.push(0x02);
    for &byte in s.as_bytes() {
        if byte == 0x00 {
            buf.push(0x00);
            buf.push(0xFF);  // Escape embedded null
        } else {
            buf.push(byte);
        }
    }
    buf.push(0x00);  // Terminator
}

Decoding (lines 325-348)

fn decode_text_value(bytes: &[u8], pos: &mut usize) -> Value {
    let mut result = Vec::new();

    while *pos < bytes.len() {
        debug_assert!(*pos <= bytes.len(), "position out of bounds");
        let byte = bytes[*pos];
        *pos += 1;

        if byte == 0x00 {
            if *pos < bytes.len() && bytes[*pos] == 0xFF {
                result.push(0x00);  // Escaped null
                *pos += 1;
            } else {
                break;  // Terminator
            }
        } else {
            result.push(byte);
        }
    }

    let s = std::str::from_utf8(&result)
        .expect("Text decode failed: invalid UTF-8");
    Value::Text(s.to_string())
}
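Assuming the caller has already consumed the type tag before handing off the buffer (an assumption about the surrounding code, which isn't shown here) and that Value implements PartialEq, a round-trip through the decoder looks like:

#[test]
fn decode_handles_escaped_nulls() {
    // "a\0b" encoded: tag, 'a', escaped null (0x00 0xFF), 'b', terminator.
    let encoded = [0x02, b'a', 0x00, 0xFF, b'b', 0x00];

    let mut pos = 1; // start past the type tag
    let value = decode_text_value(&encoded, &mut pos);

    assert_eq!(value, Value::Text("a\0b".to_string()));
    assert_eq!(pos, encoded.len()); // consumed through the terminator
}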

Verification: Property-Based Testing

We can’t just test a few examples. We need property-based tests that verify ordering holds for arbitrary strings.

Using proptest, we generate thousands of random string pairs and verify the invariant:

proptest! {
    #[test]
    fn text_ordering_preserved(
        a in "[\\x00-\\x7F]{0,50}",  // Any ASCII string up to 50 chars
        b in "[\\x00-\\x7F]{0,50}"
    ) {
        let key_a = encode_key(&[Value::Text(a.clone())]);
        let key_b = encode_key(&[Value::Text(b.clone())]);

        // The encoded keys must have the same ordering as the strings
        prop_assert_eq!(a.cmp(&b), key_a.cmp(&key_b));
    }
}

We run this with 10,000 iterations:

$ PROPTEST_CASES=10000 cargo test text_ordering_preserved

running 1 test
test tests::property_tests::text_ordering_preserved ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; finished in 1.27s

10,000 random string pairs, all ordering preserved.

We do the same for bytes:

proptest! {
    #[test]
    fn bytes_ordering_preserved(a: Vec<u8>, b: Vec<u8>) {
        let key_a = encode_key(&[Value::Bytes(Bytes::from(a.clone()))]);
        let key_b = encode_key(&[Value::Bytes(Bytes::from(b.clone()))]);
        prop_assert_eq!(a.cmp(&b), key_a.cmp(&key_b));
    }
}

10,000 random byte-sequence pairs, all ordering preserved.

Edge Cases

We added comprehensive unit tests for edge cases that property tests might miss:

The Original Bug Case

#[test]
fn test_text_ordering_original_bug_case() {
    let short = encode_key(&[Value::Text("b".to_string())]);
    let long = encode_key(&[Value::Text("aaaaaaa".to_string())]);
    assert!(long < short, "aaaaaaa should be < b");
}

Embedded Nulls

#[test]
fn test_text_with_embedded_nulls() {
    let cases = ["abc", "a\0bc", "a\0\0bc", "\0abc", "abc\0"];

    // Round-trip: encode then decode should return original
    for s in &cases {
        let key = encode_key(&[Value::Text(s.to_string())]);
        let decoded = decode_key(&key);
        assert_eq!(decoded, vec![Value::Text(s.to_string())]);
    }

    // Ordering: encoded keys must sort the same as strings
    let keys: Vec<_> = cases.iter()
        .map(|s| encode_key(&[Value::Text(s.to_string())]))
        .collect();

    for i in 0..keys.len() - 1 {
        assert_eq!(
            cases[i].cmp(cases[i + 1]),
            keys[i].cmp(&keys[i + 1])
        );
    }
}

Empty Strings

#[test]
fn test_empty_text_and_bytes() {
    let text = encode_key(&[Value::Text("".to_string())]);
    let bytes = encode_key(&[Value::Bytes(Bytes::new())]);

    assert_eq!(decode_key(&text), vec![Value::Text("".to_string())]);
    assert_eq!(decode_key(&bytes), vec![Value::Bytes(Bytes::new())]);
}

High Byte Values

#[test]
fn test_bytes_ordering_with_high_byte_values() {
    let cases: &[&[u8]] = &[
        &[0x00],
        &[0x00, 0x00],
        &[0x01],
        &[0x7F],
        &[0xFF],
        &[0xFF, 0x00],
        &[0xFF, 0xFE],
        &[0xFF, 0xFF],
    ];

    let keys: Vec<_> = cases.iter()
        .map(|&data| encode_key(&[Value::Bytes(Bytes::from(data))]))
        .collect();

    for i in 0..keys.len() - 1 {
        assert_eq!(
            cases[i].cmp(cases[i + 1]),
            keys[i].cmp(&keys[i + 1])
        );
    }
}

All tests pass.

The Bonus: Space Savings

Length-prefix encoding uses 4 bytes for the length:

"hello" → [type:1][len:4][data:5] = 10 bytes

Null-termination uses 1 byte for the terminator:

"hello" → [type:1][data:5][term:1] = 7 bytes

We save 3 bytes per string in the common case (no embedded nulls).

Worst case (string is all nulls):

"\0\0\0" → [type:1][0x00 0xFF][0x00 0xFF][0x00 0xFF][term:1] = 8 bytes

That’s two bytes per character plus the tag and terminator, which overtakes length-prefix encoding once a string contains more than three nulls. But in practice, strings with many embedded nulls are rare.

For typical text data (names, descriptions, JSON keys), we get both:

  • Correct ordering
  • Smaller keys (−30% for 5-char strings, −20% for 10-char strings)

Breaking Change Notice

This fix changes the on-disk format for text and bytes keys. Existing indexes must be rebuilt.

For a compliance-first database in pre-release, this is acceptable—we’d rather fix correctness bugs now than after production deployments.

If we needed backward compatibility, we could:

  1. Add a version byte to the key format (see the sketch below)
  2. Support both encodings during a transition period
  3. Provide a migration utility

But for now, we’re prioritizing correctness over compatibility. Build it right first.
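For what it’s worth, option 1 could be as small as the following sketch. None of this exists in Kimberlite today, and decode_key_v1/decode_key_v2 are hypothetical stand-ins for the two decoders. One design note: the version would have to live in the index header rather than inside each key, because a per-key version prefix would make all v1 keys sort before all v2 keys and break range scans over a mixed index.

// Hypothetical sketch only: the format version is stored once in the
// index header, so keys within a single index still compare byte-wise.
enum KeyFormat {
    V1LengthPrefix,   // legacy length-prefix encoding
    V2NullTerminated, // current null-terminated encoding
}

fn decode_key_versioned(format: &KeyFormat, bytes: &[u8]) -> Vec<Value> {
    match format {
        KeyFormat::V1LengthPrefix => decode_key_v1(bytes),
        KeyFormat::V2NullTerminated => decode_key_v2(bytes),
    }
}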

Lessons Learned

1. Length-Prefix ≠ Lexicographic

Length-prefix encoding is great for:

  • Parsing (know how many bytes to read)
  • Validation (detect truncation)
  • Self-describing formats (protobufs, msgpack)

But it’s terrible for:

  • Sortable keys (length dominates comparison)
  • B+Trees (need byte-wise lexicographic ordering)
  • Range scans (breaks ordering invariants)

Use the right encoding for the job.

2. Property Tests Catch Ordering Bugs

Unit tests might check "a" < "b", but miss "aaaaaaa" < "b". Property tests generate thousands of random cases, including edge cases like:

  • Empty strings
  • Single characters
  • Very long strings
  • Strings with embedded nulls
  • Strings with high byte values

Property tests found subtle cases our hand-written tests missed.

3. Learn From Production Systems

We didn’t invent null-terminated encoding. We copied it from FoundationDB, which has been battle-tested in production for years.

When building a database, steal from the best:

  • FoundationDB: tuple encoding, deterministic simulation
  • SQLite: B+Tree design, query planner
  • PostgreSQL: MVCC, WAL design
  • RocksDB: LSM trees, bloom filters

Standing on the shoulders of giants means fewer bugs and faster progress.

4. Assertions Are Documentation

Our decode function has assertions like:

debug_assert!(*pos <= bytes.len(), "position out of bounds");
debug_assert!(
    std::str::from_utf8(&result).is_ok(),
    "decoded text must be valid UTF-8"
);

These serve two purposes:

  1. Catch bugs early (in debug builds)
  2. Document invariants (what must be true at this point)

Following Pressurecraft’s “assertion density” principle, we aim for 2+ assertions per function. If you can’t assert it, you might not understand it.

5. Fix Bugs With Tests

We didn’t just fix the code—we added tests that would have caught the bug:

  1. test_text_ordering_original_bug_case - the exact bug that triggered this work
  2. Property tests with 10,000 iterations
  3. Edge case tests for embedded nulls, empty strings, etc.

Now if someone accidentally reverts the fix, the tests will fail immediately.

What’s Next

With text/bytes ordering fixed, we can:

  1. Add composite indexes - multi-column indexes now work correctly
  2. Enable range scans - WHERE name >= 'A' AND name < 'B' returns correct results
  3. Add LIKE optimization - WHERE name LIKE 'A%' can use indexes
  4. Test with real data - load realistic datasets and verify ordering

The fix also unblocked work on:

  • Secondary indexes on text columns
  • Full-text search (needs correct ordering for prefix scans)
  • JSON path indexes (JSON keys are strings)

Try It Yourself

# Clone Kimberlite
git clone https://github.com/yourusername/kimberlite
cd kimberlite

# Run the tests
just test-one test_text_ordering_original_bug_case

# Run property tests with 10k iterations
PROPTEST_CASES=10000 just test-one text_ordering_preserved

Create a table with text indexes and verify ordering:

use kimberlite::*;

let db = Kimberlite::open("test.db")?;
let tenant = db.create_tenant("test")?;

tenant.execute("
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT
    )
", &[])?;

tenant.execute("INSERT INTO users VALUES (1, 'aaaaaaa')", &[])?;
tenant.execute("INSERT INTO users VALUES (2, 'b')", &[])?;
tenant.execute("INSERT INTO users VALUES (3, 'a')", &[])?;

let rows = tenant.query("SELECT * FROM users ORDER BY name", &[])?;

// Should return: a, aaaaaaa, b (correct lexicographic order)
for row in rows {
    println!("{}", row.get::<String>("name")?);
}

Output:

a
aaaaaaa
b

Correct ordering.

The Numbers

Test Results

Unit tests (key_encoder):           13/13 passed
Property tests (text_ordering):     10,000/10,000 passed
Property tests (bytes_ordering):    10,000/10,000 passed
Total kimberlite-query tests:       276/277 passed (1 ignored)

Space Savings

String Length    Before (length-prefix)    After (null-term)    Savings
5 chars          10 bytes                  7 bytes              −30%
10 chars         15 bytes                  12 bytes             −20%
50 chars         55 bytes                  52 bytes             −5%
Empty string     5 bytes                   2 bytes              −60%

For typical application data (names, emails, short descriptions), we save 20-30% on key size.

Performance

Encoding: O(n) → O(n) (unchanged)
Decoding: O(n) → O(n) (unchanged)

No regression. The escape sequence check is a simple if byte == 0x00, which compiles to a single comparison.


Found an issue? Open a PR. We’re building in public—bugs, fixes, and lessons learned.

Want more stories about building Kimberlite? Check out our post on how VOPR (our deterministic simulator) found five subtle bugs in our linearizability checker, and learn about our Pressurecraft coding philosophy.


Further Reading