Storage - Append-Only Log
The append-only log is Kimberlite’s storage foundation.
Design Principles
- Append-only: Never modify or delete entries
- Sequential writes: Always append to the end
- Checksummed: CRC32 on every entry
- Hash-chained: Each entry links to previous
- Segment-based: Split into manageable files
Directory Structure
data/
├── log/
│ ├── 00000000.segment # Segment 0: positions 0-999999
│ ├── 00000001.segment # Segment 1: positions 1000000-1999999
│ ├── 00000002.segment # Segment 2: current active segment
│ └── index.meta # Segment index metadata
├── checkpoints/
│ ├── checkpoint-1000000 # Checkpoint at position 1M
│ └── checkpoint-2000000 # Checkpoint at position 2M
└── tenants/
├── 00000001/ # Tenant 1 data
└── 00000002/ # Tenant 2 data
Segment Structure
Segments are fixed-size files (default: 1GB):
┌─────────────────────────────────────────────────────────────────┐
│ Segment File │
│ │
│ ┌──────────┬──────────┬──────────┬──────────┬─────────────┐ │
│ │ Record 0 │ Record 1 │ Record 2 │ ... │ Record N │ │
│ └──────────┴──────────┴──────────┴──────────┴─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Properties:
- Sequential writes (append-only)
- Preallocated (reduces fragmentation)
- Immutable once sealed
- Checksummed per-segment
Record Format
Each record is self-describing:
┌──────────────────────────────────────────────────────────────────┐
│ Record (on disk) │
│ │
│ ┌─────────┬──────────┬───────────┬──────────┬────────────────┐ │
│ │ Length │ Checksum │ Prev Hash │ Metadata │ Data │ │
│ │ (4 B) │ (4 B) │ (32 B) │ (var) │ (var) │ │
│ └─────────┴──────────┴───────────┴──────────┴────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
Field Details
Length (4 bytes)
- Type: u32 (little-endian)
- Purpose: Total record size in bytes
- Range: 80 bytes (minimum: 8-byte header + 32-byte prev_hash + 40-byte metadata, empty payload) to 16MB (maximum)
- Not checksummed: Allows reading length before validating
Checksum (4 bytes)
- Type: CRC32 (IEEE polynomial)
- Purpose: Integrity verification
- Covers: All fields except length (prev_hash + metadata + data)
- Validation: On read, recompute and compare
Prev Hash (32 bytes)
- Type: SHA-256 hash
- Purpose: Hash of previous record (tamper-evident chain)
- Special case: First record has all-zero prev_hash
Metadata (variable)
- Position: u64 (8 bytes)
- Tenant ID: u64 (8 bytes)
- Stream ID: u64 (8 bytes)
- Timestamp: i64 (8 bytes, Unix timestamp in microseconds)
- Event Type: u16 (2 bytes)
- Reserved: 6 bytes for future use
- Total: 40 bytes
Data (variable)
- Type: Arbitrary bytes
- Purpose: Application payload
- Serialization: Determined by event type (typically postcard or bincode)
- Compression: Optional (zstd for large payloads)
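The layout above can be sketched in Rust. The struct and function names here are illustrative, not Kimberlite's actual types, and the CRC32 is a compact bitwise IEEE implementation to keep the sketch self-contained (a real implementation would use a table-driven or hardware-accelerated CRC):

```rust
/// Illustrative metadata block (40 bytes on disk, little-endian).
struct Metadata {
    position: u64,
    tenant_id: u64,
    stream_id: u64,
    timestamp_us: i64,
    event_type: u16,
    // + 6 reserved bytes, written as zeros
}

/// Bitwise CRC-32 (IEEE polynomial), for illustration only.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Encode one record: length | checksum | prev_hash | metadata | data.
/// The checksum covers everything after the 8-byte header, matching the
/// rule that the length field is readable before validation.
fn encode_record(prev_hash: &[u8; 32], meta: &Metadata, data: &[u8]) -> Vec<u8> {
    let mut body = Vec::with_capacity(32 + 40 + data.len());
    body.extend_from_slice(prev_hash);
    body.extend_from_slice(&meta.position.to_le_bytes());
    body.extend_from_slice(&meta.tenant_id.to_le_bytes());
    body.extend_from_slice(&meta.stream_id.to_le_bytes());
    body.extend_from_slice(&meta.timestamp_us.to_le_bytes());
    body.extend_from_slice(&meta.event_type.to_le_bytes());
    body.extend_from_slice(&[0u8; 6]); // reserved for future use
    body.extend_from_slice(data);

    let total = (8 + body.len()) as u32; // length + checksum header is 8 bytes
    let mut record = Vec::with_capacity(total as usize);
    record.extend_from_slice(&total.to_le_bytes());
    record.extend_from_slice(&crc32(&body).to_le_bytes());
    record.extend_from_slice(&body);
    record
}
```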
Hash Chaining
Every record includes the hash of the previous record:
Record N-1 Record N Record N+1
┌─────────┐ ┌─────────┐ ┌─────────┐
│ data │ │ data │ │ data │
│ hash ───┼────────►│ prev │ │ prev │
│ = H1 │ │ hash ───┼────────►│ hash │
│ │ │ = H1 │ │ = H2 │
│ │ │ hash │ │ hash │
│ │ │ = H2 │ │ = H3 │
└─────────┘ └─────────┘ └─────────┘
H1 = SHA256(Record N-1)
H2 = SHA256(Record N)
H3 = SHA256(Record N+1)
Verification: on startup or on demand, walk the log in order and check that each record's prev_hash equals the hash of the preceding record (and that the first record carries the all-zero genesis hash).
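A minimal sketch of that walk. Records are reduced to (prev_hash, payload) pairs, and `record_hash` is a stand-in with the right 32-byte shape, since SHA-256 is not in Rust's standard library; the real chain hashes each full record with SHA-256:

```rust
struct ChainedRecord {
    prev_hash: [u8; 32],
    payload: Vec<u8>, // stands in for the rest of the on-disk record
}

/// Stand-in hash (FNV-1a folded into 32 bytes). Illustration only --
/// the actual log uses SHA-256 over the full record.
fn record_hash(prev_hash: &[u8; 32], payload: &[u8]) -> [u8; 32] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in prev_hash.iter().chain(payload) {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    let mut out = [0u8; 32];
    for (i, chunk) in out.chunks_mut(8).enumerate() {
        chunk.copy_from_slice(&h.rotate_left(8 * i as u32).to_le_bytes());
    }
    out
}

/// Verify the chain: record 0 must carry the all-zero genesis prev_hash,
/// and record i must carry the hash of record i-1. Returns the index of
/// the first broken link, if any.
fn verify_chain(records: &[ChainedRecord]) -> Result<(), usize> {
    let mut expected = [0u8; 32];
    for (i, rec) in records.iter().enumerate() {
        if rec.prev_hash != expected {
            return Err(i);
        }
        expected = record_hash(&rec.prev_hash, &rec.payload);
    }
    Ok(())
}
```

Note that tampering with record N is detected at record N+1: the stored prev_hash there no longer matches the recomputed hash.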
Write Path
Appending to the log: encode the record (length, checksum, prev_hash, metadata, data), write it at the tail of the active segment, then fsync to make it durable.
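The append step with group commit can be sketched with only the standard library. The function name and batching policy are illustrative; the real write path layers buffering, `O_DIRECT`, and segment rollover on top of this:

```rust
use std::fs::OpenOptions;
use std::io::{BufWriter, Write};

/// Append a batch of pre-encoded records to the active segment, then
/// fsync once for the whole batch (group commit). Returns bytes written.
fn append_batch(path: &str, records: &[Vec<u8>]) -> std::io::Result<u64> {
    let file = OpenOptions::new().create(true).append(true).open(path)?;
    let mut writer = BufWriter::new(&file);
    let mut written = 0u64;
    for rec in records {
        writer.write_all(rec)?;
        written += rec.len() as u64;
    }
    writer.flush()?;   // push buffered bytes to the OS
    file.sync_data()?; // one fsync makes the whole batch durable
    Ok(written)
}
```

Amortizing the single `sync_data` over the batch is what makes group commit cheaper than an fsync per record.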
Performance optimizations:
- Batch writes: Append multiple records before fsync
- Buffered IO: Use 4KB buffers aligned to page size
- Direct IO: Use O_DIRECT to bypass the page cache (optional)
- Group commit: Batch fsync for multiple appends
Read Path
Reading from the log: read the 4-byte length first, then the rest of the record, verify the CRC32 checksum, and decode the metadata and data.
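A sketch of that decode step against an in-memory buffer; the framing matches the record format above (the length field sits outside the checksum, so framing never depends on validation), and the CRC32 is the same compact bitwise IEEE implementation used for illustration:

```rust
/// Bitwise CRC-32 (IEEE polynomial), for illustration only.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Read one record from `buf` at `offset`: read the 4-byte length, then
/// verify the CRC32 over everything after the 8-byte header. Returns the
/// record body and the offset of the next record.
fn read_record(buf: &[u8], offset: usize) -> Result<(&[u8], usize), &'static str> {
    let header = buf.get(offset..offset + 8).ok_or("truncated header")?;
    let len = u32::from_le_bytes(header[0..4].try_into().unwrap()) as usize;
    if len < 8 {
        return Err("invalid length");
    }
    let stored = u32::from_le_bytes(header[4..8].try_into().unwrap());
    let body = buf.get(offset + 8..offset + len).ok_or("truncated body")?;
    if crc32(body) != stored {
        return Err("checksum mismatch");
    }
    Ok((body, offset + len))
}
```

Sequential scans simply call this in a loop, feeding each returned offset back in until the segment is exhausted.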
Segment Management
Segment Creation
New segment created when:
- Current segment reaches 1GB (default)
- Manually triggered via seal_segment()
Segment Sealing
Once sealed, segments become immutable: they accept no further appends.
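A minimal sketch of what sealing can look like at the filesystem level. This is an assumption about the mechanism, not Kimberlite's actual seal_segment(), which presumably also updates index.meta:

```rust
use std::fs::OpenOptions;

/// Seal a segment: durably flush it, then drop write permission so the
/// file is immutable at the filesystem level. Illustrative only.
fn seal_segment(path: &str) -> std::io::Result<()> {
    let file = OpenOptions::new().write(true).open(path)?;
    file.sync_all()?; // flush data and metadata before sealing
    let mut perms = file.metadata()?.permissions();
    perms.set_readonly(true); // reject further appends
    std::fs::set_permissions(path, perms)?;
    Ok(())
}
```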
Segment Compaction (Future)
Old segments can be compacted; this is planned for v0.6.0.
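Since compaction is still planned, the following is only a sketch of the rule implied by the Checkpoints section below (a segment is deletable once a checkpoint covers every position in it); the function and its signature are assumptions:

```rust
/// Sketch of the planned compaction rule: a segment whose entire
/// (first, last) position range lies below the checkpoint position can
/// be deleted. Segment boundaries here are illustrative.
fn compactable_segments(ranges: &[(u64, u64)], checkpoint_pos: u64) -> Vec<usize> {
    ranges
        .iter()
        .enumerate()
        .filter(|&(_, &(_, last))| last < checkpoint_pos)
        .map(|(i, _)| i)
        .collect()
}
```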
Checkpoints
Checkpoints capture state at a specific position:
┌────────────────────────────────────────────────────────────┐
│ Checkpoint File │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Metadata │ │
│ │ - Position: 1000000 │ │
│ │ - Timestamp: 2024-01-15 10:30:00 │ │
│ │ - Hash: abc123... │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ State Snapshot │ │
│ │ - Tenants: [...] │ │
│ │ - Tables: [...] │ │
│ │ - Projections: [...] │ │
│ └────────────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
Usage:
- Speed up recovery (replay from checkpoint, not from genesis)
- Enable log compaction (delete segments before checkpoint)
Status: Planned for v0.6.0
Performance Characteristics
Write throughput:
- Single-threaded: 50k-100k ops/sec
- Bottleneck: Disk bandwidth (500 MB/s SSD → ~50k ops/sec at 10KB/op)
- With group commit: 100k-200k ops/sec
Read throughput:
- Sequential scan: 1-2 GB/sec (SSD bandwidth limited)
- Random reads: 50k-100k ops/sec
Latency:
- Write (no fsync): <1ms
- Write (with fsync): 1-10ms (depends on disk)
- Read (from page cache): <0.1ms
- Read (from disk): 0.5-5ms
Fault Tolerance
Corruption Detection
Every read verifies:
- CRC32 checksum (detect bit flips)
- Hash chain (detect tampering)
Torn Writes
If power fails mid-write, the final record may be only partially written. On recovery, that partial record fails its length or CRC32 check, and the log is truncated back to the last valid record; all earlier records are unaffected.
Disk Failure
If the disk fails:
- VSR replicates to other nodes
- Repair from healthy replica
See Consensus for details.
Testing
Storage is tested extensively:
- Unit tests: CRC32, hash chaining, segment management
- Property tests: Append/read round-trips
- Corruption tests: Inject bit flips, verify detection
- VOPR scenarios: Three simulation scenarios exercise corruption handling
See VOPR Scenarios - Phase 2.
Related Documentation
- Data Model - How the log maps to events
- Cryptography - Hash algorithms used
- Testing Overview - Storage testing strategies
Key Takeaway: The append-only log is the foundation of Kimberlite. Sequential writes, checksums, and hash chains provide durability, integrity, and tamper evidence.