Performance Guidelines
Kimberlite prioritizes correctness over performance, but that doesn’t mean we ignore performance. This document describes our performance philosophy, optimization priorities, and guidelines for writing efficient code.
Table of Contents
- Philosophy
- Optimization Priority Order
- Mechanical Sympathy
- Batching and Amortization
- Zero-Copy Patterns
- Memory Layout
- I/O Patterns
- Benchmarking
- Anti-Patterns
Philosophy
Correctness First, Then Performance
Performance optimizations are worthless if the system is incorrect. Our priority order:
- Correct: The system does what it’s supposed to do
- Clear: The code is understandable and maintainable
- Fast: The system performs well
Never sacrifice correctness or clarity for performance without explicit justification.
Measure Before Optimizing
“Premature optimization is the root of all evil” — Donald Knuth
Before optimizing:
- Profile: Identify the actual bottleneck
- Measure: Establish a baseline
- Optimize: Make targeted changes
- Verify: Confirm improvement without regression
Predictable is Better Than Fast
A system with predictable latency is often better than one with lower average latency but high variance:
System A: p50=5ms, p99=8ms, p99.9=12ms ← Preferred
System B: p50=3ms, p99=50ms, p99.9=500ms
Design for consistent performance, not peak performance.
Optimization Priority Order
When optimizing, work through resources in this order:
1. Network: most expensive, highest latency
2. Disk: an order of magnitude faster than network
3. Memory: an order of magnitude faster than disk
4. CPU: an order of magnitude faster than memory
Why This Order?
| Resource | Latency | Bandwidth | Cost to Improve |
|---|---|---|---|
| Network | ~1ms | ~1 GB/s | Very expensive |
| SSD | ~100μs | ~3 GB/s | Expensive |
| Memory | ~100ns | ~50 GB/s | Moderate |
| CPU | ~1ns | N/A | Cheap (just wait) |
An optimization that reduces network round-trips will almost always beat one that reduces CPU cycles.
Practical Implications
Network: Batch requests, use persistent connections, minimize round-trips
```rust
// Bad: N round-trips
for item in items {
    client.write(item).await?;
}

// Good: 1 round-trip
client.write_batch(items).await?;
```
Disk: Sequential I/O, batch writes, use fsync strategically
```rust
// Bad: Sync after every write
for record in records {
    file.write_all(&record)?;
    file.sync_all()?; // fsync per record
}

// Good: Batch writes, sync once
for record in records {
    file.write_all(&record)?;
}
file.sync_all()?; // single fsync for the batch
Memory: Avoid allocations in hot paths, reuse buffers
```rust
// Bad: Allocate per iteration
for item in items {
    let buffer = item.to_bytes(); // fresh Vec every pass
    writer.write_all(&buffer)?;
}

// Good: Reuse buffer
let mut buffer = Vec::with_capacity(4096);
for item in items {
    buffer.clear(); // keeps the allocation
    item.write_to(&mut buffer);
    writer.write_all(&buffer)?;
}
```
CPU: Last resort; usually not the bottleneck
// Only optimize CPU when profiling shows it's the bottleneck
// and after all higher-priority optimizations are done
Mechanical Sympathy
Understanding how hardware works leads to better performance.
CPU Cache Hierarchy
┌─────────────────────────────────────────────────────────────┐
│ CPU Core │
│ ┌─────────┐ │
│ │ L1 Cache│ 32KB, ~4 cycles │
│ │ (data) │ │
│ └────┬────┘ │
│ ▼ │
│ ┌─────────┐ │
│ │ L2 Cache│ 256KB, ~12 cycles │
│ └────┬────┘ │
│ ▼ │
│ ┌─────────┐ │
│ │ L3 Cache│ 8-30MB, ~40 cycles (shared) │
│ └────┬────┘ │
│ ▼ │
│ ┌─────────┐ │
│ │ RAM │ ~100+ cycles │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
Implications:
- Keep hot data small (fits in L1/L2)
- Access data sequentially when possible (prefetching)
- Avoid pointer chasing (linked lists are cache-hostile)
Cache-Friendly Data Structures
```rust
// Cache-hostile: each node is a separate heap allocation; traversal
// chases pointers and misses cache on nearly every hop
struct ListNode {
    value: u64,
    next: Option<Box<ListNode>>,
}

// Cache-friendly: contiguous memory; traversal is a sequential scan
// that the hardware prefetcher can stay ahead of
struct SortedValues {
    values: Vec<u64>,
}
```
For B+trees, we use:
- Large node sizes (4KB = page size)
- Contiguous arrays within nodes
- Predictable memory access patterns
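These properties can be sketched as a leaf-node layout; the fan-out and field types here are illustrative, not Kimberlite's actual node format:

```rust
// Illustrative fan-out chosen so the whole node fits in one 4 KiB page
const ORDER: usize = 255;

// Keys and values in contiguous arrays: scanning a node is a sequential
// walk over memory, with no pointer chasing inside the node.
#[repr(C)]
struct LeafNode {
    keys: [u64; ORDER],
    vals: [u64; ORDER],
    len: u16,
}

fn main() {
    // 255*8 + 255*8 + 2 = 4082 bytes of data, padded to 4088: one 4 KiB page
    assert!(std::mem::size_of::<LeafNode>() <= 4096);
}
```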
Page Alignment
Storage is organized into 4KB pages matching OS page size:
```rust
const PAGE_SIZE: usize = 4096; // matches the OS page size
```
Benefits:
- Aligned I/O is faster
- Direct I/O (O_DIRECT) requires alignment
- Memory-mapped I/O works at page granularity
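As a sketch of what alignment means in practice, a page-aligned buffer (as O_DIRECT requires) can be requested from the standard allocator; this is an illustration, not Kimberlite's buffer pool:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

const PAGE_SIZE: usize = 4096;

/// Allocate a zeroed buffer whose address is a multiple of PAGE_SIZE.
fn alloc_pages(pages: usize) -> (*mut u8, Layout) {
    let layout = Layout::from_size_align(pages * PAGE_SIZE, PAGE_SIZE).unwrap();
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null(), "allocation failed");
    (ptr, layout)
}

fn main() {
    let (ptr, layout) = alloc_pages(4);
    // O_DIRECT rejects buffers that are not page-aligned
    assert_eq!(ptr as usize % PAGE_SIZE, 0);
    unsafe { dealloc(ptr, layout) };
}
```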
Batching and Amortization
Batching amortizes fixed costs across multiple operations.
Write Batching
```rust
use tokio::io::AsyncWriteExt;

// Without batching: O(n) fsyncs, one per append
async fn append(file: &mut tokio::fs::File, record: &[u8]) -> std::io::Result<()> {
    file.write_all(record).await?;
    file.sync_all().await // fsync every record
}

// With batching: O(1) fsyncs, one per batch
async fn append_batch(file: &mut tokio::fs::File, records: &[&[u8]]) -> std::io::Result<()> {
    for record in records {
        file.write_all(record).await?;
    }
    file.sync_all().await // cost amortized across the batch
}
```
Network Batching
```rust
// Consensus with batching: propose many commands per round so the network
// round-trip and quorum acknowledgements are amortized
// (sketch: `consensus` and `Command` are illustrative names)
let batch: Vec<Command> = pending.drain(..).collect();
consensus.propose(batch).await?;
```
Read Batching (Prefetching)
```rust
// Prefetch likely-needed pages during a sequential scan
// (sketch: `pager` and its methods are illustrative)
for window in page_numbers.chunks(8) {
    pager.prefetch(window);                 // issue reads for the whole window
    for &page_no in window {
        process(pager.read_page(page_no)?); // hits already-in-flight data
    }
}
```
Zero-Copy Patterns
Avoid copying data when possible.
Bytes Instead of Vec<u8>

```rust
use bytes::Bytes;

// Copying: each read creates a new Vec<u8>
fn read_copy(data: &[u8]) -> Vec<u8> {
    data.to_vec() // allocates and copies
}

// Zero-copy: Bytes is reference-counted; slicing shares the allocation
fn read_zero_copy(data: &Bytes, start: usize, end: usize) -> Bytes {
    data.slice(start..end) // bumps a refcount, copies nothing
}
```
Borrowed vs Owned APIs
```rust
// Prefer borrowing in read-only operations: the caller keeps ownership
fn checksum(data: &[u8]) -> u32 { /* ... */ }

// Taking ownership forces callers to clone or give the value up
fn checksum_owned(data: Vec<u8>) -> u32 { /* ... */ }
```
Memory-Mapped Reads
For large sequential reads, memory mapping avoids copies:
```rust
use memmap2::MmapOptions;
use std::fs::File;

let file = File::open("segment.log")?;
// Safety: the mapped file must not be truncated while the map is live
let mmap = unsafe { MmapOptions::new().map(&file)? };

// Access data directly from the kernel page cache, with no read() copy
let data: &[u8] = &mmap[..];
```
Memory Layout
How data is laid out in memory affects performance.
Struct Layout
```rust
// Bad: poor field order forces padding (with #[repr(C)])
// Total: 32 bytes, but only 18 bytes of data
#[repr(C)]
struct Bad {
    a: u8,  // 1 byte + 7 bytes padding
    b: u64, // 8 bytes
    c: u8,  // 1 byte + 7 bytes padding
    d: u64, // 8 bytes
}

// Good: ordered by size, minimal padding
// Total: 24 bytes, same 18 bytes of data
// (note: without #[repr(C)], rustc already reorders fields for you)
#[repr(C)]
struct Good {
    b: u64,
    d: u64,
    a: u8,
    c: u8,
}
```
Arena Allocation
For many small, same-lifetime allocations:
```rust
use bumpalo::Bump;

// One arena for many short-lived values; everything is freed at once
// when the Bump is dropped, with no per-object free
let arena = Bump::new();
let id = arena.alloc(42u64);
let name = arena.alloc_str("tenant-a");
```
Small String Optimization
For short strings, avoid heap allocation:
```rust
use smartstring::alias::String as SmartString;

// Standard String: always heap-allocated
let s: String = "hello".to_string(); // heap allocation

// SmartString: short strings (up to 23 bytes) are stored inline
let s: SmartString = "hello".into(); // inline, no allocation
```
I/O Patterns
Efficient I/O is critical for database performance.
Sequential vs Random
Sequential read: ~500 MB/s (SSD), ~150 MB/s (HDD)
Random read: ~50 MB/s (SSD), ~1 MB/s (HDD)
The append-only log is designed for sequential I/O:
- Writes are always sequential (append)
- Reads during recovery are sequential (scan)
- Random reads go through indexed projections
Direct I/O
Bypass the kernel page cache for predictable latency:
```rust
use std::fs::OpenOptions;
use std::os::unix::fs::OpenOptionsExt;

// O_DIRECT (Linux): reads and writes bypass the page cache;
// buffers, offsets, and lengths must be page-aligned
let file = OpenOptions::new()
    .read(true)
    .write(true)
    .custom_flags(libc::O_DIRECT)
    .open("segment.log")?;
```
When to use:
- Large sequential writes (log segments)
- When you manage your own cache
When NOT to use:
- Small random reads (kernel cache helps)
- Reads that benefit from prefetching
Modeling and Measurement
Little’s Law for Capacity Planning
C = T × L (Concurrency = Throughput × Latency)
Little’s Law is a fundamental queueing theory principle that helps size bounded queues and thread pools.
Application to Kimberlite:
- Target: 100K appends/sec with p99 < 10ms latency
- Required concurrency: C = 100K/sec × 0.01sec = 1000 concurrent operations
- Monitor: If actual concurrency >> 1000, latency is degrading (queue buildup)
- Use: Size bounded queues and thread pools based on Little’s Law
Implementation:
```rust
// Bounded channel sized by Little's Law: C = T × L
let target_throughput = 100_000.0; // ops/sec
let target_latency_sec = 0.01;     // 10ms
let max_concurrent = (target_throughput * target_latency_sec) as usize; // 1000

// `Request` is illustrative; any bounded channel works here
let (tx, rx) = tokio::sync::mpsc::channel::<Request>(max_concurrent);
```
Validation: Track actual queue depth in metrics. If depth approaches capacity under normal load, either increase capacity or reduce latency to maintain throughput.
Reference: From “Latency: Reduce delay in software systems” by Pekka Enberg (ScyllaDB, Turso)
Benchmarking
Criterion for Micro-Benchmarks
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_serialize(c: &mut Criterion) {
    let record = vec![0u8; 1024];
    c.bench_function("serialize_1kib", |b| {
        b.iter(|| serialize(black_box(&record))) // `serialize` is the code under test
    });
}

criterion_group!(benches, bench_serialize);
criterion_main!(benches);
```
Latency Histograms
Track latency distribution, not just averages:
```rust
use hdrhistogram::Histogram;

// 3 significant figures, values from 1ns to 60s
let mut hist = Histogram::<u64>::new_with_bounds(1, 60_000_000_000, 3)?;
hist.record(latency_ns)?;

println!(
    "p50={}ns p99={}ns p99.9={}ns",
    hist.value_at_quantile(0.50),
    hist.value_at_quantile(0.99),
    hist.value_at_quantile(0.999),
);
```
Empirical Cumulative Distribution Function (eCDF)
An eCDF plot (latency on the y-axis against percentile on the x-axis, as in classic HDR histogram percentile plots) shows tail latency more clearly than a raw histogram:
- Flat line = consistent latency (good)
- Steep curve at the tail = tail latency spike (investigate)
Pattern to detect:
- 99% of requests < 5ms, but p99.9 = 100ms → investigate outliers
- eCDF shows exact percentile for any latency value
Implementation:
```rust
/// Export an HDR histogram as eCDF points: (quantile, latency) pairs
fn export_ecdf(hist: &hdrhistogram::Histogram<u64>) -> Vec<(f64, u64)> {
    hist.iter_quantiles(1)
        .map(|v| (v.quantile(), v.value_iterated_to()))
        .collect()
}

/// Save eCDF data to CSV for Grafana/plotting
fn save_ecdf_csv(points: &[(f64, u64)], path: &str) -> std::io::Result<()> {
    use std::io::Write;
    let mut file = std::fs::File::create(path)?;
    writeln!(file, "quantile,latency_ns")?;
    for (q, v) in points {
        writeln!(file, "{q},{v}")?;
    }
    Ok(())
}
```
Grafana Integration: Plot eCDF curves over time to detect latency regressions visually. A shift right in the eCDF curve indicates degradation.
Load Testing
For end-to-end performance:
```bash
# Run load test with varying client counts
# (binary name and flags are illustrative)
for clients in 1 10 50 100 500; do
    cargo run --release --bin loadtest -- --clients "$clients" --duration 60s
done
```
Anti-Patterns
Avoid: Allocation in Hot Paths
```rust
// Bad: Allocates on every call
fn encode(record: &Record) -> Vec<u8> {
    let mut buf = Vec::new(); // fresh allocation per call
    record.write_to(&mut buf);
    buf
}

// Good: Reuse buffer
fn encode_into(record: &Record, buf: &mut Vec<u8>) {
    buf.clear(); // keeps capacity from previous calls
    record.write_to(buf);
}
```
Avoid: Unbounded Queues
```rust
// Bad: Can grow without limit under load
let (tx, rx) = tokio::sync::mpsc::unbounded_channel::<Request>();

// Good: Backpressure when the queue is full; senders await capacity
let (tx, rx) = tokio::sync::mpsc::channel::<Request>(1024);
```
Avoid: Sync in Loop
```rust
// Bad: N fsyncs
for record in records {
    file.write_all(&record)?;
    file.sync_all()?;
}

// Good: 1 fsync
for record in records {
    file.write_all(&record)?;
}
file.sync_all()?;
```
Avoid: String Formatting in Hot Paths
```rust
// Bad: Allocates and formats the string eagerly
debug!("{}", format!("processing record {}", record.id));

// Good: Use structured logging; fields are captured lazily
debug!(record_id = %record.id, "processing record");
```
Avoid: Clone When Borrow Suffices
```rust
// Bad: Unnecessary clone
process(record.clone());

// Good: Borrow
process(&record);
```
Summary
Kimberlite’s performance philosophy:
- Correctness first: Never sacrifice correctness for speed
- Network → Disk → Memory → CPU: Optimize in this order
- Batch operations: Amortize fixed costs
- Zero-copy: Avoid unnecessary data movement
- Measure everything: Profile before optimizing
- Predictable latency: Prefer consistency over peak performance
When in doubt, write correct, clear code first. Optimize only when profiling shows it’s necessary, and only the specific bottleneck identified.
Benchmark Infrastructure
Running Benchmarks
```bash
# Run all benchmarks
cargo bench

# Run specific benchmark group
cargo bench -- append

# Compare against baseline (CI mode)
cargo bench -- --save-baseline main   # record the baseline
cargo bench -- --baseline main        # compare against it

# Generate HTML report (Criterion writes it automatically)
open target/criterion/report/index.html
```
Benchmark Targets
These targets guide optimization efforts and CI regression detection:
| Operation | Target | Regression Threshold |
|---|---|---|
| chain_hash (SHA-256, 1 KiB) | 500 MB/s | -10% |
| internal_hash (BLAKE3, 1 KiB) | 5 GB/s | -10% |
| internal_hash_parallel (BLAKE3, 1 MiB) | 15 GB/s | -10% |
| encrypt (AES-256-GCM, 1 KiB) | 2 GB/s | -10% |
| record_to_bytes | 1M ops/s | -10% |
| append_batch(100, fsync=false) | 500K TPS | -20% |
| append_batch(100, fsync=true) | 100K TPS | -20% |
| apply_committed | 500K ops/s | -20% |
| read_record (with index) | < 100μs p99 | +50% |
Profiling
```bash
# (bench and binary names below are illustrative)

# Generate flamegraph (requires cargo-flamegraph)
cargo flamegraph --bench append_bench

# Interactive profiling with samply
samply record ./target/release/kimberlite-server

# Linux perf profiling
perf record -g ./target/release/kimberlite-server
perf report

# Latency histogram report
cargo bench -- --verbose
```
Hardware Acceleration
Enabling CPU-Specific Optimizations
Add to .cargo/config.toml:
```toml
# Enable native CPU features (AES-NI, AVX2, etc.)
[target.'cfg(any(target_arch = "x86_64", target_arch = "aarch64"))']
rustflags = ["-Ctarget-cpu=native"]

# For reproducible builds, target a specific CPU level:
# [target.x86_64-unknown-linux-gnu]
# rustflags = ["-Ctarget-cpu=haswell"]  # AES-NI + AVX2
```
Verifying Hardware Acceleration
```bash
# Check if AES-NI is available (Linux)
grep -m1 -o aes /proc/cpuinfo

# Compare hardware vs software performance
RUSTFLAGS="-Ctarget-cpu=generic" cargo bench -- encrypt
RUSTFLAGS="-Ctarget-cpu=native"  cargo bench -- encrypt
```
Expected Speedups
| Feature | CPU Requirement | Speedup |
|---|---|---|
| AES-NI | Intel Westmere+ (2010), AMD Bulldozer+ | 10-20x for AES-GCM |
| ARMv8 Crypto | Apple M1+, AWS Graviton2+ | 10-15x for AES-GCM |
| AVX2 | Intel Haswell+ (2013), AMD Excavator+ | 2-3x for BLAKE3 |
| AVX-512 | Intel Skylake-X+, AMD Zen4+ | 1.5-2x over AVX2 |
I/O Performance
Group Commit Configuration
Kimberlite supports configurable fsync strategies to balance durability and throughput:
Trade-off guidance:
- Compliance-critical: use EveryBatch (each batch is durable)
- High-throughput: use GroupCommit(5ms) (lose up to 5ms of data on crash)
- Testing/Development: use OnFlush (fastest, not durable)
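The strategy choice can be pictured as a small enum; the variant names follow the list above, but the exact types are a sketch rather than Kimberlite's real configuration API:

```rust
use std::time::Duration;

/// Sketch of a configurable fsync strategy.
#[derive(Debug, Clone, PartialEq)]
enum FsyncStrategy {
    /// fsync after every batch: each batch is durable before acking
    EveryBatch,
    /// fsync at most once per window: amortizes fsync cost, but a crash
    /// can lose up to `window` of acknowledged-but-unsynced data
    GroupCommit(Duration),
    /// fsync only on explicit flush: fastest, not durable
    OnFlush,
}

/// Should we fsync now, given the strategy and time since the last sync?
fn should_sync(strategy: &FsyncStrategy, since_last_sync: Duration) -> bool {
    match strategy {
        FsyncStrategy::EveryBatch => true,
        FsyncStrategy::GroupCommit(window) => since_last_sync >= *window,
        FsyncStrategy::OnFlush => false,
    }
}

fn main() {
    let group = FsyncStrategy::GroupCommit(Duration::from_millis(5));
    assert!(should_sync(&FsyncStrategy::EveryBatch, Duration::ZERO));
    assert!(!should_sync(&group, Duration::from_millis(2)));
    assert!(should_sync(&group, Duration::from_millis(5)));
}
```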
Checkpoints
Checkpoints enable fast recovery and verified reads without full log replay:
Checkpoints are created:
- Every 10,000-100,000 records (configurable)
- On graceful shutdown
- On explicit request
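A sketch of what a checkpoint buys: verification replays from the nearest checkpoint at or before the target offset instead of from genesis (field and function names here are illustrative, not Kimberlite's actual types):

```rust
/// Sketch of a checkpoint: a verified snapshot of hash-chain state
/// at a given record offset.
struct Checkpoint {
    record_offset: u64,
    chain_hash: [u8; 32],
}

/// Find the nearest checkpoint at or before `offset`
/// (checkpoints sorted ascending by offset).
fn nearest_checkpoint(checkpoints: &[Checkpoint], offset: u64) -> Option<&Checkpoint> {
    checkpoints.iter().rev().find(|c| c.record_offset <= offset)
}

fn main() {
    let cps = [
        Checkpoint { record_offset: 0, chain_hash: [0; 32] },
        Checkpoint { record_offset: 10_000, chain_hash: [0; 32] },
        Checkpoint { record_offset: 20_000, chain_hash: [0; 32] },
    ];
    // Verifying record 15_000 replays only 5_000 records, not 15_000
    assert_eq!(nearest_checkpoint(&cps, 15_000).unwrap().record_offset, 10_000);
}
```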
Latency Monitoring
Built-in Histograms
Kimberlite tracks latency distributions for critical operations using HDR histograms:
```rust
// Record a latency measurement (method names are illustrative)
metrics.record("append_latency_ns", elapsed.as_nanos() as u64);

// Get percentiles
let report = metrics.report("append_latency_ns");
println!("p50={} p99={} p99.9={}", report.p50, report.p99, report.p999);
```
Metrics to Track
| Metric | Description | Target p99 |
|---|---|---|
| append_latency_ns | Time to append + fsync | < 1ms (SSD) |
| read_latency_ns | Time to read single record | < 100μs |
| encrypt_latency_ns | Time to encrypt record | < 10μs |
| hash_latency_ns | Time to compute chain hash | < 5μs |
| apply_latency_ns | Time for kernel apply | < 1μs |
| projection_lag_ns | Delay from commit to projection | < 10ms |
Avoiding Coordinated Omission
When benchmarking, avoid the “coordinated omission” trap:
- Use open-loop load generation with a fixed arrival rate, rather than purely closed-loop benchmarks (which pause sending while waiting and hide stalls)
- Record intended send time, not just response time
- Use HDR histogram’s correction for queue delays
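The correction can be sketched without any histogram library: when one response takes longer than the intended send interval, also record the latencies that the requests stalled behind it would have observed (this mirrors HDR histogram's expected-interval correction):

```rust
/// Record `latency` plus synthetic samples for the requests that should
/// have been sent (one per `interval`) while this one was stalled.
fn record_corrected(samples: &mut Vec<u64>, latency: u64, interval: u64) {
    samples.push(latency);
    let mut missed = latency.saturating_sub(interval);
    while missed >= interval {
        samples.push(missed); // what the queued-behind request would have seen
        missed -= interval;
    }
}

fn main() {
    let mut samples = Vec::new();
    // Intended rate: one request per 1ms; one response took 100ms
    record_corrected(&mut samples, 100, 1);
    // Without correction we would have 1 sample; with it, 100,
    // so the stall is no longer hidden from the percentiles
    assert_eq!(samples.len(), 100);
}
```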
Current Performance Characteristics (v0.2.0)
This section documents the baseline performance characteristics of Kimberlite v0.2.0, establishing a reference point for future optimization work.
Baseline Measurements
Append Performance (single-threaded, sync I/O):
- Throughput: ~10K events/sec
- Latency: Unmeasured (benchmark infrastructure configured but unused)
- Bottleneck: Synchronous fsync per batch
Read Performance:
- Random reads: Unmeasured
- Sequential scans: Unmeasured
- Bottleneck: Full-file reads on queries (storage.rs:368)
Verification Performance:
- Hash chain verification: O(n) from genesis by default
- Checkpoint optimization: Available but not default behavior
- Verified read performance: Depends on distance to nearest checkpoint
Index Performance:
- Index writes: per batch (storage.rs:291), no batching or amortization
- Potential for 10-100x improvement with an optimized write strategy
Known Bottlenecks
The following bottlenecks have been identified and documented for future optimization:
Storage I/O (HIGHEST IMPACT):
- Full-file reads on every query (storage.rs:368)
- Index written after EVERY batch (storage.rs:291)
- O(n) verification from offset 0 by default
- No mmap or async I/O
Crypto Configuration (QUICK WIN):
- Missing SIMD/hardware acceleration features in dependencies
- Cipher instantiation per operation (encryption.rs:937,987)
- Hardware features available but not enabled (AES-NI, SHA extensions)
State Management (MEDIUM):
- BTreeMap with String keys instead of HashMap (state.rs:56)
- No LRU caching for hot metadata
- Inefficient for high-throughput workloads
Benchmark Infrastructure (FOUNDATIONAL):
- Criterion and hdrhistogram configured but completely unused
- No benches/ directories in crates
- No regression detection in CI
- Cannot measure impact of optimizations
Performance Philosophy
Current Priority: Correctness > Performance
v0.2.0 focuses on:
- Byzantine-resistant consensus
- Cryptographic integrity
- Compliance guarantees
- Comprehensive testing
Performance optimizations are deferred to ensure correctness is established first.
Measurement Before Optimization:
Before any performance work, we will:
- Establish baseline benchmarks with Criterion
- Add latency instrumentation (p50/p90/p99/p999 with hdrhistogram)
- Profile hot paths with flamegraphs
- Identify actual bottlenecks with data
Premature optimization is avoided. All optimization decisions will be data-driven.
Optimization Roadmap
Planned performance improvements are documented in ROADMAP.md.
High-Priority Optimizations (v0.3.0):
- Enable crypto hardware acceleration (2-3x improvement)
- Benchmark infrastructure (baseline metrics)
- Index write batching (10-100x fewer writes)
- HashMap for hot paths (O(1) vs O(log n))
Medium-Priority Optimizations (v0.4.0):
- Memory-mapped log files (mmap)
- LRU cache for metadata
- Direct I/O for append path
- Advanced cache replacement (SIEVE)
Long-Term Optimizations (v0.5.0+):
- io_uring async I/O (Linux)
- Thread-per-core architecture
- Tenant-level parallelism
- Zero-copy deserialization
See ROADMAP.md for detailed optimization phases and target metrics.
Document Version: 2.0 (Roadmap Extracted)
Last Updated: 2026-02-02
Current State: Trimmed to current practices only (was ~3,900 lines, now ~900 lines)