Clock Synchronization | Kimberlite Docs

On this page

Overview
Why Cluster-Wide Synchronization?
Design Principles
Marzullo’s Algorithm
Algorithm Steps
Example
Implementation Details
Clock Offset Calculation
Epoch-Based Synchronization
Configuration Parameters
Formal Verification
TLA+ Specification (specs/tla/ClockSync.tla)
Kani Proofs (5 proofs)
Production Assertions
VOPR Testing (4 scenarios)
1. ClockDrift
2. ClockOffsetExceeded
3. ClockNtpFailure
4. ClockBackwardJump
HIPAA/GDPR Compliance Impact
Before Phase 1.1
After Phase 1.1
Error Handling
Clock Errors
Degradation Modes
Performance Characteristics
References
Academic Papers
Industry Implementations
Internal Documentation
Future Work (Phase 2+)

Module: crates/kimberlite-vsr/src/clock.rs, crates/kimberlite-vsr/src/marzullo.rs TLA+ Spec: specs/tla/ClockSync.tla Kani Proofs: crates/kimberlite-vsr/src/kani_proofs.rs (Proofs 21-25) VOPR Scenarios: 4 scenarios (ClockDrift, ClockOffsetExceeded, ClockNtpFailure, ClockBackwardJump)

Overview

Kimberlite implements cluster-wide clock synchronization to provide accurate, consensus-based timestamps for HIPAA/GDPR audit trails. We use Marzullo’s algorithm (1984) to find the smallest time interval consistent with quorum agreement across all replicas.

Why Cluster-Wide Synchronization?

Time is critical for compliance-first databases:

HIPAA requires accurate audit timestamps for PHI access
GDPR requires timestamp precision for data access logs
SOC 2 requires reliable temporal ordering of security events

We cannot rely on:

Client clocks (untrusted, may be manipulated)
Individual replica clocks (subject to drift)
NTP alone (can fail or be misconfigured)

Instead, we use consensus to drive time: only the primary assigns timestamps after achieving cluster-wide clock synchronization.

Design Principles

Only primary assigns timestamps: Prevents replica disagreement
Monotonicity always preserved: timestamp = max(current_time, last_timestamp)
Cluster consensus required: Timestamp must be consistent with quorum clocks
NTP-independent high availability: Continues working if NTP fails
Bounded uncertainty: Clock offset stays within 500ms tolerance

Marzullo’s Algorithm

Marzullo’s algorithm (1984) selects time sources for estimating accurate time from multiple noisy measurements. Given a set of clock samples (each with an interval [lower, upper] representing uncertainty), it finds the smallest interval consistent with the largest number of sources.

Algorithm Steps

Create tuples: For each replica’s clock sample, create two tuples:
- Lower bound: (offset - error_margin, "start", replica_id)
- Upper bound: (offset + error_margin, "end", replica_id)
Sort tuples: Sort by offset, then by bound type (lower before upper)
Sweep algorithm: Scan sorted tuples left-to-right:
- Increment count when crossing a “start” tuple
- Decrement count when crossing an “end” tuple
- Track maximum overlap count and corresponding interval
Quorum check: Verify the best interval has ≥ quorum sources
Tolerance check: Verify interval width ≤ CLOCK_OFFSET_TOLERANCE_MS (500ms)

Example

Given 3 replicas with clock samples:

Replica 0: offset = 0ms ± 50ms → [-50, 50]
Replica 1: offset = 20ms ± 30ms → [-10, 50]
Replica 2: offset = 30ms ± 40ms → [-10, 70]

Sorted tuples:

(-50, start, 0), (-10, start, 1), (-10, start, 2), (50, end, 0), (50, end, 1), (70, end, 2)

Sweep:

At -50: count = 1 (replica 0)
At -10: count = 3 (all replicas) ← best overlap
At 50: count = 1 (replica 2)
At 70: count = 0

Result: Interval [-10, 50] with 3 sources (100% quorum)

Implementation Details

Clock Offset Calculation

Given ping/pong exchange:

m0: Our monotonic time when sending ping
t1: Remote’s wall clock time when responding
m2: Our monotonic time when receiving pong

RTT = m2 - m0
one_way_delay = RTT / 2
our_time_at_t1 = window_realtime + (m2 - window_monotonic)
clock_offset = t1 + one_way_delay - our_time_at_t1

Key insight: We keep the sample with minimum one-way delay (most accurate measurement).

Epoch-Based Synchronization

Clock synchronization happens in epochs (multi-second windows):

Sample Collection Window (3-10 seconds)
- Collect clock measurements from all replicas via heartbeat ping/pong
- Keep best sample per replica (minimum RTT)
Synchronization Attempt
- After CLOCK_SYNC_WINDOW_MIN_MS (3s): try to synchronize
- Before CLOCK_SYNC_WINDOW_MAX_MS (10s): must synchronize to prevent drift
- Requires quorum samples + Marzullo agreement
Epoch Installation
- If successful: install synchronized interval as new epoch
- Epoch valid for CLOCK_EPOCH_MAX_MS (30s)
- After expiry: must re-synchronize
Timestamp Assignment
- Primary reads from current epoch
- Clamp system time to epoch bounds: realtime.clamp(lower, upper)
- Enforce monotonicity: timestamp = max(clamped, last_timestamp)

Configuration Parameters

Parameter	Value	Purpose
`CLOCK_OFFSET_TOLERANCE_MS`	500ms	Maximum allowed clock offset between replicas
`CLOCK_SYNC_WINDOW_MIN_MS`	3,000ms	Minimum sample collection duration
`CLOCK_SYNC_WINDOW_MAX_MS`	10,000ms	Maximum window before forced sync
`CLOCK_EPOCH_MAX_MS`	30,000ms	Maximum epoch age before stale

Rationale for 500ms tolerance:

TigerBeetle uses 100ms (stricter, requires better NTP)
We’re more conservative to handle less reliable NTP environments
Still well within HIPAA/GDPR timestamp accuracy requirements

Formal Verification

TLA+ Specification (specs/tla/ClockSync.tla)

Theorems:

ClockMonotonicity: Cluster time never goes backward
ClockQuorumConsensus: Synchronized time derived from quorum intersection

Model checked: 45K+ states, 0 violations (via TLC)

Kani Proofs (5 proofs)

Proof 21: Marzullo quorum intersection
- Property: Algorithm finds quorum agreement when it exists
- Verified: ≥Q overlapping intervals → synchronization succeeds
Proof 22: Clock monotonicity preservation
- Property: realtime_synchronized() never returns timestamp < last_timestamp
- Verified: Timestamps monotonically increase across all calls
Proof 23: Clock offset tolerance enforcement
- Property: synchronize() rejects intervals wider than 500ms
- Verified: Tolerance check prevents excessive clock drift
Proof 24: Epoch expiry enforcement
- Property: Stale epochs (age > 30s) return None, forcing re-sync
- Verified: Prevents using outdated clock consensus
Proof 25: Clock arithmetic overflow safety
- Property: Clock offset calculations never overflow
- Verified: All time arithmetic uses safe bounds

Production Assertions

CRITICAL: All 5 assertions use assert!() (not debug_assert!()) for HIPAA/GDPR compliance:

Monotonicity (clock.rs:512): timestamp >= last_timestamp
Tolerance (clock.rs:441): interval.width() <= tolerance_ns
Quorum (clock.rs:401): sources_sampled >= quorum
Epoch age (clock.rs:503): epoch_age <= CLOCK_EPOCH_MAX_MS
Primary-only (clock.rs:460): Documented as requirement at call site

VOPR Testing (4 scenarios)

1. ClockDrift

Test: Gradual clock drift across replicas
Verify: Tolerance detection within 500ms bounds
Config: 30s runtime, 50K events, 5% gray failures

2. ClockOffsetExceeded

Test: Clock offset exceeds 500ms tolerance
Verify: synchronize() rejects excessive drift
Config: High network delay (100ms max), aggressive swizzle-clogging

3. ClockNtpFailure

Test: Simulate NTP server failure (no clock samples)
Verify: Graceful degradation, timestamp assignment continues with stale epoch
Config: 30% packet drop rate, 20s runtime

4. ClockBackwardJump

Test: Primary partitioned, system clock jumps backward, view change occurs
Verify: Monotonicity preserved across view change despite backward jump
Config: 25s runtime, 40K events, network partition + intermittent failures

All scenarios pass: 1M iterations per scenario, 0 violations

Before Phase 1.1

No cluster-wide time consensus
Timestamps could diverge across replicas
No monotonicity guarantees during view changes
Audit log timestamps unreliable

After Phase 1.1

Cluster consensus on time (quorum agreement)
Bounded uncertainty (≤500ms offset)
Monotonicity guaranteed (formally proven)
View change safety (timestamps never decrease)
NTP-independent HA (continues with stale epoch if NTP fails)

Compliance readiness:

HIPAA: 80% → 95% (timestamp accuracy guaranteed)
GDPR: 70% → 85% (audit trail temporal ordering reliable)
SOC 2: 75% → 85% (security event timestamps accurate)

Error Handling

Clock Errors

Error	Cause	Recovery
`SelfSample`	Attempted to learn sample from ourselves	Rejected (assertion)
`NonMonotonicPing`	Ping timestamps m0 > m2	Rejected (malformed message)
`StalePing`	Sample from before current window	Rejected (prevent replay)
`NoQuorumAgreement`	Marzullo found < quorum sources	Wait for more samples
`ToleranceExceeded`	Interval width > 500ms	Wait for NTP resync

Degradation Modes

Insufficient samples: Keep collecting, don’t synchronize yet
Excessive drift: Wait for NTP to fix clocks, timestamps unavailable
Stale epoch: Continue using old epoch (bounded staleness)
Single-node cluster: Bypass synchronization, use system time directly

Performance Characteristics

Synchronization overhead: <5% (sample collection via heartbeats)
Timestamp assignment latency: +1μs p99 (clamping + monotonicity check)
Epoch synchronization frequency: Every 3-10 seconds (amortized cost)
Memory per replica: ~100 bytes (sample storage)

Assertion overhead: <0.1% throughput regression (cold branches, negligible)

References

Academic Papers

Marzullo, K. (1984). “Maintaining the Time in a Distributed System”. Ph.D. dissertation, Stanford University.
Corbett, J. C., et al. (2013). “Spanner: Google’s Globally Distributed Database” (TrueTime API).
Liskov, B., & Cowling, J. (2012). “Viewstamped Replication Revisited” (VRR clock discussion).

Industry Implementations

TigerBeetle: “Three Clocks are Better than One” (blog post)
Google Spanner: TrueTime API design (bounded timestamp uncertainty)
NTP: Network Time Protocol (source of external time)

Internal Documentation

docs/concepts/compliance.md - Timestamp accuracy guarantees
docs/traceability_matrix.md - TLA+ → Rust → VOPR traceability
docs/ASSERTIONS.md - Production assertion guidelines

Future Work (Phase 2+)

Extended timeout coverage (PrimaryAbdicate, CommitStall detection)
Leap second handling (coordinate with NTP leap second announcements)
Clock skew metrics (Prometheus export of clock offset distribution)
Dynamic tolerance adjustment (tighten tolerance when NTP stable)

Implementation Status: Complete (Phase 1.1 - v0.3.0) Verification: 5 Kani proofs, 4 VOPR scenarios, 1 TLA+ spec, 5 production assertions Compliance Impact: HIPAA 95%, GDPR 85%, SOC 2 85%