Operational Runbook
Target Audience: On-call Engineers, SREs, Operations Teams
Purpose: Step-by-step procedures for common operational scenarios
Severity Levels: P0 (Critical), P1 (High), P2 (Medium), P3 (Low)
Table of Contents
- Emergency Contacts
- Incident Response Procedures
- Common Failure Scenarios
- Performance Troubleshooting
- Upgrade Procedures
- Escalation Paths
- Post-Incident Review
Emergency Contacts
On-Call Rotation
| Role | Primary | Secondary | Escalation |
|---|---|---|---|
| Database SRE | +1-555-0100 | +1-555-0101 | +1-555-0102 |
| Platform Engineering | +1-555-0200 | +1-555-0201 | +1-555-0202 |
| Security Team | +1-555-0300 | +1-555-0301 | +1-555-0302 |
| Compliance Officer | +1-555-0400 | N/A | +1-555-0401 |
Communication Channels
- Slack: #kimberlite-incidents (immediate response)
- PagerDuty: kimberlite-production service
- Status Page: https://status.kimberlite.io
- Incident Management: https://incidents.kimberlite.io
Incident Response Procedures
P0: Critical - Service Down / Data Loss Risk
Examples: All replicas down, quorum lost, assertion failures, data corruption
Response Time: 15 minutes (24/7)
Procedure:
# 1. ACKNOWLEDGE ALERT (within 5 minutes)
# - PagerDuty acknowledgment
# - Post in #kimberlite-incidents: "P0 incident - investigating"
# 2. ASSESS SEVERITY
# Check: How many replicas are healthy? Is quorum available?
# 3. IMMEDIATE TRIAGE
# If quorum available (at least 2 of 3, or 3 of 5, replicas healthy):
# - Service degraded but operational
# - Continue to step 4
# If quorum lost (fewer than 2 of 3, or 3 of 5, replicas healthy):
# - SERVICE DOWN - All writes failing
# - Escalate immediately to Platform Engineering
# - Update status page: "Major Outage"
# 4. CHECK MONITORING DASHBOARDS
# - Grafana: "Kimberlite Cluster Overview"
# - Look for: replica status, view changes, replication lag, disk errors
# 5. COLLECT DIAGNOSTIC INFO
# 6. FOLLOW SPECIFIC RUNBOOK
# - See "Common Failure Scenarios" section below
# - Document all actions taken in incident ticket
# 7. COMMUNICATION (every 15 minutes)
# - Update #kimberlite-incidents with progress
# - Update status page if customer-facing impact
# 8. RESOLVE OR ESCALATE
# - If resolved: Run post-incident review
# - If not resolved within 30 minutes: Escalate to engineering lead
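As a sketch of the quorum check in steps 2-3, the healthy-replica count can be pulled from Prometheus. The endpoint address and the jq dependency are assumptions; the metric name is the one used by the alerts in this runbook.
# Count healthy replicas via the Prometheus HTTP API (endpoint address illustrative)
HEALTHY=$(curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=kimberlite_cluster_replicas_total{status="healthy"}' \
  | jq -r '.data.result[0].value[1] // "0"')
echo "Healthy replicas: ${HEALTHY}"
# 3-replica cluster: quorum requires >= 2; 5-replica cluster: quorum requires >= 3
if [ "${HEALTHY}" -lt 2 ]; then
  echo "QUORUM LOST - escalate to Platform Engineering and update status page"
fi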
P1: High - Performance Degradation
Examples: High latency (>100ms p99), replication lag, view change storms
Response Time: 30 minutes (business hours)
Procedure:
# 1. ACKNOWLEDGE ALERT
# Post in #kimberlite-incidents: "P1 incident - investigating"
# 2. CHECK CURRENT METRICS
# Prometheus queries:
# 3. IDENTIFY ROOT CAUSE
# Common causes:
# - High disk latency (check IOPS)
# - Network issues (check packet loss)
# - Memory pressure (check available RAM)
# - CPU saturation (check load average)
# 4. APPLY MITIGATION
# See "Performance Troubleshooting" section
# 5. MONITOR FOR IMPROVEMENT
# Wait 10 minutes, recheck metrics
# If not improving: Escalate to Database SRE lead
# 6. DOCUMENT AND CLOSE
# Create incident ticket with:
# - Root cause
# - Mitigation applied
# - Follow-up actions (if any)
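For step 3, the usual host-level checks map to standard tools (the peer hostname is illustrative):
# Disk latency and utilization (look for await > 10ms, %util > 90%)
iostat -x 5 3
# Packet loss and RTT to a peer replica
mtr --report --report-cycles 10 replica-1
# Memory pressure
free -h
# CPU saturation / load average
uptime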
P2: Medium - Non-Critical Issues
Examples: Single replica down, moderate lag, configuration drift
Response Time: 2 hours (business hours)
Procedure:
# 1. TRIAGE
# - Service operational? Yes (quorum maintained)
# - Customer impact? No (transparent failover)
# 2. INVESTIGATE
# - Check replica logs
# - Review recent changes (deployments, config)
# 3. FIX OR SCHEDULE
# - If quick fix (restart, config): Apply immediately
# - If requires deep investigation: Schedule maintenance window
# 4. DOCUMENT
# - Update runbook if new issue discovered
# - Create Jira ticket for follow-up
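For step 2, a quick log review against the kimberlite systemd unit (the unit name matches the one used in the upgrade procedure below):
# Recent errors and warnings from the replica
journalctl -u kimberlite --since "2 hours ago" | grep -iE "error|warn|panic"
# Recent restarts or config reloads
journalctl -u kimberlite --since "24 hours ago" | grep -iE "start|reload"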
Common Failure Scenarios
Scenario 1: Replica Crash / Unresponsive
Symptoms:
- Alert: kimberlite_cluster_replicas_total{status="healthy"} < 3
- Replica not responding to heartbeats
- View change triggered
Diagnosis:
# 1. Check replica status
# Output: "status": "unreachable"
# 2. Check host-level health (commands illustrative)
uptime; dmesg | tail -n 50; df -h   # load average, kernel errors (OOM killer, disk), free space
# 3. Check replica logs
# Look for: panics, assertion failures, OOM killer
Resolution:
# Option A: Restart replica (if simple crash)
# OR (Kubernetes)
# Wait for replica to rejoin
# Expected: "status": "recovering" -> "normal"
# Option B: If restart fails, check for corruption
# If corruption detected:
# Option C: If hardware failure suspected
# 1. Remove from cluster
# 2. Replace hardware
# 3. Add back as new replica
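A hedged sketch of Option A for a systemd-managed host and for Kubernetes (pod and namespace names are illustrative):
# systemd-managed host
sudo systemctl restart kimberlite
journalctl -u kimberlite -f   # watch for "recovering" -> "normal"
# OR Kubernetes (pod and namespace names illustrative)
kubectl delete pod kimberlite-2 -n kimberlite
kubectl logs -f kimberlite-2 -n kimberlite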
Prevention:
- Monitor: kimberlite_assertion_failures_total (must be 0)
- Enable: Automatic restarts (Kubernetes liveness probes)
- Schedule: Weekly storage scrubbing
Scenario 2: Quorum Lost (All Replicas Down)
Symptoms:
- Alert: kimberlite_cluster_replicas_total{status="healthy"} < 2
- All client writes failing
- Status page: "Major Outage"
CRITICAL: This is a P0 incident. Escalate immediately.
Diagnosis:
# 1. Check if replicas are actually down or network partitioned
for replica in replica-0 replica-1 replica-2; do   # hostnames illustrative
  ping -c 3 "$replica"
done
# 2. Check load balancer / network infrastructure
# Contact network operations team
# 3. If replicas are truly down, check why
# - Power outage in datacenter?
# - Kubernetes cluster failure?
# - Cascading failure (e.g., disk full on all nodes)?
Resolution:
# Priority: Get ANY replica online to restore read access
# Option A: If network partition
# - Fix network connectivity
# - Replicas will auto-recover once they can communicate
# Option B: If datacenter failure (multi-region with standby)
# FAILOVER TO DR REGION
# Update DNS
# Option C: If all data lost (LAST RESORT)
# 1. Restore from most recent backup
# 2. Data loss will occur (RPO = backup age)
# 3. Notify compliance officer (breach reporting may be required)
Prevention:
- Deploy: Multi-region with standby replicas
- Test: Monthly DR drills
- Monitor: kimberlite_cluster_replicas_total{status="healthy"} >= 2
Scenario 3: View Change Storm (Rapid Leader Elections)
Symptoms:
- Alert: rate(kimberlite_consensus_view_changes_total[5m]) > 10
- Frequent leader changes (every few seconds)
- High latency, low throughput
Diagnosis:
# 1. Check view change history
# Output: View changes with timestamps and reasons
# 2. Check clock synchronization
for replica in replica-0 replica-1 replica-2; do   # hostnames illustrative
  ssh "$replica" chronyc tracking | grep "System time"   # assumes chrony; use ntpq -p for ntpd
done
# If offset > 500ms: CLOCK DESYNC DETECTED
# 3. Check network latency between replicas
for replica in replica-0 replica-1 replica-2; do   # hostnames illustrative
  ping -c 10 "$replica" | tail -n 1   # rtt min/avg/max summary
done
# If RTT > 100ms: NETWORK ISSUE
Resolution:
# If clock desync:
# 1. Force clock synchronization
for replica in replica-0 replica-1 replica-2; do   # hostnames illustrative
  ssh "$replica" sudo chronyc makestep   # step the clock immediately (assumes chrony)
done
# 2. Wait for clocks to stabilize (30-60 seconds)
# 3. Monitor: view changes should stop once clocks converge
# If network latency:
# 1. Check network infrastructure (switch ports, congestion)
# 2. Contact network operations
# 3. Increase election timeout temporarily (emergency only)
# If persistent issue:
# 1. Isolate problematic replica
# 2. Run diagnostics
# 3. Remove from cluster if necessary
Prevention:
- Monitor: Clock offset < 100ms (alert at 200ms)
- NTP: Use multiple reliable NTP servers
- Network: Ensure <1ms RTT between replicas
Scenario 4: High Replication Lag
Symptoms:
- Alert: kimberlite_replication_lag_ms > 1000
- Backup replicas falling behind primary
- Standby replicas marked ineligible for promotion
Diagnosis:
# 1. Check current lag
# Output: Lag in operations and milliseconds per replica
# 2. Check disk performance
# Look for: %util > 90%, await > 20ms
# 3. Check network throughput
# Look for: saturation, packet drops
# 4. Check replica CPU/memory
# Look for: CPU > 80%, memory pressure
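A sketch of step 1 using the Prometheus HTTP API and the lag metric from the alert above (endpoint address and the replica label are assumptions):
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=kimberlite_replication_lag_ms' \
  | jq -r '.data.result[] | "\(.metric.replica // .metric.instance): \(.value[1]) ms"'
# Then check host resources on the lagging replica:
iostat -x 5 3   # %util > 90%, await > 20ms => disk bottleneck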
Resolution:
# If disk I/O bottleneck:
# 1. Check for runaway queries
# Kill expensive queries if found
# 2. Trigger log compaction (free up IOPS)
# 3. Increase disk IOPS (cloud)
# If network bottleneck:
# 1. Check for bandwidth saturation
# 2. Throttle read queries (if standby serving reads)
# If CPU bottleneck:
# 1. Scale up instance size
# 2. Add more standby replicas to distribute read load
# Emergency: Remove lagging replica
# Fix issue, then rejoin
Prevention:
- Monitor: Lag every 15 seconds
- Alert: Lag > 500ms (P2), Lag > 1000ms (P1)
- Capacity: Plan for 50% headroom on disk IOPS
Scenario 5: Assertion Failure (Formal Verification Violation)
Symptoms:
- Alert: kimberlite_assertion_failures_total > 0 (CRITICAL)
- Replica panicked with assertion failure
- Log contains: assertion failed: ...
CRITICAL: This indicates a bug in Kimberlite or hardware corruption. Treat as P0.
Diagnosis:
# 1. IMMEDIATELY ISOLATE AFFECTED REPLICA
# 2. Capture diagnostics BEFORE any recovery
# 3. Check log for assertion details
journalctl -u kimberlite | grep "assertion failed"   # systemd unit name as used elsewhere in this runbook
# Example: "assertion failed: commit_number <= op_number"
# 4. Check for hardware issues
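For step 4, standard hardware checks look like this (device path illustrative):
# NVMe / SSD health and media errors
sudo smartctl -a /dev/nvme0n1 | grep -iE "media|error|wear"
# ECC / machine-check errors in the kernel log
sudo dmesg | grep -iE "mce|ecc|hardware error"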
Resolution:
# DO NOT ATTEMPT AUTOMATIC RECOVERY
# 1. Create incident ticket with:
# - Assertion failure message
# - Replica state dump
# - Storage export
# 2. Contact Kimberlite engineering (escalate to P0)
# - This is a formal verification violation
# - May indicate software bug or hardware corruption
# 3. Remove replica from cluster
# 4. Restore from verified backup
# 5. Run full verification before rejoining
# 6. Rejoin cluster only after engineering approval
Prevention:
- Monitor: kimberlite_assertion_failures_total == 0 (always)
- Hardware: Use ECC RAM, enterprise-grade SSDs
- Testing: Run VOPR scenarios in staging before production deploy
Scenario 6: Disk Full
Symptoms:
- Alert: kimberlite_storage_disk_usage_percent > 90
- Log writes failing
- Replica unable to accept new operations
Diagnosis:
# 1. Check disk usage (data directory path illustrative)
df -h /var/lib/kimberlite
# Output: 95% used
# 2. Identify largest files
du -ah /var/lib/kimberlite | sort -rh | head -n 20
# Likely culprit: log files
Resolution:
# Option A: Trigger log compaction (fast, minutes)
# Removes old log entries, frees space
# Option B: Increase disk size (cloud, minutes)
# Then extend filesystem
# Option C: Emergency cleanup (use with caution)
# Delete old backup snapshots (if stored locally)
# Option D: Offload to S3 (for cold data)
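A sketch of Option B after the cloud volume has been resized, assuming an ext4 filesystem on /dev/nvme1n1p1 (device and mount path are illustrative):
lsblk                          # confirm the kernel sees the new volume size
sudo growpart /dev/nvme1n1 1   # grow the partition to fill the volume
sudo resize2fs /dev/nvme1n1p1  # ext4; use xfs_growfs <mountpoint> for XFS
df -h /var/lib/kimberlite      # verify free space increased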
Prevention:
- Monitor: Disk usage every 5 minutes
- Alert: 80% (P3), 90% (P1), 95% (P0)
- Automation: Auto-compaction at 85%
- Capacity: Plan for 30% free space
Performance Troubleshooting
High Commit Latency (p99 > 100ms)
Target: p99 < 50ms, p99.9 < 100ms
Diagnosis Tree:
High p99 latency?
├─ Check disk fsync latency
│ └─ iostat -x 5 3 | grep nvme
│ └─ await > 10ms? → Disk I/O issue
│ ├─ Check for competing I/O (other processes)
│ ├─ Check disk queue depth (nr_requests)
│ └─ Consider faster storage (Optane)
│
├─ Check network RTT
│ └─ ping replica-1 | grep time
│ └─ RTT > 5ms? → Network latency
│ ├─ Check for packet loss (mtr)
│ ├─ Check switch configuration
│ └─ Consider placement groups (AWS)
│
├─ Check quorum size
│ └─ kimberlite-cli cluster status
│ └─ 5-replica cluster? → Inherently higher latency
│ └─ Consider reducing to 3-replica if acceptable
│
└─ Check CPU saturation
└─ top | grep kimberlite
└─ CPU > 80%? → CPU bottleneck
├─ Check for expensive queries
├─ Scale up instance size
└─ Optimize query patterns
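To measure fsync latency directly (the first branch above), a short fio run against the data directory works (path and job parameters illustrative):
fio --name=fsync-latency --directory=/var/lib/kimberlite --size=100M \
    --rw=write --bs=4k --fsync=1 --numjobs=1 --runtime=30 --time_based \
    --group_reporting
# Check the fsync latency percentiles in the output; p99 > 10ms points at the disk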
Quick Fixes:
# 1. Reduce batch size (lower latency, lower throughput)
# 2. Increase heartbeat frequency (detect issues faster)
# 3. Check for slow queries
# Kill if found: kimberlite-cli query kill <query-id>
Low Throughput (< 10K ops/sec)
Target: >50K ops/sec (3-replica, NVMe SSDs)
Diagnosis:
# 1. Check current throughput
# 2. Check batch size
# If < 1000: Increase for higher throughput
# 3. Check for bottlenecks
# - CPU: top (should be < 60%)
# - Disk: iostat -x 5 (await < 5ms)
# - Network: iftop (< 5 Gbps on 10G link)
# 4. Check for serialization bottlenecks
# Look for: hot functions in flamegraph
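For step 4, a CPU profile of the replica process can be captured with perf (process name illustrative; flamegraph tooling assumed to be installed separately):
# Sample the replica for 30 seconds
sudo perf record -F 99 -g -p "$(pgrep -x kimberlite)" -- sleep 30
sudo perf report --stdio | head -n 40   # hot functions; feed perf script into flamegraph tooling if available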
Quick Fixes:
# 1. Enable batching (groups small writes)
# 2. Tune compaction (run less frequently during peak)
# 3. Add standby replicas (offload read queries)
Upgrade Procedures
Rolling Upgrade (Zero Downtime)
Prerequisites:
- Backup completed in last 24 hours
- Staging environment tested
- Change window approved
Procedure:
# 1. PRE-UPGRADE CHECKS
# Verify: All replicas healthy, no lag
# Current version: v0.3.0
# 2. UPGRADE REPLICAS SEQUENTIALLY (one at a time)
for replica in replica-2 replica-1 replica-0; do   # backups first, primary last; hostnames illustrative
# Kubernetes: restart the pod so it picks up the new image
# OR Docker / systemd:
ssh "$replica" "docker pull kimberlite/server:v0.4.0 && systemctl restart kimberlite"
# Wait for replica to rejoin
# Verify cluster version progressed
# Expected: cluster_version increases as each replica upgrades
sleep 300   # Wait 5 minutes between replicas (observe stability)
done
# 3. VERIFY FEATURE ACTIVATION
# Expected: New features enabled when all replicas reach v0.4.0
# 4. POST-UPGRADE VALIDATION
# Verify: All replicas on v0.4.0, cluster healthy
# Run smoke tests
# 5. MONITOR FOR 1 HOUR
# Watch for: increased latency, errors, view changes
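One way to verify step 4 across the cluster, assuming SSH access and a Docker container named kimberlite (hostnames and container name illustrative):
for replica in replica-0 replica-1 replica-2; do
  echo -n "${replica}: "
  ssh "$replica" "docker inspect --format '{{.Config.Image}}' kimberlite" 2>/dev/null \
    || ssh "$replica" "systemctl show kimberlite -p ExecStart --value"
done
# Expected: every replica reports kimberlite/server:v0.4.0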
Rollback Procedure:
# If issues detected, rollback immediately
for replica in replica-2 replica-1 replica-0; do   # same order as the upgrade; hostnames illustrative
  ssh "$replica" "docker pull kimberlite/server:v0.3.0 && systemctl restart kimberlite"
done
# Document reason for rollback in incident ticket
Escalation Paths
When to Escalate
| Situation | Escalate To | Timeframe |
|---|---|---|
| Quorum lost | Platform Engineering Lead | Immediate |
| Assertion failure | Kimberlite Engineering | Immediate |
| Multi-region outage | VP Engineering | Immediate |
| Data corruption detected | CTO + Compliance Officer | 15 minutes |
| Performance degradation (P1) | Database SRE Lead | 30 minutes |
| Unresolved issue after 1 hour | On-call manager | 1 hour |
Escalation Template
Subject: [P0] Kimberlite Incident - Quorum Lost
Severity: P0 (Critical)
Status: Active
Started: 2026-02-06 14:30 UTC
Impact: All writes failing, reads degraded
Current State:
- Replicas 0, 1 DOWN
- Replica 2 HEALTHY (no quorum)
- Root cause: Unknown (investigating)
Actions Taken:
1. Acknowledged PagerDuty at 14:31 UTC
2. Checked replica logs (no obvious errors)
3. Verified network connectivity (all reachable)
4. Attempting replica restarts
Requesting:
- Platform Engineering assistance
- Escalated to VP Engineering (CC'd)
Incident Link: https://incidents.kimberlite.io/INC-12345
War Room: #incident-12345
Post-Incident Review
RCA Template
Incident: [INC-12345] High Replication Lag
Date: 2026-02-06
Severity: P1
Duration: 45 minutes (14:30 - 15:15 UTC)
Impact: Standby replica marked ineligible, DR failover unavailable
Timeline:
- 14:30: Alert triggered (lag > 1000ms)
- 14:32: On-call acknowledged, began investigation
- 14:35: Identified root cause (disk I/O saturation)
- 14:40: Applied mitigation (log compaction)
- 15:00: Lag returned to < 100ms
- 15:15: Incident resolved
Root Cause: Log compaction not running automatically due to misconfiguration. Log file grew to 800GB, exhausting disk IOPS.
Resolution: Manually triggered log compaction, freed 200GB, IOPS returned to normal.
Action Items:
- Fix compaction scheduler config (Priority: P0, Owner: @sre-alice)
- Add alert for compaction failures (Priority: P1, Owner: @sre-bob)
- Increase disk IOPS provisioned by 50% (Priority: P2, Owner: @sre-charlie)
- Document compaction monitoring in runbook (Priority: P3, Owner: @sre-alice)
Lessons Learned:
- Compaction is critical for IOPS management
- Need better visibility into storage subsystem health
- Automated remediation for common issues (auto-compact at 85% disk)
Compliance Impact: None. No data loss, no audit log gaps, no compliance violations.
References
- Production Deployment: Production Deployment Guide
- Monitoring: Monitoring and Alerting
- Configuration: Configuration Guide
- Security: Security Best Practices
- Compliance: Compliance Certification Package
Last Updated: 2026-02-06
Version: 0.4.0
On-Call Rotation: https://pagerduty.com/kimberlite-production