Troubleshooting
Debug common operational issues in Kimberlite clusters.
Quick Diagnostic Commands
- Check cluster status
- View recent logs
- Check metrics
- Verify configuration
Common Issues
Issue 1: Cluster Has No Leader
Symptoms:
- All writes fail with “No leader elected”
- kmb_consensus_leader metric shows 0
- Logs show repeated election timeouts
Diagnostic:
- Check that the nodes can reach each other
- Check the view number on each node (they should be identical)
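Once the view number has been read from each node, the comparison itself can be scripted. This is a minimal sketch: the helper name views_agree is hypothetical, and how the view numbers are collected from each node is left open because it depends on your tooling.

```shell
# views_agree VIEW...
# Exits 0 when every argument is the same view number,
# 1 when at least one differs, and 2 when no arguments are given.
views_agree() {
    [ "$#" -gt 0 ] || return 2
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}
```

For example, `views_agree 12 12 12 && echo "views match"` prints the message, while `views_agree 12 13 12` exits non-zero.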
Common Causes:
- Network partition: nodes cannot reach each other.
  Check: test connectivity between nodes.
  Solution: fix the network configuration and ensure the firewall allows port 7001.
- Clock skew: node clocks differ by more than 500ms.
  Check: compare the time on all nodes.
  Solution: enable NTP on all nodes.
- Lost quorum: a majority of nodes are down.
  Check: the status of each node.
  Solution: start the failed nodes to restore quorum.
- Corrupted state: a node's state is corrupted.
  Check: look for corruption.
  Solution: see Recovering from Corruption.
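The 500ms clock-skew limit above can be checked mechanically once a timestamp has been sampled from each node. The sketch below assumes epoch-millisecond timestamps (for example from GNU `date +%s%3N`); the helper name skew_exceeds_ms is hypothetical. Timestamps sampled over a remote shell include network delay, so treat results close to the limit as inconclusive.

```shell
# skew_exceeds_ms A_MS B_MS [LIMIT_MS]
# Exits 0 when two epoch-millisecond timestamps differ by more than
# LIMIT_MS (default 500, the limit quoted in this guide).
skew_exceeds_ms() {
    limit=${3:-500}
    diff=$(( $1 - $2 ))
    # Take the absolute value so argument order does not matter.
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    [ "$diff" -gt "$limit" ]
}
```

For example, `skew_exceeds_ms 1700000000900 1700000000100 && echo "clock skew too high"` reports a problem because the two samples are 800ms apart.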
Issue 2: High Write Latency
Symptoms:
- kmb_write_duration_seconds P99 > 100ms
- Client writes timing out
- Logs show slow fsync operations
Diagnostic:
- Check disk I/O
- Check the write latency breakdown
- Check whether the disk is full
Common Causes:
- Slow disk: fsync takes more than 10ms.
  Check: test disk sync performance.
  Solution: upgrade to faster disks (NVMe SSD recommended).
- Disk full: no space for new writes.
  Solution: add disk space or enable log compaction.
- Network latency: high inter-node latency.
  Check: ping the other nodes.
  Solution: deploy nodes in the same availability zone.
- Overloaded CPU: CPU saturated with other work.
  Solution: reduce load or add more nodes.
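The disk-full cause above is easy to rule out with the standard `df -P` command. The following sketch uses hypothetical helper names, and the 90% default threshold is an assumption, not a Kimberlite setting. Pass the path of your own data directory.

```shell
# parse_use_percent
# Reads `df -P` output on stdin and prints the Capacity column of the
# first filesystem line as an integer without the percent sign.
parse_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_use_percent PATH
# Prints the used-space percentage of the filesystem holding PATH.
disk_use_percent() {
    df -P "$1" | parse_use_percent
}

# disk_nearly_full PATH [THRESHOLD]
# Exits 0 when usage is at or above THRESHOLD percent (default 90, an
# assumed value), 2 when usage could not be determined.
disk_nearly_full() {
    used=$(disk_use_percent "$1")
    [ -n "$used" ] || return 2
    [ "$used" -ge "${2:-90}" ]
}
```

The parser assumes the filesystem name contains no spaces, which holds for typical block devices.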
Issue 3: Projection Lag Growing
Symptoms:
- kmb_projection_lag increasing over time
- Queries return stale data
- Logs show projection apply backlog
Diagnostic:
- Check projection lag
- Check CPU usage
- Check query load
Common Causes:
- Heavy query load: queries block projection updates.
  Check: the query rate.
  Solution: scale read replicas or reduce query load.
- Slow queries: long-running queries hold locks.
  Check: find slow queries in the logs.
  Solution: add indexes or optimize queries.
- Insufficient CPU: projection processing is CPU-bound.
  Solution: add more CPU cores or reduce the write rate.
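To confirm that kmb_projection_lag is actually growing, and not merely high, compare two samples taken some time apart. The sketch below assumes the metrics are exposed in Prometheus text format without timestamps, which this guide does not state. The helper names are hypothetical, and fetching the metrics text is left to you.

```shell
# metric_value NAME
# Reads Prometheus text-format metrics on stdin and prints the value of
# the first sample whose metric name is exactly NAME (labels ignored).
# Assumes samples carry no trailing timestamp.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }    # skip HELP and TYPE comment lines
        {
            n = $1
            i = index(n, "{")
            if (i > 0) n = substr(n, 1, i - 1)
            if (n == name) { print $NF; exit }
        }
    '
}

# is_increasing OLD NEW
# Exits 0 when NEW is numerically greater than OLD (floats allowed).
is_increasing() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(b + 0 > a + 0) }'
}
```

Take one sample, wait a minute, take a second, then run `is_increasing "$old" "$new"`. A single increase can be noise, so repeat the comparison a few times before concluding that lag is trending upward.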
Issue 4: Frequent View Changes
Symptoms:
- kmb_consensus_view_changes_total increasing rapidly
- Logs show repeated leader elections
- Write latency spikes during view changes
Diagnostic:
- Check the view change rate
- Check network packet loss
Common Causes:
- Network flakiness: intermittent packet loss.
  Check: test network stability.
  Solution: fix the network infrastructure.
- Process stalls: long pauses cause heartbeat timeouts. Rust has no garbage collector, but the allocator can still stall under memory pressure.
  Check: look for allocator stalls and memory pressure.
  Solution: increase the election timeout or reduce memory pressure.
- Overloaded node: the leader cannot send heartbeats in time.
  Check: CPU usage on the leader.
  Solution: reduce load or add more nodes.
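Because kmb_consensus_view_changes_total is a counter, the useful number is its rate of change, not its absolute value. This sketch turns two samples into a per-minute rate. The helper name is hypothetical, and what counts as "rapid" depends on your cluster, so no threshold is suggested here.

```shell
# counter_rate_per_min OLD NEW SECONDS
# Prints the per-minute increase of a counter sampled twice, SECONDS
# apart, with two decimal places. Exits 1 if SECONDS is not positive.
counter_rate_per_min() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        if (s + 0 <= 0) exit 1
        printf "%.2f\n", (b - a) * 60 / s
    }'
}
```

A negative result means the counter was reset between samples, typically because the node restarted. Discard that pair and sample again.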
Issue 5: Node Won’t Join Cluster
Symptoms:
- New node fails to join existing cluster
- Logs show “Rejected by leader”
- Node stays in “Recovering” state
Diagnostic:
- Check node status
- Check whether the leader sees the node
- Check configuration
Common Causes:
- Mismatched cluster name: cluster_name in config.toml must match the existing cluster.
    # config.toml
    [server]
    cluster_name = "production"  # Must match existing cluster
- Wrong node ID: the node ID is already in use.
  Check: the existing node IDs.
- TLS certificate mismatch.
  Check: verify the node's certificate.
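Once the node IDs of the existing members and the joining node are listed, a duplicate is easy to detect. This is a small sketch with a hypothetical helper name. Gathering the IDs, from configuration files or from cluster status output, is left to you.

```shell
# duplicate_ids ID...
# Prints each ID that appears more than once, one per line.
# Prints nothing when all IDs are unique.
duplicate_ids() {
    [ "$#" -gt 0 ] || return 0
    printf '%s\n' "$@" | sort | uniq -d
}
```

For example, `duplicate_ids 1 2 3 2` prints `2`, meaning the joining node must be given a different ID.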
Issue 6: Data Corruption Detected
Symptoms:
- kmb_checksum_failures_total increasing
- Logs show “Checksum mismatch” errors
- Queries return errors
Diagnostic:
- Check corruption metrics
- Find corrupted segments
Common Causes:
- Disk failure: silent data corruption.
  Check: disk health.
  Solution: see Recovering from Corruption.
- Power loss during write: torn writes.
  Solution: enable a battery-backed write cache or use a UPS.
- Kernel bug: rare kernel I/O bug.
  Solution: update the kernel.
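To gauge how widespread the corruption is, count the "Checksum mismatch" errors in a node's logs. The sketch below reads log text on stdin because the log location and format are not specified in this guide. The helper name is hypothetical.

```shell
# count_checksum_errors
# Reads log lines on stdin and prints how many contain the
# "Checksum mismatch" message. Prints 0 when there are none.
count_checksum_errors() {
    awk '/Checksum mismatch/ { n++ } END { print n + 0 }'
}
```

Running it against the logs of each node shows whether the errors are confined to one node, which points at that node's disk, or spread across the cluster.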
Advanced Debugging
Enable Debug Logging
- Temporary (until restart)
- Permanent
Capture Packet Traces
- Capture cluster traffic
- Analyze the capture with Wireshark
Profile CPU Usage
- Install perf
- Record a CPU profile
- Generate a flamegraph
Analyze Memory Usage
- Check memory usage
- Take a heap profile (requires a debug build)
Recovery Procedures
Recovering from Corruption
If a node detects corruption:
1. Stop the node.
2. Verify the corruption.
3. Restore from a replica (if the node is a follower): delete the corrupted data, then restart the node and let it recover from the leader.
4. Restore from backup (if the node is the leader or quorum is lost): stop all nodes, restore the data from backup, then start the cluster.
Recovering from Quorum Loss
If a majority of nodes are down and cannot be recovered:
WARNING: Do not do this lightly. Forcing a leader can cause data loss.
1. Force a node to become leader (DANGER: only if quorum is permanently lost).
2. Add new nodes to restore redundancy: start the new nodes.
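Before forcing a leader, confirm that quorum is really lost. The sketch below assumes a simple majority quorum, consistent with the "majority of nodes" wording in this guide. It is not taken from Kimberlite's source, and the helper names are hypothetical.

```shell
# quorum_size TOTAL
# Prints the smallest number of nodes that forms a majority of TOTAL.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum UP TOTAL
# Exits 0 when UP reachable nodes are enough for a majority of TOTAL.
has_quorum() {
    [ "$1" -ge "$(quorum_size "$2")" ]
}
```

For example, `has_quorum 1 3` fails because a 3-node cluster needs 2 nodes. If `has_quorum` succeeds for the nodes you can still start, restart them and skip the forced recovery.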
Getting Help
Collect Diagnostic Bundle
Generate a diagnostic bundle. The bundle includes:
- Configuration
- Recent logs
- Metrics snapshot
- Cluster status
- System info
Enable Support Access
- Generate a one-time support token (expires in 24h)
- Share the token with the support team
Report a Bug
If you’ve found a bug, report it with:
- Diagnostic bundle
- Steps to reproduce
- Expected vs actual behavior
- Kimberlite version (output of kimberlite-server --version)
See Bug Bounty Program for security issues.
Related Documentation
- Monitoring Guide - Metrics and alerts
- Configuration Guide - Configuration options
- Deployment Guide - Deployment patterns
- Security Guide - TLS and authentication
Key Takeaway: Most issues are network, disk, or configuration problems. Check metrics first, enable debug logging if needed, and always verify configuration before restarting.