Safe LLM Integration with VOPR
This document explains how Large Language Models (LLMs) are used in Kimberlite’s VOPR testing framework without compromising determinism or correctness.
Core Principle
LLMs suggest, validators verify, invariants decide.
LLMs are idea generators, not judges.
The Risk: Nondeterminism
LLMs are probabilistic. If you use an LLM during a VOPR run to make decisions, you break determinism:
❌ BAD: LLM in the loop
┌─────────────┐
│ VOPR run │
│ (seed=42) │
└──────┬──────┘
│
▼
┌─────────────┐
│ Should we │ ← LLM decides
│ inject a │ (nondeterministic!)
│ fault? │
└──────┬──────┘
│
▼
Same seed ≠ Same execution ← BROKEN
Result: Bugs are irreproducible, VOPR is useless.
The Solution: Offline-Only LLMs
LLMs operate before or after VOPR runs, never during:
✅ GOOD: LLM offline
1. GENERATE (offline)
┌─────────────┐
│ LLM │ → scenario.json (validated)
└─────────────┘
2. EXECUTE (deterministic)
┌─────────────┐
│ VOPR run │ → same seed = same execution
│ (seed=42) │
└─────────────┘
3. ANALYZE (offline)
┌─────────────┐
│ LLM │ → hypothesis + suggestions
└─────────────┘
Result: Determinism preserved, LLMs enhance testing.
Architecture
Strict Separation
┌────────────────────────────────────────────────┐
│ LLM Layer (offline) │
│ - Generates scenario JSON │
│ - Analyzes failure traces │
│ - Suggests mutations │
│ - Helps shrink test cases │
└───────────────┬────────────────────────────────┘
│ JSON only
▼
┌────────────────────────────────────────────────┐
│ Validation Layer (deterministic) │
│ - Schema validation │
│ - Whitelist checks (fault types, mutations) │
│ - Range checks (probabilities [0.0, 1.0]) │
│ - Forbidden directive scan │
└───────────────┬────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ VOPR (deterministic execution) │
│ - Hard invariants decide pass/fail │
│ - No LLM influence on correctness │
└────────────────────────────────────────────────┘
Safety Guarantees
LLMs CANNOT:
- Influence deterministic execution
- Override invariant decisions
- Inject nondeterminism mid-simulation
- Skip checks or disable faults
- Modify seeds or RNG state
LLMs CAN:
- Generate scenario JSON (validated before use)
- Analyze failure traces (post-mortem only)
- Suggest code paths to investigate
- Recommend mutations to try
- Assist with test case reduction
Use Case 1: Scenario Generation
Goal
Generate adversarial scenarios to stress-test specific properties (e.g., view changes, MVCC visibility, tenant isolation).
Workflow
1. Generate Prompt
use kimberlite_sim::llm_integration::prompt_for_scenario_generation;

// Target property and existing scenario names (arguments illustrative).
let prompt = prompt_for_scenario_generation(
    "view changes under packet loss",
    &["baseline", "swizzle_clogging"],
);
// Output:
// "You are a distributed systems testing expert. Generate a VOPR scenario
// to stress-test: view changes under packet loss
//
// Existing scenarios:
// - baseline
// - swizzle_clogging
//
// Requirements:
// - Focus on realistic adversarial conditions
// - Use fault injection types: network_partition, packet_delay, packet_drop,
// storage_corruption, crash
// - Keep probabilities low (0.001 - 0.05 range)
// - Provide clear rationale
//
// Output valid JSON matching LlmScenarioSuggestion schema."
2. Call LLM (Claude, GPT, etc.)
// Using Claude API (example)
// `call_claude_api` is a hypothetical helper wrapping the Anthropic API.
let llm_response = call_claude_api(&prompt)?;
LLM returns JSON:
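The exact schema is defined in `llm_integration.rs`; a hypothetical response of roughly the right shape (field names illustrative, values consistent with the prompt requirements above):

```json
{
  "name": "view_change_packet_loss",
  "rationale": "Stress view changes by dropping prepare/commit messages during leader transitions.",
  "faults": [
    { "fault_type": "packet_drop", "probability": 0.02 },
    { "fault_type": "network_partition", "probability": 0.005 }
  ],
  "workload": { "ops_per_sec": 500, "tenants": 4 }
}
```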
3. Validate (CRITICAL STEP)
use kimberlite_sim::llm_integration::{validate_llm_scenario, LlmScenarioSuggestion};

let suggestion: LlmScenarioSuggestion = serde_json::from_str(&llm_response)?;
// Validation checks:
// - All probabilities in [0.0, 1.0]
// - Only known fault types (whitelist)
// - Workload parameters reasonable (ops/sec < 100k, tenants < 1000)
// - No attempts to inject nondeterminism
validate_llm_scenario(&suggestion)?;
If validation fails:
❌ LLM scenario validation failed:
- Unknown fault type: "nuclear_launch" (allowed: packet_delay, packet_drop, ...)
- Probability out of range: packet_delay=1.5 (must be in [0.0, 1.0])
4. Convert to VOPR Config
// `VoprRunner` is an illustrative name for the deterministic runner.
let config = from_llm_suggestion(&suggestion)?;
let mut runner = VoprRunner::new(config);
runner.run()?;
Now VOPR runs deterministically with the LLM-generated scenario.
Safety Mechanism
Validation is mandatory defense-in-depth:
- Whitelist of allowed fault types
- Range checks on all numeric values
- Schema enforcement (JSON structure)
- Forbidden directive scanning
LLMs can’t bypass this validation.
Use Case 2: Failure Analysis
Goal
When VOPR detects an invariant violation, use an LLM to suggest root causes and next steps.
Workflow
1. Collect Failure Data
use kimberlite_sim::llm_integration::FailureTrace;

// Fields shown are those used in the prompt below; the exact schema may differ.
let trace = FailureTrace {
    seed: 42,
    scenario: "combined".to_string(),
    violated_invariant: "LinearizabilityChecker".to_string(),
    message: "Read observed stale value".to_string(),
    recent_events: vec![/* last N simulator events */],
    stats: Default::default(),
};
2. Generate Analysis Prompt
let prompt = prompt_for_failure_analysis(&trace);
// Output:
// "Analyze this VOPR failure:
//
// Seed: 42
// Scenario: combined
// Violated Invariant: LinearizabilityChecker
// Message: Read observed stale value
//
// Recent Events:
// - [1000ms] NetworkPartition applied
// - [1005ms] Client write: key=x, value=1
// - [1010ms] Client read: key=x, observed=0 (expected=1)
//
// Stats:
// - Events processed: 5000
// - Fault injections: 12
//
// Provide:
// 1. Root cause hypothesis
// 2. Related invariants to check
// 3. Suggested mutations to isolate bug
// 4. Relevant code paths"
3. Call LLM
let llm_response = call_claude_api(&prompt)?;
LLM returns JSON:
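A hypothetical response (field names illustrative; the values mirror the human-review rendering shown in step 5):

```json
{
  "root_cause_hypothesis": "Network partition caused write to commit on majority but not reach the replica serving the read.",
  "confidence": 0.8,
  "related_invariants": ["vsr_agreement", "replica_consistency", "read_your_writes"],
  "suggested_mutations": [
    "Increase partition probability to 0.1",
    "Add repair delay to prevent quick catchup",
    "Run with single-client workload to isolate"
  ],
  "code_paths": [
    "crates/kimberlite-vsr/src/replica.rs:prepare_phase",
    "crates/kimberlite-vsr/src/replica.rs:commit_phase"
  ]
}
```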
4. Validate (CRITICAL STEP)
use kimberlite_sim::llm_integration::{validate_llm_analysis, LlmFailureAnalysis};

let analysis: LlmFailureAnalysis = serde_json::from_str(&llm_response)?;
// Validation checks:
// - Confidence in [0.0, 1.0]
// - No forbidden directives ("skip_invariant", "override_seed", etc.)
// - Field lengths reasonable (< 10k chars)
validate_llm_analysis(&analysis)?;
5. Human Reviews
The analysis is presented to a human:
Root Cause Hypothesis (confidence: 80%):
Network partition caused write to commit on majority but not reach the
replica serving the read. Read served stale data from minority partition.
Related Invariants:
- vsr_agreement
- replica_consistency
- read_your_writes
Suggested Mutations:
1. Increase partition probability to 0.1
2. Add repair delay to prevent quick catchup
3. Run with single-client workload to isolate
Relevant Code Paths:
- crates/kimberlite-vsr/src/replica.rs:prepare_phase
- crates/kimberlite-vsr/src/replica.rs:commit_phase
Human decides whether to follow the suggestions.
Safety Mechanism
- LLM never decides correctness (invariants do that)
- LLM output is informational only
- Human reviews before taking action
- Forbidden directives blocked (“skip this check”, “assume this is fine”)
Use Case 3: Test Case Shrinking
Goal
When a failure is found, reduce it to the minimal reproducing case (delta debugging).
Workflow
1. Start with Full Failure
use kimberlite_sim::llm_integration::TestCaseShrinker;

let events = vec![/* full failing event sequence */];
let mut shrinker = TestCaseShrinker::new(events);
2. Binary Search for Minimal Subset
while let Some(candidate) = shrinker.next_candidate() {
    // Replay deterministically and report whether the bug still reproduces
    // (`replay`, `report`, and `minimal_case` are illustrative names).
    let still_fails = replay(&candidate);
    shrinker.report(still_fails);
}
println!("Minimal case: {} events", shrinker.minimal_case().len());
3. LLM-Assisted Heuristics (Optional)
Instead of binary search, ask LLM which events to try removing first:
Prompt: "Given this failure trace, which events are most likely irrelevant?
[Event list]
Focus on events related to: LinearizabilityChecker violation"
LLM: "Events e10, e15, e20 are likely unrelated (they're tenant 2 operations,
but the failure is in tenant 1). Try removing those first."
This is a heuristic: the LLM doesn't decide anything, it just guides the search order.
Safety Mechanism
- Validation always checks whether the bug still reproduces
- LLM can’t force a “minimal case” that doesn’t actually fail
- Human verifies final minimal case
Use Case 4: Mutation Suggestions
Goal
VOPR ran without violations. LLM suggests variations that might trigger dormant bugs.
Workflow
1. Identify Invariants That Didn’t Trigger
let invariants_not_violated = vec![
    "vsr_view_change_safety",
    "projection_mvcc_visibility",
];
2. Generate Mutation Prompt
// Arguments illustrative: scenario name plus the untriggered invariants.
let prompt = prompt_for_mutation_suggestions("combined", &invariants_not_violated);
// Output:
// "Scenario 'combined' ran but did NOT violate these invariants:
// - vsr_view_change_safety
// - projection_mvcc_visibility
//
// Suggest mutations to stress these invariants specifically."
3. Call LLM
LLM returns suggestions:
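A hypothetical suggestion (field names illustrative; the mutation type and parameter match the whitelist and the applied value in step 5):

```json
{
  "mutation_type": "increase_fault_rate",
  "parameters": { "network_fault_rate": 0.05 },
  "rationale": "Higher packet loss forces more view changes, stressing vsr_view_change_safety."
}
```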
4. Validate
use kimberlite_sim::llm_integration::{validate_llm_mutation, LlmMutationSuggestion};

let mutation: LlmMutationSuggestion = serde_json::from_str(&llm_response)?;
// Validation checks:
// - Only known mutation types (increase_fault_rate, add_partition, extend_duration)
// - Parameters within bounds
validate_llm_mutation(&mutation)?;
5. Apply Mutation
// `SimConfig` is an illustrative name for the simulator configuration.
let mut config = SimConfig::default();
config.network_fault_rate = 0.05; // increased from 0.01 per the suggestion
runner.run_with_config(&config)?;
Safety Mechanism
- Whitelist of allowed mutation types
- Parameter bounds enforced
- Mutations don’t bypass invariants (they increase stress, not reduce checks)
Validation: Defense-in-Depth
All LLM outputs pass through mandatory validation:
1. Schema Validation
JSON structure must match expected schema (serde deserialization).
2. Whitelist Checks
Only allow known values:
- Fault types: `packet_delay`, `packet_drop`, `network_partition`, `storage_corruption`, `crash`
- Mutation types: `increase_fault_rate`, `add_partition`, `extend_duration`, `add_workload`, `enable_repair_delay`
Unknown values → rejected.
3. Range Checks
Numeric values must be in bounds:
- Probabilities: `[0.0, 1.0]`
- Operations/sec: < 100,000
- Tenants: < 1,000
- Duration steps: < 10,000,000
Out-of-range → rejected.
4. Forbidden Directive Scan
Reject outputs containing:
"skip_invariant""override_seed""disable_checks""bypass_validation""force_pass"
Case-insensitive substring match.
5. Length Limits
Text fields capped:
- Descriptions: < 10,000 chars
- Rationale: < 5,000 chars
- Code paths: < 500 chars each
Prevents prompt injection or exfiltration attempts.
Comparison to Naive LLM Integration
| Approach | Determinism | Safety | Usefulness |
|---|---|---|---|
| LLM decides correctness | Broken | Unsafe | High risk |
| LLM in VOPR loop | Broken | Unsafe | Nondeterministic |
| LLM offline (validated) | Preserved | Safe | High value |
Kimberlite uses LLM offline (validated) exclusively.
Note: Planned LLM integration enhancements are documented in ROADMAP.md.
Best Practices
✅ DO
- Always validate LLM output before using it
- Use LLMs for idea generation, not decision-making
- Keep LLMs offline (before/after VOPR runs, never during)
- Review LLM suggestions before acting
- Track LLM usage in logs (prompt + response for audit)
❌ DON’T
- Let LLMs decide invariant pass/fail
- Use LLMs during deterministic execution
- Skip validation (“it’s just a suggestion”)
- Blindly apply LLM-generated mutations
- Use LLMs for security-critical decisions
Example: End-to-End Workflow
Goal: Find bugs in view change logic.
Step 1: Generate Scenario
- Prompt the LLM with the target property (offline)
- Validate the returned JSON before use
Step 2: Run VOPR
- Execute the validated scenario deterministically with a fixed seed
Step 3: Analyze Failure (if any)
- Extract the failure trace
- Ask the LLM for an analysis (offline)
- Review the hypothesis before acting
Step 4: Iterate
- Apply suggested mutations
- Re-run VOPR
- Compare results
Result: LLM-guided testing workflow, determinism preserved.
References
- Implementation: /crates/kimberlite-sim/src/llm_integration.rs
- Tests: /crates/kimberlite-sim/src/llm_integration.rs (11 tests)
- Philosophy: /docs/TESTING.md (VOPR section)
Last Updated: 2026-02-02
Status: Phase 9 complete (core functionality), CLI tools planned