Fault-Tolerant Verification
Jepsen-style testing for distributed SPI systems
Gay.jl includes a fault injection framework inspired by Jepsen for testing distributed system assumptions.
Fault Types
using Gay: FaultInjector, inject!, heal_all!
injector = FaultInjector(seed=42)
# Network partition between device groups
inject!(injector, NetworkPartition(
partition_a = Set([0, 1]),
partition_b = Set([2, 3]),
duration_ms = 1000.0
))
# Message delays
inject!(injector, MessageDelay(
target_devices = Set([0]),
delay_ms = 100.0,
probability = 0.5
))
# Bit flips (simulates memory/network corruption)
inject!(injector, BitFlip(
target_devices = Set([1]),
n_bits = 10,
probability = 1.0
))
# Byzantine faults (arbitrary malicious behavior)
inject!(injector, Byzantine(
target_devices = Set([2]),
probability = 0.1
))Simulated Cluster Testing
using Gay: SimulatedCluster, run_inference!, verify!
# Create test cluster
cluster = SimulatedCluster(
[("Device A", 8.0), ("Device B", 8.0)],
n_layers = 32,
n_tokens = 128,
hidden_dim = 4096
)
# Run clean inference
run_inference!(cluster; with_faults=false)
pass, errors = verify!(cluster)
@assert pass
# Inject faults and verify detection
inject!(cluster, :bit_flip; device=0, n_bits=10)
run_inference!(cluster; with_faults=true)
pass, errors = verify!(cluster)
@assert !pass # Corruption detected!Galois Connection Verification
The fault-tolerant module maintains Galois connection invariants at each step:
using Gay: GaloisConnection, alpha, gamma, verify_closure
gc = GaloisConnection(GAY_SEED)
# α(e) = hash(e) mod 226 — abstraction
color = alpha(gc, event)
# γ(c) = representative(c) — concretization
representative = gamma(gc, color)
# Closure property: α(γ(c)) = c for all colors
@assert verify_all_closures(gc)Bidirectional Color Tracking
Track colors through forward and backward passes:
using Gay: BidirectionalTracker, track_forward!, track_backward!, verify_consistency!
tracker = BidirectionalTracker(GAY_SEED)
# Forward pass
for layer in 1:32, token in 1:128
track_forward!(tracker, layer, token, 1)
end
# Backward pass (gradients)
for layer in 32:-1:1, token in 128:-1:1
track_backward!(tracker, layer, token, 1)
end
# Verify consistency
consistent, errors = verify_consistency!(tracker)
@assert consistentStatistical Fault Detection
Run multiple iterations with random fault injection:
using Gay: verify_with_fault_injection
stats = verify_with_fault_injection(
cluster;
fault_types = [:bit_flip, :message_delay],
n_iterations = 100
)
println("Detection rate: $(stats[:detection_rate] * 100)%")
println("False positive rate: $(stats[:false_positive_rate] * 100)%")
println("Galois violations: $(stats[:galois_violations])")Demo
using Gay: demo_fault_tolerant
demo_fault_tolerant()Output:
═══════════════════════════════════════════════════════════════════════
FAULT-TOLERANT SPI VERIFICATION DEMO
═══════════════════════════════════════════════════════════════════════
1. Setting up simulated cluster...
MacBook Pro: layers 1:22, expected fp=0x12345678
MacBook Air: layers 23:32, expected fp=0xabcdef01
2. Verifying Galois connection...
◆ All 226 colors satisfy α(γ(c)) = c
3. Running clean inference...
Result: ◆ PASS
4. Testing fault injection...
Bit flip (10 bits): ◇ DETECTED
Bit flip (100 bits): ◇ DETECTED
5. Testing bidirectional color tracking...
Forward/backward consistency: ◆ PASS
Galois closure at all steps: ◆ PASS
6. Running fault detection statistics...
Detection rate: 95.0%
False positive rate: 0.0%
Galois violations: 0See also:
- Distributed SPI — Tensor-parallel verification
- Galois Connections — Mathematical proofs