Failure & Recovery
Chio is fail-closed by default. A panic in a guard, a poisoned mutex, an exhausted WASM fuel budget, an unreachable provider: each of these produces Verdict::Deny with a structured reason, and each leaves an evidence trail. This page is the contract for which condition produces which verdict, what the circuit breaker does about it, how retries are bounded, and the recovery levers operators reach for.
Source
crates/chio-guards/src/external/circuit_breaker.rs, crates/chio-guards/src/external/retry.rs, and crates/chio-guards/src/external/mod.rs.
Failure-Mode Matrix
| Failure | Verdict | Reason class | Side effects |
|---|---|---|---|
| Caught panic in guard | Deny | trap | tracing::error! with backtrace |
| Poisoned mutex | Deny | policy | May invalidate session for session-aware guards |
| WASM fuel exhausted | Deny | fuel | Guard quarantined until reset |
| WASM trap | Deny | trap | Guard module marked unhealthy |
| Circuit open (default) | Deny | policy | No provider call attempted |
| Circuit open (advisory) | Allow | policy | Opt-in via CircuitOpenVerdict::Allow |
| Parse error (request payload) | Deny | policy | Evidence carries the parse failure |
| Regex compile failure (config) | guard skipped | n/a | tracing::warn!; pipeline may be permissive |
| Network timeout (transient) | retries; Deny on exhaust | policy | Counts as failure for the breaker |
| Permanent error (4xx, malformed) | Deny | policy | No retry; not configurable |
| Session journal unavailable | Deny | policy | Affects only session-aware guards |
| Receipt store write failure | Deny | policy | Kernel cannot record evidence |
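The matrix can be read as a total function from failure to verdict and reason class. The sketch below is illustrative only: the `Failure`, `ReasonClass`, and `Verdict` types and the `classify` function are hypothetical stand-ins, not the types in chio-guards. Regex compile failure is left out because that row skips the guard instead of producing a verdict.

```rust
/// Hypothetical failure categories mirroring the rows of the matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Failure {
    GuardPanic,
    PoisonedMutex,
    FuelExhausted,
    WasmTrap,
    CircuitOpen { advisory: bool },
    ParseError,
    RetriesExhausted,
    PermanentError,
    SessionJournalUnavailable,
    ReceiptStoreWriteFailure,
}

/// Hypothetical reason classes, named after the "Reason class" column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReasonClass {
    Trap,
    Policy,
    Fuel,
}

/// Stand-in for the real verdict type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Allow,
    Deny,
}

/// Map each failure to (verdict, reason class) as the table describes.
fn classify(failure: Failure) -> (Verdict, ReasonClass) {
    use Failure::*;
    match failure {
        GuardPanic | WasmTrap => (Verdict::Deny, ReasonClass::Trap),
        FuelExhausted => (Verdict::Deny, ReasonClass::Fuel),
        // The only Allow in the matrix: a breaker explicitly opted in to advisory mode.
        CircuitOpen { advisory: true } => (Verdict::Allow, ReasonClass::Policy),
        CircuitOpen { advisory: false }
        | PoisonedMutex
        | ParseError
        | RetriesExhausted
        | PermanentError
        | SessionJournalUnavailable
        | ReceiptStoreWriteFailure => (Verdict::Deny, ReasonClass::Policy),
    }
}
```

Writing the match without a wildcard arm means a newly added failure kind fails to compile until someone decides its verdict, which is one way to keep a fail-closed table honest.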
Regex compile failure is the one footgun
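A guard whose regex fails to compile is skipped with a warning rather than denied, so the pipeline may be permissive. Run chio check before deployment so this turns into a load-time error rather than a runtime hole.
Circuit Breaker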
External-guard adapters wrap their inner provider call in a three-state breaker (crates/chio-guards/src/external/circuit_breaker.rs):
pub enum CircuitState {
    Closed,   // Normal operation. Failures count toward a sliding window.
    Open,     // Fail-fast. Calls short-circuit until reset_timeout elapses.
    HalfOpen, // Probing. A bounded number of trial calls test recovery.
}
The transitions:
- Closed to Open. When the number of failures inside the rolling failure_window reaches failure_threshold.
- Open to HalfOpen. When reset_timeout has elapsed since the breaker opened. The next call admitted moves to HalfOpen.
- HalfOpen to Closed. After success_threshold consecutive successes.
- HalfOpen to Open. Any single failure during probing reopens the breaker.
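The four transitions above can be sketched as a small state machine. This is a minimal illustration, not the code in circuit_breaker.rs: the `Breaker` and `Config` types and their method names are assumptions, time is passed in explicitly so the logic is easy to test, and the cap on concurrent HalfOpen trial calls is omitted.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CircuitState {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical config; field names follow the transitions described above.
struct Config {
    failure_threshold: u32,
    failure_window: Duration,
    success_threshold: u32,
    reset_timeout: Duration,
}

struct Breaker {
    cfg: Config,
    state: CircuitState,
    failures: VecDeque<Instant>, // failure timestamps inside the rolling window
    successes: u32,              // consecutive successes while HalfOpen
    opened_at: Option<Instant>,
}

impl Breaker {
    fn new(cfg: Config) -> Self {
        Self { cfg, state: CircuitState::Closed, failures: VecDeque::new(), successes: 0, opened_at: None }
    }

    /// Returns true if a call may proceed. Moves Open -> HalfOpen once
    /// reset_timeout has elapsed. (This sketch does not bound trial calls.)
    fn allow_call(&mut self, now: Instant) -> bool {
        if self.state == CircuitState::Open {
            let elapsed = self.opened_at.map_or(Duration::ZERO, |t| now.duration_since(t));
            if elapsed < self.cfg.reset_timeout {
                return false;
            }
            self.state = CircuitState::HalfOpen;
            self.successes = 0;
        }
        true
    }

    /// HalfOpen -> Closed after success_threshold consecutive successes.
    fn record_success(&mut self) {
        if self.state == CircuitState::HalfOpen {
            self.successes += 1;
            if self.successes >= self.cfg.success_threshold {
                self.state = CircuitState::Closed;
                self.failures.clear();
            }
        }
    }

    fn record_failure(&mut self, now: Instant) {
        match self.state {
            // Any single failure while probing reopens the breaker.
            CircuitState::HalfOpen => self.open(now),
            CircuitState::Closed => {
                self.failures.push_back(now);
                // Evict failures that have aged out of the rolling window.
                while let Some(&t) = self.failures.front() {
                    if now.duration_since(t) > self.cfg.failure_window {
                        self.failures.pop_front();
                    } else {
                        break;
                    }
                }
                if self.failures.len() as u32 >= self.cfg.failure_threshold {
                    self.open(now);
                }
            }
            CircuitState::Open => {}
        }
    }

    fn open(&mut self, now: Instant) {
        self.state = CircuitState::Open;
        self.opened_at = Some(now);
        self.failures.clear();
    }
}
```

Taking `now` as a parameter instead of calling `Instant::now()` internally is a testing convenience in this sketch; the real breaker may read the clock itself.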
pub struct CircuitBreakerConfig {
    pub failure_threshold: u32,   // default 5
    pub failure_window: Duration, // default 60s
    pub success_threshold: u32,   // default 2
    pub reset_timeout: Duration,  // default 30s
}
CircuitOpenVerdict
When the breaker is Open, the adapter does not call the provider. Instead it returns a configurable verdict:
pub enum CircuitOpenVerdict {
    Deny,  // Fail-closed. Default.
    Allow, // Fail-open. Advisory only.
}
Pick Allow only when letting a request through unchecked is preferable to denying it. The two legitimate cases:
- The guard's output feeds a human-review queue rather than gating an action. A missed signal is recoverable.
- The guard is one of several independent layers and the others are still in the synchronous chain. Losing one advisory layer is not a security regression.
Never fail-open a last-line-of-defense guard
Allow turns the breaker into a single-point-of-failure exfil channel.
RateLimitedVerdict
Same shape, different trigger. When the adapter's token bucket is empty:
pub enum RateLimitedVerdict {
    Deny,  // Default. The QPS budget is exhausted; deny the request.
    Allow, // Advisory: allow the request even though the budget is exhausted.
}
Retry Strategies
Adapters retry transient and timeout errors only. 4xx and malformed-request errors are permanent; they short-circuit the retry loop and become Verdict::Deny.
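The retry rule can be sketched as a loop that classifies each error before deciding whether to try again. The `CallError` enum, the local `Verdict` stand-in, and `retry_sync` are hypothetical names for illustration; the real adapter is asynchronous and sleeps between attempts, which this sketch leaves out.

```rust
/// Hypothetical error classification; the real adapter error type may differ.
#[derive(Debug, PartialEq)]
enum CallError {
    Transient, // e.g. connection reset
    Timeout,
    Permanent, // e.g. 4xx, malformed request
}

/// Stand-in for the real verdict type.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Deny,
}

/// Run `call` at most `max_retries + 1` times. The closure receives the
/// 1-based attempt number. Returns the verdict and the attempts used.
fn retry_sync<F>(max_retries: u32, mut call: F) -> (Verdict, u32)
where
    F: FnMut(u32) -> Result<Verdict, CallError>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match call(attempts) {
            Ok(verdict) => return (verdict, attempts),
            // Permanent errors short-circuit: no further attempts.
            Err(CallError::Permanent) => return (Verdict::Deny, attempts),
            Err(CallError::Transient) | Err(CallError::Timeout) => {
                if attempts > max_retries {
                    // Retry budget exhausted: fail closed.
                    return (Verdict::Deny, attempts);
                }
                // A real adapter would sleep for the backoff delay here.
            }
        }
    }
}
```

With `max_retries = 3`, a provider that always times out is called four times before the loop returns Deny; a permanent error returns Deny after the first call.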
pub enum BackoffStrategy {
    Exponential, // base_delay * 2^(attempt - 1)
    Constant,    // base_delay
    Linear,      // base_delay * attempt
}
pub struct RetryConfig {
    pub max_retries: u32,          // default 3 (so 4 total attempts)
    pub base_delay: Duration,      // default 100ms
    pub max_delay: Duration,       // default 5s
    pub jitter_fraction: f64,      // default 0.25 (clamped to [0.0, 1.0])
    pub strategy: BackoffStrategy, // default Exponential
}
Three things to know:
- Total attempts is max_retries + 1. A value of 0 means "run exactly once, no retries".
- Jitter is bounded multiplicative. The configured delay is multiplied by 1 + uniform(-jitter_fraction, +jitter_fraction). Defaults give a ±25% spread, which is enough to avoid thundering herd while keeping the worst-case bounded.
- Jitter is deterministic by default. The retry RNG is seeded from max_retries so tests reproduce. Production callers wanting non-deterministic jitter can use retry_with_jitter_rng.
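The delay arithmetic can be sketched as two pure functions. This is an illustration under stated assumptions, not the code in retry.rs: `base_backoff` and `apply_jitter` are hypothetical names, the `max_delay` cap is assumed to apply before jitter, and the random sample is passed in as a plain number in [0, 1] so no RNG crate is needed.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy)]
enum BackoffStrategy {
    Exponential,
    Constant,
    Linear,
}

/// Un-jittered delay before retry number `attempt` (1-based), capped at `max_delay`.
fn base_backoff(
    strategy: BackoffStrategy,
    base_delay: Duration,
    max_delay: Duration,
    attempt: u32,
) -> Duration {
    let attempt = attempt.max(1);
    let raw = match strategy {
        // base_delay * 2^(attempt - 1), saturating instead of overflowing
        BackoffStrategy::Exponential => {
            base_delay.saturating_mul(2u32.saturating_pow(attempt - 1))
        }
        BackoffStrategy::Constant => base_delay,
        // base_delay * attempt
        BackoffStrategy::Linear => base_delay.saturating_mul(attempt),
    };
    raw.min(max_delay)
}

/// Multiply `delay` by 1 + uniform(-f, +f), where `unit` is a sample in
/// [0, 1] supplied by the caller's RNG and f is the clamped jitter fraction.
fn apply_jitter(delay: Duration, jitter_fraction: f64, unit: f64) -> Duration {
    let f = jitter_fraction.clamp(0.0, 1.0);
    let unit = unit.clamp(0.0, 1.0);
    // Maps unit in [0, 1] onto a factor in [1 - f, 1 + f].
    let factor = 1.0 + f * (2.0 * unit - 1.0);
    delay.mul_f64(factor)
}
```

With the default 100ms base and exponential strategy, the un-jittered delays before the three retries are 100ms, 200ms, and 400ms; a ±25% jitter keeps the third between 300ms and 500ms.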
Backoff selection
Exponential is right for most providers: a temporarily-degraded service recovers faster when retries thin out. Linear suits providers that throttle on a fixed window, where doubling delay overshoots the throttle. Constant is for tests and rare deterministic-cadence health checks.
Composed Adapter Flow
Inside AsyncGuardAdapter, the failure-handling pieces compose in a fixed order:
evaluate(ctx):
  1. CircuitBreaker.allow_call()   -> CircuitOpenVerdict on deny
  2. TtlCache.get(cache_key)       -> cached verdict on hit
  3. TokenBucket.try_acquire()     -> RateLimitedVerdict on empty
  4. retry_with_jitter(inner.eval) -> Verdict::Deny on permanent failure
                                   -> Verdict on success (also cached)
Two invariants matter:
- Cache hits do not spend rate-limit budget. The cache check precedes the token bucket on purpose: a steady-state hot cache is free.
- Rate-limited calls do not count as breaker failures. Only real attempts at the external service feed the failure window. A bursty client cannot trip the breaker by exceeding the rate limit.
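The ordering and both invariants can be demonstrated with a deliberately simplified model. Everything here is a hypothetical stand-in for AsyncGuardAdapter: the breaker is a boolean plus a failure counter, the cache has no TTL, the token bucket never refills, and retries are elided. Only the order of the four steps follows the flow above.

```rust
use std::collections::HashMap;

/// Stand-in for the real verdict type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Allow,
    Deny,
}

/// Hypothetical, heavily simplified model of the composed adapter.
struct SketchAdapter {
    breaker_open: bool,
    breaker_failures: u32,
    cache: HashMap<String, Verdict>,
    tokens: u32,
    circuit_open_verdict: Verdict,
    rate_limited_verdict: Verdict,
    provider: fn(&str) -> Result<Verdict, ()>,
}

impl SketchAdapter {
    fn evaluate(&mut self, cache_key: &str) -> Verdict {
        // 1. Breaker first: an open breaker never reaches the provider.
        if self.breaker_open {
            return self.circuit_open_verdict;
        }
        // 2. Cache before the token bucket, so hits spend no budget.
        if let Some(verdict) = self.cache.get(cache_key) {
            return *verdict;
        }
        // 3. Token bucket: an empty bucket is not a provider failure,
        //    so the breaker's failure count is left untouched.
        if self.tokens == 0 {
            return self.rate_limited_verdict;
        }
        self.tokens -= 1;
        // 4. Provider call (retries elided). Only this step feeds the breaker.
        match (self.provider)(cache_key) {
            Ok(verdict) => {
                self.cache.insert(cache_key.to_string(), verdict);
                verdict
            }
            Err(()) => {
                self.breaker_failures += 1;
                Verdict::Deny
            }
        }
    }
}

/// Stub provider used to exercise the sketch.
fn always_allow(_cache_key: &str) -> Result<Verdict, ()> {
    Ok(Verdict::Allow)
}
```

Starting with a single token and the `always_allow` stub, the first evaluation of a key spends the token and caches the result, a repeat of the same key still succeeds from the cache with the bucket empty, and a new key gets the rate-limited verdict without incrementing `breaker_failures`.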
Recovery Patterns
Hot-Reload a WASM Guard
A WASM guard that hits its fuel ceiling or traps repeatedly is quarantined: the bundle store marks it unhealthy and pipeline evaluations short-circuit to Verdict::Deny. To recover, push a corrected bundle through the hot-reload path. The reload metric chio_guard_reload_total labels by outcome: applied, canary_failed, rolled_back. A canary-failed reload leaves the prior epoch in place; the new bundle is rejected without disturbing in-flight requests.
Manual Circuit-Breaker Reset
The breaker normally drives itself: Open to HalfOpen on reset_timeout elapsing, HalfOpen to Closed on success_threshold consecutive successes. Operators can short-circuit this via the admin API when they have out-of-band confirmation that the provider has recovered. Use sparingly: forcing Closed during a real outage just trips the breaker again on the next request.
Session Journal Repair
Session journals reload from the receipt log on kernel restart. Replay the receipts for an active session and the journal rebuilds. A poisoned-mutex panic that survives a restart points to an underlying invariant break; the repair-on-restart path is the recovery hatch, but the underlying bug deserves a kernel issue.
Worked Example: Tuned for a Flaky Provider
adapter:
  # Cache aggressively because the provider is unreliable.
  cache_ttl_seconds: 120
  cache_capacity: 4096
  # Provider QPS is 50; leave 20% headroom.
  rate_per_second: 40
  rate_burst: 40
  # Rolling window for failure counting. 60s default is fine for a steady
  # 5K req/s replica.
  circuit_failure_window_secs: 60
  circuit_failure_threshold: 5
  circuit_success_threshold: 2
  circuit_reset_timeout_secs: 30
  # Retry transient failures up to twice (3 total attempts), exponential
  # backoff with deterministic jitter.
  retry_max_retries: 2
  retry_base_delay_ms: 100
  retry_max_delay_ms: 5000
  retry_jitter_fraction: 0.25
  retry_strategy: exponential
  # Default behavior: deny when the breaker is open.
  circuit_open_verdict: deny
  rate_limited_verdict: deny
The two changed-from-default knobs are cache_ttl_seconds (raised from 60 to 120) and retry_max_retries (lowered from 3 to 2). A flakier provider deserves a longer cache TTL (more hits, lower call volume) and fewer retries (don't amplify the load on a degraded service).
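To see what lowering retry_max_retries buys, the worst-case backoff sleep can be worked out from the retry settings. The helper below is a hypothetical calculation, not part of chio-guards; it assumes exponential backoff, a cap applied before jitter, and every retry drawing the maximum positive jitter. Provider call time is not included.

```rust
/// Worst-case total backoff sleep in milliseconds for exponential backoff,
/// assuming each delay is capped at `max_ms` and then stretched by the
/// maximum jitter factor (1 + jitter_fraction).
fn worst_case_backoff_ms(
    max_retries: u32,
    base_ms: u64,
    max_ms: u64,
    jitter_fraction: f64,
) -> f64 {
    (1..=max_retries)
        .map(|attempt| {
            // base * 2^(attempt - 1); the shift is clamped to avoid overflow.
            let raw = base_ms.saturating_mul(1u64 << (attempt - 1).min(62));
            raw.min(max_ms) as f64 * (1.0 + jitter_fraction)
        })
        .sum()
}
```

Under those assumptions, the tuned config (2 retries, 100ms base, 0.25 jitter) sleeps at most 125 + 250 = 375ms per request, against 125 + 250 + 500 = 875ms with the default of 3 retries.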
Next Steps
- Fail-Closed Posture · the design contract behind these defaults
- External Adapters · adapter wiring, scoped tool patterns, and HushSpec authoring
- Observability · which signals to watch for early breaker telemetry
- External Guards · operator contract for the six bundled providers