Velocity Guards

The velocity and agent-velocity guards are chio's rate-limiting layer. They sit in the guard pipeline before any budget store is touched, and they throttle both call frequency and monetary spend using a token-bucket algorithm. This page is a deep-dive: the bucket model, the integer arithmetic that keeps it drift-free, how burst is configured, what evidence appears in receipts, and how agent-velocity extends the mechanic across capabilities.

Per-guard reference lives in Guards

The Guards page summarizes every guard in the pipeline. This page is the longer-form reference for the two velocity guards. Other guards such as forbidden-path, egress-allowlist, and secret-leak are stateless and covered in the main Guards page.

What Velocity Does

velocity caps how often a single capability grant can be exercised and how much money it can spend within a sliding time window. A request that would exceed either cap returns Verdict::Deny and never reaches the downstream budget store, so it costs no money and leaves no reservation behind. Note that the kernel-wide Verdict enum has three variants (Allow, Deny, PendingApproval); PendingApproval is emitted only by the approval pipeline in chio-kernel, never by the velocity guards. The guard is synchronous, cheap, and sits early enough in the pipeline that a rate-limited agent burns very little kernel time per denied call.

Two rate dimensions are tracked independently: invocation count and monetary spend. Either or both can be enabled on a grant. When both are set, both buckets must have capacity for a request to proceed. The invocation bucket is checked first. If that denies, the spend bucket is never consulted.

When the Guard Runs

The velocity guard evaluates on every tool invocation request, before the tool server is contacted. It runs after the cheapest stateless guards (forbidden-path, path-allowlist, shell-command, egress-allowlist, mcp-tool, and the write-oriented guards) so that requests denied for structural reasons do not touch the rate-limit buckets.


Token Bucket Model

Each velocity bucket is a classical token bucket with a capacity and a continuous refill rate. When a request arrives, the guard attempts to consume one token (for invocation counting) or a cost-weighted number of tokens (for spend). If the bucket has enough tokens, the request passes and the tokens are removed. If not, the guard denies the request without consuming anything.

The bucket is keyed by the pair (capability_id, grant_index). Every grant inside every capability token gets its own bucket. Two grants in the same capability do not share state. Two different capability tokens never share state. This isolation matters: an agent that holds two capabilities for the same tool sees two independent ceilings, one per capability.

bash
# Bucket key: (capability_id, grant_index)
#
# Same tool, same server, two capabilities => two buckets.
# Same capability, two grants for different tools => two buckets.
# Same capability, same grant => one bucket across all requests.
#
# Bucket state:
#   tokens_milli: i64        // current token balance in milli-tokens
#   last_refill_ms: i64      // last time we refilled (epoch ms)
#   capacity_milli: i64      // burst ceiling in milli-tokens
#   refill_rate_milli_per_ms: i64   // refill step per millisecond
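
As a rough illustration of the keying rule, the sketch below stores buckets in a HashMap keyed by the (capability_id, grant_index) pair. The BucketStore type, the bucket_for method, and the choice of String and usize for the key are assumptions made for this example, not chio's actual types. The refill-rate field is left out for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical key type: one bucket per (capability_id, grant_index) pair.
type BucketKey = (String, usize);

/// Bucket state, following the field names listed above (refill rate omitted).
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    tokens_milli: i64,
    last_refill_ms: i64,
    capacity_milli: i64,
}

/// Illustrative in-memory store; the name and shape are assumptions.
#[derive(Default)]
struct BucketStore {
    buckets: HashMap<BucketKey, Bucket>,
}

impl BucketStore {
    /// Return the bucket for this grant, creating a full one on first use.
    fn bucket_for(
        &mut self,
        capability_id: &str,
        grant_index: usize,
        capacity_milli: i64,
        now_ms: i64,
    ) -> &mut Bucket {
        self.buckets
            .entry((capability_id.to_string(), grant_index))
            .or_insert(Bucket {
                tokens_milli: capacity_milli,
                last_refill_ms: now_ms,
                capacity_milli,
            })
    }
}
```

Draining one token from ("cap-a", 0) leaves ("cap-a", 1) and ("cap-b", 0) untouched, which is the isolation property described above.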

Why Milli-Tokens

Classical token-bucket implementations use floating-point rates and balances. That opens a reproducibility hole: at very high call rates or very small windows, floating-point drift can accumulate across refill steps and deny or allow a request that the policy does not intend. To close this, chio's velocity guard scales every quantity by 1000 and does the arithmetic in signed 64-bit integers. One logical token equals 1000 internal milli-tokens. Refill is computed as (now_ms - last_refill_ms) * rate_milli_per_ms and saturates at capacity_milli.

The internal bucket balance and refill are integer milli-tokens; capacity is derived once per bucket via round(max_invocations * burst_factor) where burst_factor is an f64 on VelocityConfig. The result is floored at 1, cached as an integer, and never recomputed per request. Cost weighting is also integer: the grant's max_cost_per_invocation.units is already in minor currency units (cents, or the equivalent for other currencies). Each request deducts units * 1000 milli-tokens from the spend bucket, so the hot path (refill-and-consume, per request) runs entirely in integer arithmetic.
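
The refill-and-consume step can be sketched in a few lines of integer-only Rust. This is a minimal sketch, not the guard's source: the rate is stored here as milli-tokens per window alongside the window length (an assumed representation, chosen so that slow rates do not truncate to zero), and sub-milli-token remainders are simply dropped.

```rust
/// One logical token is 1000 internal milli-tokens.
const MILLI: i64 = 1_000;

/// Illustrative bucket; fields beyond the documented ones are assumptions.
struct Bucket {
    tokens_milli: i64,
    last_refill_ms: i64,
    capacity_milli: i64,
    /// Milli-tokens credited per full window (assumed representation).
    rate_milli_per_window: i64,
    window_ms: i64,
}

impl Bucket {
    /// Credit elapsed time, saturating at capacity. No floating point.
    fn refill(&mut self, now_ms: i64) {
        let elapsed_ms = (now_ms - self.last_refill_ms).max(0);
        // Multiply before dividing so slow rates do not truncate to zero.
        let credit = elapsed_ms.saturating_mul(self.rate_milli_per_window) / self.window_ms;
        if credit > 0 {
            self.tokens_milli = self
                .tokens_milli
                .saturating_add(credit)
                .min(self.capacity_milli);
            self.last_refill_ms = now_ms;
        }
    }

    /// Refill, then remove `cost_milli` only if the balance covers it.
    /// A denied request leaves the balance untouched.
    fn try_consume(&mut self, cost_milli: i64, now_ms: i64) -> bool {
        self.refill(now_ms);
        if self.tokens_milli >= cost_milli {
            self.tokens_milli -= cost_milli;
            true
        } else {
            false
        }
    }
}
```

An invocation check would call try_consume(MILLI, now_ms); a spend check would pass units * MILLI, where units is the grant's planned cost in minor currency units.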

Why integer math over a mutex is enough

The bucket is protected by a std::sync::Mutex and held for the duration of a single refill-and-consume step. At the scales chio targets (single-digit millions of requests per day per kernel), mutex contention is negligible compared to the guard's own regex and glob checks upstream. The velocity guard is not the bottleneck.
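
To make the locking pattern concrete, here is a small sketch in which several threads race on one bucket. SharedBucket and race are hypothetical names, and refill is omitted so the example stays focused on the mutex: the check and the deduction happen under a single lock acquisition, so the total number of allowed calls can never exceed the capacity.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Minimal shared bucket: only a milli-token balance (refill omitted).
struct SharedBucket {
    tokens_milli: Mutex<i64>,
}

impl SharedBucket {
    /// Check and deduct under one lock acquisition, so two callers can
    /// never both spend the same token.
    fn try_consume(&self, cost_milli: i64) -> bool {
        let mut balance = self.tokens_milli.lock().unwrap();
        if *balance >= cost_milli {
            *balance -= cost_milli;
            true
        } else {
            false
        }
    }
}

/// Spawn `threads` workers that each attempt `attempts` one-token calls
/// and return how many calls were allowed in total.
fn race(capacity_tokens: i64, threads: usize, attempts: usize) -> usize {
    let bucket = Arc::new(SharedBucket {
        tokens_milli: Mutex::new(capacity_tokens * 1_000),
    });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let bucket = Arc::clone(&bucket);
            thread::spawn(move || {
                (0..attempts)
                    .filter(|_| bucket.try_consume(1_000))
                    .count()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

With a 50-token bucket and eight threads making 100 attempts each, exactly 50 calls are allowed regardless of how the threads interleave.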

Burst Capacity and Steady Rate

Two numbers fully describe a bucket: its capacity (burst ceiling) and its steady-state refill rate. Both are derived from the policy you write.

bash
# Given a VelocityConfig with:
#   max_invocations_per_window = N
#   window_secs                = W
#   burst_factor               = B   (f64, default 1.0)
#
# The guard derives:
#   capacity       = max(round(N * B), 1)       tokens
#   refill_rate    = N / W                      tokens per second
#
# Internally both are multiplied by 1000:
#   capacity_milli = max(round(N * B), 1) * 1000
#   refill_rate_milli_per_ms = (N * 1000) / (W * 1000) in fixed-point
#
# With burst_factor = 1.0, capacity equals N: one window's allowance, no extra burst headroom.
# With burst_factor = 2.0, the agent can empty a 2N-sized bucket instantly
# and then must wait for tokens to refill at rate N/W per second.

The refill is continuous. There is no discrete window boundary that resets anything. If you configure max_invocations_per_window = 60 with window_secs = 60, you are saying the steady-state allowed rate is one call per second. A burst_factor of 1.0 gives the bucket a capacity of 60 tokens, one full window's allowance: an agent that has been idle can fire up to 60 calls back to back, after which it is held to the one-per-second refill. A burst_factor of 10.0 raises the capacity to 600 tokens while leaving the refill rate unchanged.
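
The derivation can be written out as a small function. This is a sketch under stated assumptions: the Derived struct and derive function are hypothetical, and the refill rate is kept as a ratio (milli-tokens per window over window length in milliseconds) rather than a single per-millisecond integer, so that rates below one milli-token per millisecond stay exact.

```rust
/// Derived bucket parameters in integer units (hypothetical struct).
#[derive(Debug, PartialEq)]
struct Derived {
    capacity_milli: i64,
    /// Milli-tokens credited per window; paired with `window_ms`.
    refill_milli_per_window: i64,
    window_ms: i64,
}

/// Derive capacity and refill rate from N, W, and B as defined above.
fn derive(max_invocations: u64, window_secs: u64, burst_factor: f64) -> Derived {
    // The only floating-point step; it runs once, when the bucket is built.
    let capacity_tokens = ((max_invocations as f64 * burst_factor).round() as i64).max(1);
    Derived {
        capacity_milli: capacity_tokens * 1_000,
        refill_milli_per_window: max_invocations as i64 * 1_000,
        window_ms: window_secs as i64 * 1_000,
    }
}
```

For example, derive(60, 60, 1.0) yields a 60,000 milli-token capacity refilled at 60,000 milli-tokens per 60,000 ms, and derive(1, 60, 0.1) still yields a one-token bucket because of the floor.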

Default Configuration

The default VelocityConfig carries None for both invocation and spend ceilings, a 60-second window, and a burst factor of 1.0. That means by default the guard is effectively disabled: it allows every call. Rate limiting only kicks in when you explicitly set at least one of max_invocations_per_window or max_spend_per_window on the policy.

velocity_config.rs
use chio_guards::velocity::{VelocityConfig, VelocityGuard};

// 30 invocations per minute, no spend cap, burst ceiling equal to one window's allowance (30 calls).
let guard = VelocityGuard::new(VelocityConfig {
    max_invocations_per_window: Some(30),
    max_spend_per_window: None,
    window_secs: 60,
    burst_factor: 1.0,
});

// 10 invocations per minute, with a 20-call burst ceiling.
let guard = VelocityGuard::new(VelocityConfig {
    max_invocations_per_window: Some(10),
    max_spend_per_window: None,
    window_secs: 60,
    burst_factor: 2.0,
});

Policy YAML

In a HushSpec (or HushSpec-equivalent) policy file, the velocity guard is configured like this:

hushspec.yaml
rules:
  velocity:
    # Invocation bucket: 100 calls per 60s, with a 1.5x burst ceiling.
    max_invocations_per_window: 100
    window_secs: 60
    burst_factor: 1.5

    # Spend bucket: 10,000 minor units (e.g. cents) per 60s.
    # Requires max_cost_per_invocation on the matched grant.
    max_spend_per_window: 10000

Both max_invocations_per_window and max_spend_per_window are optional. Omit a field to disable the corresponding bucket. Omitting both is equivalent to not listing the guard at all.


Spend Bucket: Fail-Closed on Missing Metadata

The spend bucket reads the matched grant's max_cost_per_invocation.units to determine how many cost units a request consumes. That value is the planned cost of the call, not the observed cost. If a spend cap is configured but the matched grant does not carry a cost per invocation, the guard does not silently fall back to counting calls. It returns a kernel error, which the pipeline translates to a Deny verdict.

No implicit fallback

A policy that enables max_spend_per_window on a grant with no cost metadata is a misconfiguration. The velocity guard will fail every call on that grant until the grant carries a planned cost. This is intentional: silently treating missing cost as zero would let spend ceilings be bypassed by dropping cost metadata.

The same rule applies even if the invocation bucket would allow the call. Both buckets must be checkable. If either bucket cannot be evaluated for lack of required metadata, the guard denies.
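
The fail-closed rule is easy to express with Option and Result. The sketch below is illustrative only: GuardError, spend_cost_milli, and to_verdict are hypothetical names, and the local Verdict enum merely mirrors the three variants named earlier on this page.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum GuardError {
    MissingCostMetadata,
}

/// Local stand-in mirroring the three kernel-wide variants.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Deny,
    PendingApproval,
}

/// Decide how many milli-tokens a request draws from the spend bucket.
/// `max_spend` is the policy cap; `planned_units` is the grant's
/// max_cost_per_invocation.units, if the grant carries one.
fn spend_cost_milli(
    max_spend: Option<u64>,
    planned_units: Option<u64>,
) -> Result<Option<i64>, GuardError> {
    match (max_spend, planned_units) {
        // Spend bucket disabled: nothing to charge.
        (None, _) => Ok(None),
        // Cap configured but no planned cost: error, never treated as zero.
        (Some(_), None) => Err(GuardError::MissingCostMetadata),
        (Some(_), Some(units)) => Ok(Some(units as i64 * 1_000)),
    }
}

/// Pipeline-side translation: a guard error becomes a Deny.
fn to_verdict(result: Result<bool, GuardError>) -> Verdict {
    match result {
        Ok(true) => Verdict::Allow,
        Ok(false) | Err(_) => Verdict::Deny,
    }
}
```

The key design point is the middle match arm: there is no code path in which a configured spend cap and a missing cost produce a zero-cost charge.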


Agent-Velocity: Cross-Capability Throttle

velocity is keyed on (capability_id, grant_index). That is the right granularity for enforcing per-capability ceilings, but it has a gap: an agent that holds many capabilities can stay under each individual ceiling and still generate an enormous aggregate call rate. agent-velocity closes that gap.

The mechanism is the same token bucket with the same milli-token arithmetic. The key is different: buckets are indexed by agent identity (the subject public key on the capability chain) rather than by capability. Every tool call the agent makes, regardless of which capability is being exercised, consumes from the same bucket.

bash
# velocity          key: (capability_id, grant_index)
# agent-velocity    key: agent_subject_key
#
# A hot-cycling agent that rotates across 20 capabilities can still
# be throttled by agent-velocity because all 20 capabilities feed
# the same per-agent bucket.

When to Enable It

Turn on agent-velocity for agents that hold more than one capability, for shared agents accessed by many end users, and for any workload where a runaway loop across capabilities is a realistic failure mode. Turn it off (or set very high ceilings) for trusted, single-purpose agents where per-capability velocity is already a tight enough cap.

Agent-Velocity Configuration

hushspec.yaml
rules:
  agent_velocity:
    enabled: true
    max_invocations_per_window: 500
    max_spend_per_window: 50000
    window_secs: 60
    burst_factor: 2.0

Configuration keys mirror the per-capability velocity guard. The internal machinery is identical. Only the bucket key differs, which is why the two guards can be composed independently on the same pipeline: a request that passes the agent-level cap still has to pass the per-capability cap, and vice versa.
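
A compact way to picture the composition is two maps with different keys and a both-must-pass check. The Limiter type below is hypothetical and omits refill. It also deducts from both buckets or neither, which is a simplification chosen for this sketch; this page does not specify whether a denial by one guard rolls back a deduction already made by the other.

```rust
use std::collections::HashMap;

/// Illustrative limiter holding both bucket families as milli-token balances.
struct Limiter {
    per_grant: HashMap<(String, usize), i64>, // velocity
    per_agent: HashMap<String, i64>,          // agent-velocity
    grant_capacity_milli: i64,
    agent_capacity_milli: i64,
}

impl Limiter {
    fn new(grant_tokens: i64, agent_tokens: i64) -> Self {
        Limiter {
            per_grant: HashMap::new(),
            per_agent: HashMap::new(),
            grant_capacity_milli: grant_tokens * 1_000,
            agent_capacity_milli: agent_tokens * 1_000,
        }
    }

    /// Allow only if both buckets can cover one token.
    /// Sketch assumption: deduct from both or from neither.
    fn check(&mut self, agent: &str, capability_id: &str, grant_index: usize) -> bool {
        let (agent_cap, grant_cap) = (self.agent_capacity_milli, self.grant_capacity_milli);
        let agent_bal = self.per_agent.entry(agent.to_string()).or_insert(agent_cap);
        let grant_bal = self
            .per_grant
            .entry((capability_id.to_string(), grant_index))
            .or_insert(grant_cap);
        if *agent_bal >= 1_000 && *grant_bal >= 1_000 {
            *agent_bal -= 1_000;
            *grant_bal -= 1_000;
            true
        } else {
            false
        }
    }
}
```

With a two-token per-grant cap and a three-token per-agent cap, an agent that exhausts one capability and rotates to a second gets one more call through before the agent-level bucket stops it.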


Composition with Monetary Budgets

Velocity guards and monetary budgets are independent enforcement layers. The velocity guard runs in the guard pipeline, which happens before the kernel attempts to charge the budget store. A request denied by velocity never reaches try_charge_cost, so it consumes no budget and creates no reservation. A request that passes velocity but runs out of budget is denied at the charge step with a distinct reason code.

This layering matters for observability. A dashboard that sees many denies by velocity is watching a rate-shaping problem. A dashboard that sees many denies by the budget store is watching a spend-ceiling problem. Both should be visible; they are not the same failure mode.
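
The ordering can be sketched as a two-step handler. Only the name try_charge_cost comes from the text above; the BudgetStore struct, its fields, the method signature, and the Outcome enum are assumptions made for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    Allowed,
    DeniedByVelocity,
    DeniedByBudget,
}

/// Hypothetical stand-in for the budget store.
struct BudgetStore {
    remaining_units: u64,
    charge_attempts: u32,
}

impl BudgetStore {
    /// Assumed signature: charge `units` if the remaining budget covers it.
    fn try_charge_cost(&mut self, units: u64) -> bool {
        self.charge_attempts += 1;
        if self.remaining_units >= units {
            self.remaining_units -= units;
            true
        } else {
            false
        }
    }
}

/// Guards run first; the budget store is touched only if velocity allows.
fn handle(velocity_allows: bool, budget: &mut BudgetStore, units: u64) -> Outcome {
    if !velocity_allows {
        return Outcome::DeniedByVelocity;
    }
    if budget.try_charge_cost(units) {
        Outcome::Allowed
    } else {
        Outcome::DeniedByBudget
    }
}
```

A velocity denial returns before try_charge_cost is called, so the store's charge counter and remaining balance are unchanged, and the two denial paths stay distinguishable by outcome.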


Receipt Evidence

Every receipt carries an evidence array with one entry per guard that actually ran. The velocity guard writes an entry whether it allowed or denied the request. The entry captures bucket state before and after the check, so post-hoc analysis can reconstruct why a given request passed or was throttled.

receipt-evidence.json
{
  "evidence": [
    {
      "guard_name": "velocity",
      "verdict": true,
      "details": "invocation bucket: 87.5 / 100 tokens; consumed 1 token; refill rate 100/60s"
    },
    {
      "guard_name": "agent-velocity",
      "verdict": true,
      "details": "agent bucket: 412 / 500 tokens; consumed 1 token"
    }
  ]
}

On a denial the details string records the bucket balance at check time and the token shortfall, so an operator can see at a glance whether the request was a hair over the ceiling or flagrantly above it.

deny-evidence.json
{
  "evidence": [
    {
      "guard_name": "velocity",
      "verdict": false,
      "details": "invocation bucket exhausted: balance 0.0 / 100, needed 1 token; next refill in 420ms"
    }
  ],
  "decision": {
    "deny": {
      "reason": "velocity bucket exhausted",
      "guard": "velocity"
    }
  }
}

Observability

Velocity is a frequent source of operator questions: why did this call get rate-limited, what ceiling tripped, how close are we to denial. The recommended patterns:

  • Query deny receipts by guard name. A spike in velocity denies versus agent-velocity denies tells you whether one capability is being hammered or an agent is hot-cycling across many.
  • Plot bucket balance over time. The details string on allow receipts includes the post-consume balance. Graph that per capability to see how close the steady-state rate is to the ceiling.
  • Correlate with budget-store denies. If velocity denies are rare but budget denies are common, velocity is too loose. If velocity denies dominate and budget denies never fire, the rate cap is doing most of the work.
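
The first pattern, counting denies by guard name, might look like the following once evidence entries have been loaded. The Evidence struct models only the guard_name and verdict fields from the receipt JSON shown earlier; how receipts are fetched and deserialized is out of scope here, and denies_by_guard is a hypothetical helper.

```rust
use std::collections::BTreeMap;

/// Two of the fields from a receipt evidence entry (details omitted).
struct Evidence {
    guard_name: String,
    verdict: bool,
}

/// Count denying evidence entries per guard name.
fn denies_by_guard(entries: &[Evidence]) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for e in entries.iter().filter(|e| !e.verdict) {
        *counts.entry(e.guard_name.as_str()).or_insert(0) += 1;
    }
    counts
}
```

Comparing the "velocity" and "agent-velocity" counts over a time range gives the single-capability versus hot-cycling signal described in the first bullet.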

No metrics endpoint

The velocity guard does not publish bucket state through a separate metrics endpoint. All operator-visible state is in the receipt evidence stream. Feed the receipt query API into your metrics system if you want dashboards.

Failure Modes

A short catalog of what can go wrong and what the guard does about it:

| Scenario | Guard Behavior |
| --- | --- |
| Invocation bucket empty, refill has not caught up | Deny with detail "bucket exhausted" and next-refill ETA |
| Spend cap set, grant has no cost metadata | Guard returns Err; pipeline denies (fail-closed) |
| System clock jumps backward | Refill step is treated as zero; bucket does not over-refill |
| System clock jumps far forward | Bucket refills up to capacity_milli and saturates there |
| Kernel restart | Buckets are in-memory; restart resets all buckets to full capacity |
| Two concurrent requests race on the same bucket | Mutex serializes refill-and-consume; one wins, one sees the updated balance |
| Policy sets burst_factor below 1.0 | Capacity is still at least one token; the guard does not allow a zero-token bucket |
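
The two clock-jump rows come down to clamping and saturation. The function below is a sketch of that behavior, not the guard's source; refilled_balance and its parameter layout are assumptions.

```rust
/// Compute the post-refill balance in milli-tokens (hypothetical helper).
fn refilled_balance(
    tokens_milli: i64,
    capacity_milli: i64,
    rate_milli_per_window: i64,
    window_ms: i64,
    last_refill_ms: i64,
    now_ms: i64,
) -> i64 {
    // Backward jump: negative elapsed time is clamped to zero credit.
    let elapsed_ms = now_ms.saturating_sub(last_refill_ms).max(0);
    // Forward jump: the product may be huge, so saturate instead of overflowing.
    let credit = elapsed_ms.saturating_mul(rate_milli_per_window) / window_ms;
    // Either way, the balance is capped at the burst ceiling.
    tokens_milli.saturating_add(credit).min(capacity_milli)
}
```

A clock that moves from 10,000 ms back to 4,000 ms leaves the balance unchanged, and a jump to i64::MAX simply fills the bucket to capacity.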

In-memory buckets are ephemeral

Velocity buckets are not persisted. A kernel restart wipes them. In practice this is fine because the window is seconds to minutes and the first few requests after restart re-establish a realistic balance within one window. If you run a cluster of kernels, each kernel maintains its own buckets. The ceiling is effectively multiplied by the number of kernels. Budget the ceilings with this in mind, or push rate limiting to a shared upstream if you need a cluster-global cap.

Worked Example

Suppose an agent holds a capability with max_invocations_per_window = 6, window_secs = 60, and burst_factor = 1.0. The agent fires seven requests in rapid succession.

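| # | Time (ms) | Balance pre (tokens) | Refill credit | Balance post | Verdict |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 6.000 | 0.000 | 5.000 | Allow |
| 2 | 20 | 5.000 | 0.002 | 4.002 | Allow |
| 3 | 40 | 4.002 | 0.002 | 3.004 | Allow |
| 4 | 60 | 3.004 | 0.002 | 2.006 | Allow |
| 5 | 80 | 2.006 | 0.002 | 1.008 | Allow |
| 6 | 100 | 1.008 | 0.002 | 0.010 | Allow |
| 7 | 120 | 0.010 | 0.002 | 0.012 | Deny |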

The seventh call finds the bucket at 0.012 tokens, short of the 1.0 it needs, and is denied. The next allowed request is roughly 10 seconds later (about 9.88 s, the time needed to cover the 0.988-token shortfall at 0.1 tokens per second), when enough milli-tokens have refilled to cross the 1.0 threshold. Notice that the refill step at the 6-per-60s steady rate is computed as (20 ms * 6 * 1000) / 60000 = 2 milli-tokens per 20 ms interval. Multiplying before dividing matters: 6 * 1000 / 60000 on its own would truncate to zero in integer arithmetic, whereas the full product divides exactly.
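
The trace can be reproduced with a short simulation. This is a sketch, not the guard itself: Row, trace, and ms_until_one_token are hypothetical names, burst_factor is fixed at 1.0, and all balances are in milli-tokens.

```rust
/// One row of the trace; balances are in milli-tokens.
#[derive(Debug, PartialEq)]
struct Row {
    time_ms: i64,
    pre: i64,
    credit: i64,
    post: i64,
    allowed: bool,
}

/// Replay one-token requests at the given timestamps against a bucket
/// that starts full (burst_factor = 1.0).
fn trace(max_invocations: i64, window_secs: i64, times_ms: &[i64]) -> Vec<Row> {
    let capacity = max_invocations * 1_000;
    let mut balance = capacity;
    let mut last_ms = times_ms.first().copied().unwrap_or(0);
    let mut rows = Vec::new();
    for &t in times_ms {
        let pre = balance;
        // Multiply before dividing; cap the credit at the remaining headroom.
        let credit = ((t - last_ms) * max_invocations * 1_000 / (window_secs * 1_000))
            .min(capacity - pre);
        last_ms = t;
        let refilled = pre + credit;
        let allowed = refilled >= 1_000;
        balance = if allowed { refilled - 1_000 } else { refilled };
        rows.push(Row { time_ms: t, pre, credit, post: balance, allowed });
    }
    rows
}

/// Milliseconds until the balance reaches one token, rounded up.
fn ms_until_one_token(balance_milli: i64, max_invocations: i64, window_secs: i64) -> i64 {
    let shortfall = (1_000 - balance_milli).max(0);
    let rate = max_invocations * 1_000; // milli-tokens per window
    (shortfall * window_secs * 1_000 + rate - 1) / rate
}
```

Calling trace(6, 60, &[0, 20, 40, 60, 80, 100, 120]) yields seven rows whose last entry has pre = 10, credit = 2, post = 12, and allowed = false, and ms_until_one_token(12, 6, 60) returns 9880, matching the "roughly 10 seconds" figure.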


Cluster Deployments

Velocity buckets live in the kernel process. A chio deployment with three kernel replicas has three independent bucket sets per capability. An agent pinned to a single kernel by session affinity sees the configured ceiling. An agent load-balanced across replicas sees an effective ceiling of three times the configured value.

This is a deliberate design choice. Centralizing rate limits across a cluster requires a shared counter, which adds a hot write and a new failure domain. For most chio workloads the per-replica cap is accurate enough, and global shape can be enforced further upstream (at the ingress, at the agent SDK, or by a central agent-velocity policy applied at the trust-control boundary). If you need true cluster-wide velocity, run a single kernel replica with standby failover, or front the cluster with an ingress that does its own rate shaping.
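
The multiplication effect is simple to model. The sketch below assumes strict round-robin load balancing and ignores refill, so it describes a single burst only; allowed_in_burst is a hypothetical helper and expects at least one replica.

```rust
/// Count how many back-to-back one-token calls are allowed when requests
/// are spread round-robin over `replicas` kernels, each holding its own
/// bucket of `capacity_tokens`. Refill is ignored (single burst).
fn allowed_in_burst(replicas: usize, capacity_tokens: u32, requests: usize) -> usize {
    let mut balances = vec![capacity_tokens; replicas];
    (0..requests)
        .filter(|i| {
            let bucket = &mut balances[i % replicas];
            if *bucket > 0 {
                *bucket -= 1;
                true
            } else {
                false
            }
        })
        .count()
}
```

A burst of 1,000 requests against a 100-token ceiling gets 100 calls through on one replica and 300 on three, which is the three-times effective ceiling described above.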


Testing Velocity Policies

Velocity is easy to test in isolation because the guard is synchronous and deterministic under a monotonic clock. The recommended harness: build a VelocityGuard with a known config, feed it a sequence of requests with controlled timestamps, and assert verdicts. The guard exposes a test-only API for injecting a fake clock so your tests do not depend on wall time.

velocity_test.rs
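use chio_guards::velocity::{VelocityConfig, VelocityGuard};
// FakeClock and ctx() are test helpers whose import paths are not shown here:
// FakeClock is the injectable clock described above, and ctx() is assumed to
// build a request context for the grant under test.

#[test]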
fn bucket_denies_when_empty_and_refills_over_time() {
    let mut clock = FakeClock::new(0);
    let guard = VelocityGuard::with_clock(
        VelocityConfig {
            max_invocations_per_window: Some(3),
            max_spend_per_window: None,
            window_secs: 60,
            burst_factor: 1.0,
        },
        clock.handle(),
    );

    // Drain the bucket.
    assert!(guard.evaluate(&ctx()).unwrap().is_allow());
    assert!(guard.evaluate(&ctx()).unwrap().is_allow());
    assert!(guard.evaluate(&ctx()).unwrap().is_allow());

    // Fourth call is immediate; bucket is empty.
    assert!(guard.evaluate(&ctx()).unwrap().is_deny());

    // Advance 20 seconds: one token should have refilled (3 per 60s).
    clock.advance_ms(20_000);
    assert!(guard.evaluate(&ctx()).unwrap().is_allow());
}

Summary

  • velocity is a token-bucket rate limiter keyed by (capability_id, grant_index), with separate buckets for invocation count and monetary spend.
  • agent-velocity runs the same algorithm keyed on agent identity, catching cross-capability hot-cycling.
  • Bucket math is integer milli-tokens. No floating-point drift.
  • Capacity is round(max_invocations * burst_factor), floored at 1. Refill is continuous at max_invocations / window_secs.
  • Missing cost metadata on a spend-bucketed grant is a fail-closed error, not a silent bypass.
  • Receipts capture bucket balance before and after every check.
  • Buckets are per-replica. Multi-kernel clusters scale the effective ceiling. Plan accordingly or enforce globally upstream.