Differential Tests · Chio Docs

Coverage and limit

Differential testing detects drift between two implementations of the same behavior; it cannot establish that behavior is correct. If both the spec and the production code share the same bug, the differential test will not flag it. The spec is therefore written for clarity, in a different style and with no shared code.

What Differential Testing Does

proptest generates random inputs (or runs a regression seed), runs both implementations, and asserts the outputs are equal. The harness shrinks any failing input toward a minimal example that still produces a divergence; the regression seed is then committed under tests/*.proptest-regressions so subsequent runs exercise it first.

From src/lib.rs:

formal/diff-tests/src/lib.rs

//! Differential testing: executable Chio reference spec vs the production
//! capability structs and the normalized proof-facing AST in chio-kernel-core.
//!
//! This crate is the shipped proof-style release gate for scope attenuation
//! semantics. Lean assets remain advisory until they are root-imported and
//! sorry-free.

pub mod generators;
pub mod spec;

The two top-level modules are:

spec · the executable reference. Mirrors chio_core::capability::* types and reimplements is_subset_of for SpecToolGrant, SpecResourceGrant, SpecPromptGrant, and SpecChioScope without calling into chio-core.
generators · proptest strategies that produce paired (spec, impl) values built from the same source bits. The pair conversion is what guarantees the two implementations see equivalent inputs.

The Harnesses

Six test files under tests/ exercise four proof-relevant surfaces.

File	Behavior	Lines
`scope_diff.rs`	Scope, tool grant, resource grant, prompt grant subsumption; both `ChioScope` and `NormalizedScope`	811
`canonical_json_diff.rs`	RFC 8785 JCS canonicalization against an independently-implemented oracle	616
`browser_canonical_json_diff.rs`	Same canonical-JSON properties, exercised through the wasm-target build	420
`anchored_root.rs`	Anchored Merkle root tuples in deterministic order; Rust vs TypeScript	511
`anchored_root_tamper.rs`	Tampering tests for anchored roots; fail-closed under any single byte flip	344
`receipt_encoding_diff.rs`	Receipt encode/decode roundtrip; production vs spec; shape-preservation under tampering	601

Scope Subsumption

The headline harness pairs the spec's SpecChioScope::is_subset_of against the shipped chio_core::capability::ChioScope::is_subset_of and the normalized proof-facing NormalizedScope::is_subset_of.

formal/diff-tests/tests/scope_diff.rs

/// Core differential test: scope subset produces the same result in spec and impl.
#[test]
fn scope_subset_spec_matches_impl(
    ((spec_a, impl_a), (spec_b, impl_b)) in arb_paired_scope_pair()
) {
    let spec_result = spec_a.is_subset_of(&spec_b);
    let impl_result = impl_a.is_subset_of(&impl_b);

    prop_assert_eq!(
        spec_result, impl_result,
        "Scope subset mismatch! spec: {} impl: {} child grants: {} parent grants: {}",
        spec_result, impl_result, spec_a.grants.len(), spec_b.grants.len()
    );
}

Twelve differential properties run in this file: the four base spec-vs-impl subset checks (scope, tool grant, resource grant, prompt grant), plus eight normalized variants that pair a projection and a subset check for each of the four grant kinds. Each runs 256 cases by default; bump via PROPTEST_CASES:

bash

PROPTEST_CASES=4096 cargo test -p chio-formal-diff-tests scope_diff

The spec source is src/spec.rs with SpecToolGrant::is_subset_of written for clarity. It mirrors the shipped Rust implementation but does not call into it, so a regression in chio_core shows up as a divergence rather than a self-confirming pass.

Canonical JSON

canonical_json_diff.rs cross-checks chio_core::canonical::canonicalize against an independently implemented oracle defined in the same file. The oracle is deliberately small and exercises a subset of production code paths (no f64 ryu shortest-form; the strategy is restricted to integer numbers so the two implementations agree byte-for-byte).

The harness asserts twelve named RFC 8785 properties:

Property	Statement
`idempotence`	`canonicalize(canonicalize(x)) == canonicalize(x)`
`key_sort_utf16`	Object keys sort by UTF-16 code unit order
`no_insignificant_whitespace`	Output has no whitespace outside string literals
`integer_no_decimal_point`	Integer numbers serialize without trailing `.0`
`string_minimal_escaping`	Only RFC 8785 required characters are escaped
`parse_round_trip_equal`	`parse(canonicalize(x))` is semantically equal to `x`
`byte_stable_oracle_match`	Production output matches the independent oracle byte-for-byte
`valid_utf8_output`	Output is always valid UTF-8
`determinism`	Two independent calls produce byte-equal output
`null_bool_literals`	`null`, `true`, `false` serialize exactly
`empty_collections`	`{}` and `[]` serialize exactly
`nan_infinity_rejected`	Non-finite numbers fail canonicalization

The browser variant in browser_canonical_json_diff.rs runs the same properties through wasm-bindgen-test for wasm32 targets, comparing desktop and browser kernel results.

Anchored Roots

anchored_root.rs compares the Rust and TypeScript implementations of anchored Merkle root tuples. Three named tests:

rust_anchored_root_tuples_emit_in_deterministic_order
rust_and_typescript_anchored_root_tuples_match_by_root_bytes_and_proof_structure
rust_anchored_root_tuples_fail_closed_on_invalid_fixture_root

The companion file anchored_root_tamper.rs complements with tampering tests: every single byte flip in the anchored root must produce a verification failure. The implementation must fail closed on any tampering, not silently accept the modified root.

Receipt Encoding

receipt_encoding_diff.rs runs encode/decode roundtrip tests across the receipt body plus tampering tests across each named field. Receipt fields are coupled (capability, request, verdict, policy hash, evidence class) per the proof manifest's P4 property; the differential check is what catches drift between the spec's coupling rule and the production implementation.

Running

Run the manifest gate:

manifest gate

cargo test -p chio-formal-diff-tests

cargo test stdout (passing)

running 14 tests
test scope_subset_spec_matches_impl ... ok
test grant_subset_spec_matches_impl ... ok
test resource_grant_subset_spec_matches_impl ... ok
test prompt_grant_subset_spec_matches_impl ... ok
test normalized_scope_subset_spec_matches_impl ... ok
test canonical_json_idempotence ... ok
test canonical_json_byte_stable_oracle_match ... ok
test rust_anchored_root_tuples_emit_in_deterministic_order ... ok
test receipt_encoding_roundtrip ... ok
... 5 more

test result: ok. 14 passed; 0 failed; 0 ignored

Run a single file:

bash

cargo test -p chio-formal-diff-tests --test scope_diff
cargo test -p chio-formal-diff-tests --test canonical_json_diff
cargo test -p chio-formal-diff-tests --test anchored_root
cargo test -p chio-formal-diff-tests --test receipt_encoding_diff

Run with a higher case count (slower; nightly-style):

bash

PROPTEST_CASES=4096 cargo test -p chio-formal-diff-tests

Run the wasm32 variants:

bash

wasm-pack test --node \
  --features browser \
  formal/diff-tests

Example Passing Run

The full lane runs every test file in formal/diff-tests/tests/ plus any inline #[test] functions in the test files. The names below come straight from the source: scope tests are scope_diff.rs, canonical-JSON tests are canonical_json_diff.rs, and so on.

cargo test -p chio-formal-diff-tests (passing)

     Running unittests src/lib.rs (target/debug/deps/chio_formal_diff_tests-...)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running tests/anchored_root.rs (target/debug/deps/anchored_root-...)

running 4 tests
test rust_anchored_root_tuples_emit_in_deterministic_order ... ok
test rust_canonical_receipt_body_matches_blessed_vector_bytes ... ok
test rust_and_typescript_anchored_root_tuples_match_by_root_bytes_and_proof_structure ... ok
test rust_anchored_root_tuples_fail_closed_on_invalid_fixture_root ... ok

test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running tests/anchored_root_tamper.rs (target/debug/deps/anchored_root_tamper-...)

running 1 test
test rust_anchored_root_tuples_fail_closed_under_single_byte_tamper ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running tests/canonical_json_diff.rs (target/debug/deps/canonical_json_diff-...)

running 11 tests
test idempotence ... ok
test key_sort_utf16 ... ok
test no_insignificant_whitespace ... ok
test integer_no_decimal_point ... ok
test string_minimal_escaping ... ok
test parse_round_trip_equal ... ok
test byte_stable_oracle_match ... ok
test valid_utf8_output ... ok
test determinism ... ok
test null_bool_literals ... ok
test empty_collections ... ok
test nan_infinity_rejected ... ok

test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running tests/receipt_encoding_diff.rs (target/debug/deps/receipt_encoding_diff-...)

running 2 tests
test canonical_receipt_body_is_idempotent ... ok
test canonical_receipt_body_round_trip_is_stable ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running tests/scope_diff.rs (target/debug/deps/scope_diff-...)

running 24 tests
test scope_subset_spec_matches_impl ... ok
test grant_subset_spec_matches_impl ... ok
test resource_grant_subset_spec_matches_impl ... ok
test prompt_grant_subset_spec_matches_impl ... ok
test normalized_scope_projection_matches_spec ... ok
test normalized_scope_subset_spec_matches_impl ... ok
test empty_scope_is_subset ... ok
test subset_transitivity ... ok
... 16 more

test result: ok. 24 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Each proptest! block runs PROPTEST_CASES random pairs (default 256) plus any committed regression seeds. The proptest strategies are defined in src/generators.rs and return paired (spec, impl) values built from the same source bits, so a single random draw drives both implementations through identical input.

What a Divergence Looks Like

When the spec and the production implementation disagree on any case, proptest panics, dumps the failing input, then activates its shrinker. The shrinker repeatedly tries smaller inputs that still trigger the panic, converging on a minimal counterexample. A failing run looks like this:

cargo test stdout (failing scope_diff)

     Running tests/scope_diff.rs (target/debug/deps/scope_diff-...)

running 24 tests
test scope_subset_spec_matches_impl ... FAILED

failures:

---- scope_subset_spec_matches_impl stdout ----
thread 'scope_subset_spec_matches_impl' panicked at tests/scope_diff.rs:46:9:
Test failed: Scope subset mismatch!
  spec: true
  impl: false
  child grants: 1
  parent grants: 1
minimal failing input: ((spec_a, impl_a), (spec_b, impl_b)) =
  (
    (SpecChioScope { grants: [SpecToolGrant {
        server_id: "a",
        tool_name: "t",
        operations: [Invoke],
        constraints: [PathPrefix("/x")],
        max_invocations: None, max_cost_per_invocation: None,
        max_total_cost: None, dpop_required: None,
    }], resource_grants: [], prompt_grants: [] },
     ChioScope { /* paired impl value built from the same bits */ }),
    (SpecChioScope { grants: [SpecToolGrant {
        server_id: "a",
        tool_name: "t",
        operations: [Invoke],
        constraints: [PathPrefix("/x/")],
        ...
    }], resource_grants: [], prompt_grants: [] },
     ChioScope { /* ... */ })
  )

successes: 142
local rejects: 0
global rejects: 0

Reading the trace:

spec: true, impl: false. The reference says the child scope is a subset of the parent; the production code says no. Either side could be wrong.
142 successes before shrink. The first 142 random draws agreed; the 143rd produced the divergence. The shrinker then minimized that divergence to the printed shape.
Path constraint: /x vs /x/. The minimized input pinpoints the disagreement to the path-prefix constraint's trailing-slash handling. That is the bug to fix.

proptest captures the failing input in a regression file under tests/*.proptest-regressions. The format is one seed per line, e.g. cc 9c1f41a9b6... (CC marks a committed regression). That seed is exercised first on every subsequent run, so once the bug is fixed the seed pins the counterexample forever. Triage path:

Reproduce: re-run; the regression seed exercises the failure on the first case.
Classify: spec bug (the spec disagrees with the spec document), implementation bug (the production code drifted), generator bug (the proptest strategy produces an out-of-domain input).
Fix and rerun; when green, commit the regression seed alongside the fix so the test pins the counterexample forever.

Relation to Other Evidence

Differential testing is a release gate for scope-attenuation semantics. The source comment names its status explicitly: in the comment at the top of src/lib.rs: "Lean assets remain advisory until they are root-imported and sorry-free".

The role split:

Aeneas plus Lean · proves properties of the handwritten model and selected extracted helpers under their stated assumptions.
Kani · proves the checks selected Rust functions within configured bounds.
Differential tests · detect drift between two implementations of the same behavior across many random inputs.

These mechanisms cover different models and boundaries. A regression outside all three may still be exercised by fuzzing or integration tests, which provide different coverage rather than a proof.

P1: capability_monotonicity in this tool

P1 says a child capability's bounds are tighter than its parent's. The differential lane catches drift between the executable Lean-style spec and the production Rust by running both implementations of is_subset_of on random paired inputs and asserting they agree. The headline harness is scope_subset_spec_matches_impl in tests/scope_diff.rs; eight further P1-style properties live in the same file. Verbatim from formal/diff-tests/tests/scope_diff.rs:

formal/diff-tests/tests/scope_diff.rs

/// Core differential test: scope subset produces the same result in spec and impl.
#[test]
fn scope_subset_spec_matches_impl(
    ((spec_a, impl_a), (spec_b, impl_b)) in arb_paired_scope_pair()
) {
    let spec_result = spec_a.is_subset_of(&spec_b);
    let impl_result = impl_a.is_subset_of(&impl_b);

    prop_assert_eq!(
        spec_result, impl_result,
        "Scope subset mismatch!\n  spec: {}\n  impl: {}\n  child grants: {}\n  parent grants: {}",
        spec_result, impl_result, spec_a.grants.len(), spec_b.grants.len()
    );
}

The proptest strategy arb_paired_scope_pair() produces two paired (SpecChioScope, ChioScope) values built from the same source bits. Because the pair is built once and split into spec and production sides, the two implementations receive equivalent inputs; if the assertion fails, it identifies a result difference for that generated pair. The default case count is 256 per harness; bump via PROPTEST_CASES=4096 for nightly-style runs.

The same file pins eight P1 attenuation rules as standalone properties (scope_diff.rs:634-811): empty scope is a subset of any scope, scope subset is reflexive, removing a grant produces a subset, removing an operation produces a subset, reducing max_invocations produces a subset, wildcard tool name subsumes any specific tool name, different servers never produce a subset, and subset is transitive across a child <= parent <= grandparent chain. Each is asserted on the spec side independently of the production side, so a regression in either implementation surfaces as a different failure mode: the _spec_matches_impl tests catch drift, and the standalone properties catch the spec drifting from its own intent.

For example, consider a future change to ChioScope::is_subset_of in chio-core-types that accidentally accepts a wider child when parent.max_invocations is None. The Lean spec and the spec-side SpecToolGrant::is_subset_of in src/spec.rs have not changed; the Aeneas extraction has not been re-run yet; the Kani harness happens to use a different concrete shape that does not hit the bug. proptest hands scope_subset_spec_matches_impl an input where the spec returns false and the new production code returns true; the harness fails, proptest shrinks toward a minimal counterexample, and the regression seed is committed under tests/scope_diff.proptest-regressions. The differential lane runs on every PR, so a drift like this is reported during review.

Back to the P1 Tour for the wrap-up across all five stations.

Aeneas Pipeline · the proof-side lane for the same surfaces.
Lean 4 Proofs · the proof structure that the differential tests complement.
Kani Harnesses · the bounded model-checking lane on the same Rust paths.

What Differential Testing Does

The Harnesses

Scope Subsumption

Canonical JSON

Anchored Roots

Receipt Encoding

Running

Example Passing Run

What a Divergence Looks Like

Relation to Other Evidence

P1: capability_monotonicity in this tool

Next