Differential Tests
The crate chio-formal-diff-tests runs proptest harnesses that compare an executable reference specification against the production implementation. When the two disagree on any random input, the test fails and shrinks the input toward a minimal counterexample. The crate lives at formal/diff-tests/.
Differential, not exhaustive
What Differential Testing Does
proptest generates random inputs (or runs a regression seed), runs both implementations, and asserts the outputs are equal. The harness shrinks any failing input toward a minimal example that still produces a divergence; the regression seed is then committed under tests/*.proptest-regressions so future runs always exercise it first.
From src/lib.rs:
//! Differential testing: executable Chio reference spec vs the production
//! capability structs and the normalized proof-facing AST in chio-kernel-core.
//!
//! This crate is the shipped proof-style release gate for scope attenuation
//! semantics. Lean assets remain advisory until they are root-imported and
//! sorry-free.
pub mod generators;
pub mod spec;Two top-level modules carry the load:
spec· the executable reference. Mirrorschio_core::capability::*types and reimplementsis_subset_offorSpecToolGrant,SpecResourceGrant,SpecPromptGrant, andSpecChioScopewithout calling intochio-core.generators· proptest strategies that produce paired(spec, impl)values built from the same source bits. The pair conversion is what guarantees the two implementations see equivalent inputs.
The Harnesses
Six test files under tests/ exercise four proof-relevant surfaces.
| File | Surface | Lines |
|---|---|---|
scope_diff.rs | Scope, tool grant, resource grant, prompt grant subsumption; both ChioScope and NormalizedScope | 375 |
canonical_json_diff.rs | RFC 8785 JCS canonicalization against an independently-implemented oracle | 614 |
browser_canonical_json_diff.rs | Same canonical-JSON properties, exercised through the wasm-target build | 336 |
anchored_root.rs | Anchored Merkle root tuples in deterministic order; Rust vs TypeScript | 495 |
anchored_root_tamper.rs | Tampering tests for anchored roots; fail-closed under any single byte flip | 327 |
receipt_encoding_diff.rs | Receipt encode/decode roundtrip; production vs spec; shape-preservation under tampering | 579 |
Scope Subsumption
The headline harness pairs the spec's SpecChioScope::is_subset_of against the shipped chio_core::capability::ChioScope::is_subset_of and the normalized proof-facing NormalizedScope::is_subset_of.
/// Core differential test: scope subset produces the same result in spec and impl.
#[test]
fn scope_subset_spec_matches_impl(
((spec_a, impl_a), (spec_b, impl_b)) in arb_paired_scope_pair()
) {
let spec_result = spec_a.is_subset_of(&spec_b);
let impl_result = impl_a.is_subset_of(&impl_b);
prop_assert_eq!(
spec_result, impl_result,
"Scope subset mismatch! spec: {} impl: {} child grants: {} parent grants: {}",
spec_result, impl_result, spec_a.grants.len(), spec_b.grants.len()
);
}Eight differential properties run in this file: scope subset, tool grant subset, resource grant subset, prompt grant subset, plus the four normalized variants. Each runs 256 cases by default; bump via PROPTEST_CASES:
PROPTEST_CASES=4096 cargo test -p chio-formal-diff-tests scope_diffThe spec source is src/spec.rs with SpecToolGrant::is_subset_of written for clarity. It mirrors the shipped Rust implementation but does not call into it, so a regression in chio_core shows up as a divergence rather than a self-confirming pass.
Canonical JSON
canonical_json_diff.rs cross-checks chio_core::canonical::canonicalize against an independently implemented oracle defined in the same file. The oracle is deliberately small and exercises a subset of production code paths (no f64 ryu shortest-form; the strategy is restricted to integer numbers so the two implementations agree byte-for-byte).
The harness asserts twelve named RFC 8785 properties:
| Property | Statement |
|---|---|
idempotence | canonicalize(canonicalize(x)) == canonicalize(x) |
key_sort_utf16 | Object keys sort by UTF-16 code unit order |
no_insignificant_whitespace | Output has no whitespace outside string literals |
integer_no_decimal_point | Integer numbers serialize without trailing .0 |
string_minimal_escaping | Only RFC 8785 required characters are escaped |
parse_round_trip_equal | parse(canonicalize(x)) is semantically equal to x |
byte_stable_oracle_match | Production output matches the independent oracle byte-for-byte |
valid_utf8_output | Output is always valid UTF-8 |
determinism | Two independent calls produce byte-equal output |
null_bool_literals | null, true, false serialize exactly |
empty_collections | {} and [] serialize exactly |
nan_infinity_rejected | Non-finite numbers fail canonicalization |
The browser variant in browser_canonical_json_diff.rs runs the same properties through wasm-bindgen-test for wasm32 targets so the parity between the desktop and browser kernels is checked end to end.
Anchored Roots
anchored_root.rs compares the Rust and TypeScript implementations of anchored Merkle root tuples. Three named tests:
rust_anchored_root_tuples_emit_in_deterministic_orderrust_and_typescript_anchored_root_tuples_match_by_root_bytes_and_proof_structurerust_anchored_root_tuples_fail_closed_on_invalid_fixture_root
The companion file anchored_root_tamper.rs complements with tampering tests: every single byte flip in the anchored root must produce a verification failure. The implementation must fail closed on any tampering, not silently accept the modified root.
Receipt Encoding
receipt_encoding_diff.rs runs encode/decode roundtrip tests across the receipt body plus tampering tests across each named field. Receipt fields are coupled (capability, request, verdict, policy hash, evidence class) per the proof manifest's P4 property; the differential check is what catches drift between the spec's coupling rule and the production implementation.
Running
Run the full lane (this is the manifest gate):
cargo test -p chio-formal-diff-testsrunning 14 tests
test scope_subset_spec_matches_impl ... ok
test tool_grant_subset_spec_matches_impl ... ok
test resource_grant_subset_spec_matches_impl ... ok
test prompt_grant_subset_spec_matches_impl ... ok
test normalized_scope_subset_spec_matches_impl ... ok
test canonical_json_idempotence ... ok
test canonical_json_byte_stable_oracle_match ... ok
test rust_anchored_root_tuples_emit_in_deterministic_order ... ok
test receipt_encoding_roundtrip ... ok
... 5 more
test result: ok. 14 passed; 0 failed; 0 ignoredRun a single file:
cargo test -p chio-formal-diff-tests --test scope_diff
cargo test -p chio-formal-diff-tests --test canonical_json_diff
cargo test -p chio-formal-diff-tests --test anchored_root
cargo test -p chio-formal-diff-tests --test receipt_encoding_diffRun with a higher case count (slower; nightly-style):
PROPTEST_CASES=4096 cargo test -p chio-formal-diff-testsRun the wasm32 variants:
wasm-pack test --node \
--features browser \
formal/diff-testsAn Actual Passing Run
The full lane runs every test file in formal/diff-tests/tests/ plus any inline #[test] functions in the test files. The names below come straight from the source: scope tests are scope_diff.rs, canonical-JSON tests are canonical_json_diff.rs, and so on.
Running unittests src/lib.rs (target/debug/deps/chio_formal_diff_tests-...)
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running tests/anchored_root.rs (target/debug/deps/anchored_root-...)
running 4 tests
test rust_anchored_root_tuples_emit_in_deterministic_order ... ok
test rust_canonical_receipt_body_matches_blessed_vector_bytes ... ok
test rust_and_typescript_anchored_root_tuples_match_by_root_bytes_and_proof_structure ... ok
test rust_anchored_root_tuples_fail_closed_on_invalid_fixture_root ... ok
test result: ok. 4 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running tests/anchored_root_tamper.rs (target/debug/deps/anchored_root_tamper-...)
running 1 test
test rust_anchored_root_tuples_fail_closed_under_single_byte_tamper ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running tests/canonical_json_diff.rs (target/debug/deps/canonical_json_diff-...)
running 11 tests
test idempotence ... ok
test key_sort_utf16 ... ok
test no_insignificant_whitespace ... ok
test integer_no_decimal_point ... ok
test string_minimal_escaping ... ok
test parse_round_trip_equal ... ok
test byte_stable_oracle_match ... ok
test valid_utf8_output ... ok
test determinism ... ok
test null_bool_literals ... ok
test empty_collections ... ok
test nan_infinity_rejected ... ok
test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running tests/receipt_encoding_diff.rs (target/debug/deps/receipt_encoding_diff-...)
running 2 tests
test canonical_receipt_body_is_idempotent ... ok
test canonical_receipt_body_round_trip_is_stable ... ok
test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running tests/scope_diff.rs (target/debug/deps/scope_diff-...)
running 6 tests
test scope_subset_spec_matches_impl ... ok
test grant_subset_spec_matches_impl ... ok
test resource_grant_subset_spec_matches_impl ... ok
test prompt_grant_subset_spec_matches_impl ... ok
test normalized_scope_projection_matches_spec ... ok
test normalized_scope_subset_spec_matches_impl ... ok
test result: ok. 6 passed; 0 failed; 0 ignored; 0 measured; 0 filtered outEach proptest! block runs PROPTEST_CASES random pairs (default 256) plus any committed regression seeds. The proptest strategies are defined in src/generators.rs and return paired (spec, impl) values built from the same source bits, so a single random draw drives both implementations through identical input.
What a Divergence Looks Like
When the spec and the production implementation disagree on any case, proptest panics, dumps the failing input, then activates its shrinker. The shrinker repeatedly tries smaller inputs that still trigger the panic, converging on a minimal counterexample. A failing run looks like this:
Running tests/scope_diff.rs (target/debug/deps/scope_diff-...)
running 6 tests
test scope_subset_spec_matches_impl ... FAILED
failures:
---- scope_subset_spec_matches_impl stdout ----
thread 'scope_subset_spec_matches_impl' panicked at tests/scope_diff.rs:46:9:
Test failed: Scope subset mismatch!
spec: true
impl: false
child grants: 1
parent grants: 1
minimal failing input: ((spec_a, impl_a), (spec_b, impl_b)) =
(
(SpecChioScope { grants: [SpecToolGrant {
server_id: "a",
tool_name: "t",
operations: [Invoke],
constraints: [PathPrefix("/x")],
max_invocations: None, max_cost_per_invocation: None,
max_total_cost: None, dpop_required: None,
}], resource_grants: [], prompt_grants: [] },
ChioScope { /* paired impl value built from the same bits */ }),
(SpecChioScope { grants: [SpecToolGrant {
server_id: "a",
tool_name: "t",
operations: [Invoke],
constraints: [PathPrefix("/x/")],
...
}], resource_grants: [], prompt_grants: [] },
ChioScope { /* ... */ })
)
successes: 142
local rejects: 0
global rejects: 0Reading the trace:
- spec: true, impl: false. The reference says the child scope is a subset of the parent; the production code says no. Either side could be wrong.
- 142 successes before shrink. The first 142 random draws agreed; the 143rd produced the divergence. The shrinker then minimized that divergence to the printed shape.
- Path constraint:
/xvs/x/. The minimized input pinpoints the disagreement to the path-prefix constraint's trailing-slash handling. That is the bug to fix.
proptest captures the failing input in a regression file under tests/*.proptest-regressions. The format is one seed per line, e.g. cc 9c1f41a9b6... (CC marks a committed regression). That seed is exercised first on every subsequent run, so once the bug is fixed the seed pins the counterexample forever. Triage path:
- Reproduce: re-run; the regression seed exercises the failure on the first case.
- Classify: spec bug (the spec disagrees with the spec document), implementation bug (the production code drifted), generator bug (the proptest strategy produces an out-of-domain input).
- Fix and rerun; when green, commit the regression seed alongside the fix so the test pins the counterexample forever.
How It Relates to the Proofs
Differential testing sits at a different point on the assurance pyramid than the formal lanes. It is the shipped proof-style release gate for scope attenuation semantics, named explicitly in the comment at the top of src/lib.rs: "Lean assets remain advisory until they are root-imported and sorry-free".
The role split:
- Aeneas plus Lean · proves the bounded model is correct.
- Kani · proves the production Rust matches the model on bounded inputs.
- Differential tests · catch drift between two implementations of the same surface across many random inputs.
None of these replaces the others. A regression that all three layers miss is, by construction, not in any of their domains; the next layer down (fuzz, integration tests) would catch it, with weaker but broader coverage.
P1: capability_monotonicity in this tool
P1 says a child capability's bounds are tighter than its parent's. The differential lane catches drift between the executable Lean-style spec and the production Rust by running both implementations of is_subset_of on random paired inputs and asserting they agree. The headline harness is scope_subset_spec_matches_impl in tests/scope_diff.rs; eight further P1-style properties live in the same file. Verbatim from formal/diff-tests/tests/scope_diff.rs:
/// Core differential test: scope subset produces the same result in spec and impl.
#[test]
fn scope_subset_spec_matches_impl(
((spec_a, impl_a), (spec_b, impl_b)) in arb_paired_scope_pair()
) {
let spec_result = spec_a.is_subset_of(&spec_b);
let impl_result = impl_a.is_subset_of(&impl_b);
prop_assert_eq!(
spec_result, impl_result,
"Scope subset mismatch!\n spec: {}\n impl: {}\n child grants: {}\n parent grants: {}",
spec_result, impl_result, spec_a.grants.len(), spec_b.grants.len()
);
}The proptest strategy arb_paired_scope_pair() produces two paired (SpecChioScope, ChioScope) values built from the same source bits. Because the pair is built once and split into spec and production sides, the two implementations always see equivalent inputs; if the assertion fails, the divergence is real and not an artifact of randomness. The default case count is 256 per harness; bump via PROPTEST_CASES=4096 for nightly-style runs.
The same file pins eight P1 attenuation rules as standalone properties (scope_diff.rs:196-374): empty scope is a subset of any scope, scope subset is reflexive, removing a grant produces a subset, removing an operation produces a subset, reducing max_invocations produces a subset, wildcard tool name subsumes any specific tool name, different servers never produce a subset, and subset is transitive across a child <= parent <= grandparent chain. Each is asserted on the spec side independently of the production side, so a regression in either implementation surfaces as a different failure mode: the _spec_matches_impl tests catch drift, and the standalone properties catch the spec drifting from its own intent.
How this catches drift in practice: imagine a future change to ChioScope::is_subset_of in chio-core-types that accidentally accepts a wider child when parent.max_invocations is None. The Lean spec and the spec-side SpecToolGrant::is_subset_of in src/spec.rs have not changed; the Aeneas extraction has not been re-run yet; the Kani harness happens to use a different concrete shape that does not hit the bug. proptest hands scope_subset_spec_matches_impl an input where the spec returns false and the new production code returns true; the harness fails, proptest shrinks toward a minimal counterexample, and the regression seed is committed under tests/scope_diff.proptest-regressions. The differential lane runs on every PR, so a drift like this is flagged at review time rather than after a release.
Back to the P1 Tour for the wrap-up across all five stations.
Next
- Aeneas Pipeline · the proof-side lane for the same surfaces.
- Lean 4 Proofs · the proof structure that the differential tests complement.
- Kani Harnesses · the bounded model-checking lane on the same Rust paths.