Backup & Disaster Recovery

Chio is a system of record for signed receipts. Lose the receipt store and you lose the audit trail. Lose custody of the signing key and you lose the ability to mint new receipts under the same identity. Lose the capability authority and your edge kernels revert to default-deny. This page covers what to back up, in which tier, and how the Merkle checkpoint chain stitches a restored store back to the rest of the world.

Source

Verified against crates/chio-store-sqlite/src/receipt_store/bootstrap.rs, crates/chio-store-sqlite/src/receipt_store/evidence_retention.rs, crates/chio-kernel/src/checkpoint.rs, and crates/chio-control-plane/src/lib.rs.

What Needs Backing Up

Six durable surfaces. Treat each one as a separate backup target with its own RPO and retention policy.

| Surface | Storage | Why it matters |
| --- | --- | --- |
| Receipt store | SQLite, WAL mode | The signed audit log. Append-only, sequence-indexed. |
| Kernel signing key | Seed file, KMS, HSM, or TEE-bound | Without it, no further receipts under the current identity. |
| Trusted issuer set | Env / config, replicated by control plane | Determines which capabilities verify after restore. |
| Capability authority state | auth.sqlite in trust-control nodes | Issued capabilities, lineage, and revocation list. |
| Federation peer set | Pinned in federation config | Determines whose checkpoints and revocation deltas the kernel honors. |
| Policy / HushSpec files | Filesystem, version control | Reproducing kernel behavior requires the exact policy bytes. |

Backup Tiers

Match the backup mechanism to how quickly you must recover.

Hot: continuous receipt replication

For an HA trust-control deployment, follower nodes already replicate the receipt store and authority store on a short interval (default --cluster-sync-interval-ms 500). A follower is, in effect, a hot backup with sub-second lag and no separate ingestion path. Promote a follower if the leader fails. See /docs/deployment/trust-control-plane.

Warm: SQLite snapshots + WAL

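The receipt store opens with PRAGMA journal_mode = WAL and PRAGMA synchronous = FULL (from receipt_store/bootstrap.rs). Warm backups use SQLite's online backup API to copy a consistent snapshot of the database, including committed content that still sits in the WAL. SQLite has no built-in WAL replay, so point-in-time recovery between snapshots requires continuously archiving WAL frames with an external replication tool; without one, the RPO is the snapshot cadence.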

bash
# Online consistent snapshot (no need to stop the writer).
$ sqlite3 /var/lib/chio/receipts.sqlite \
    ".backup '/backups/receipts-$(date -u +%Y%m%dT%H%M%SZ).sqlite'"

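# Verify each snapshot is intact. Loop over the files: sqlite3 opens one
# database per invocation, so a bare glob would pass extra filenames as SQL.
$ for f in /backups/receipts-*.sqlite; do sqlite3 "$f" "PRAGMA integrity_check;"; done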

Snapshot the WAL too if needed

SQLite's .backup command produces a self-contained snapshot that already accounts for the WAL. If you choose instead to copy the database file at the filesystem level, copy the -wal and -shm sidecar files too, and consider running PRAGMA wal_checkpoint(TRUNCATE) to fold the WAL into the main file before the copy.
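If you do take the filesystem-copy route, the steps might look like the sketch below. The function name and paths are illustrative, not part of Chio, and the sketch assumes the writer is stopped or quiesced for the duration of the copy; a raw file copy taken while writes are in flight is not guaranteed to be consistent.

```bash
# Hypothetical helper: copy a SQLite database together with its sidecar files.
# Assumes no writer is active between the checkpoint and the copy.
copy_sqlite_with_sidecars() {
    src="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir"
    # Fold the WAL into the main file first so little or nothing is left in it.
    sqlite3 "$src" "PRAGMA wal_checkpoint(TRUNCATE);" >/dev/null
    cp -p "$src" "$dest_dir/"
    # Sidecars can be absent after a clean checkpoint and close; copy if present.
    for suffix in -wal -shm; do
        if [ -e "$src$suffix" ]; then
            cp -p "$src$suffix" "$dest_dir/"
        fi
    done
}
```

Usage: copy_sqlite_with_sidecars /var/lib/chio/receipts.sqlite /backups/cold-copy. Prefer .backup whenever the writer cannot be paused.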

Cold: encrypted archive

For long-term retention, encrypt the snapshot at the application layer and ship it to object storage with versioning enabled. Cold tier accepts a longer RPO; the typical pattern is a nightly encrypted snapshot on a 90-day or 365-day retention schedule.


Receipt Store Mechanics

The receipt store is a single SQLite database file (per kernel node, or replicated across trust-control followers). The bootstrap PRAGMAs come straight from receipt_store/bootstrap.rs:

sql
PRAGMA journal_mode = WAL;
PRAGMA synchronous = FULL;
PRAGMA busy_timeout = 5000;

Implications for backup:

  • WAL-mode reads do not block writes. Online backup is safe even at peak write rate.
  • synchronous = FULL means committed receipts are durable on disk before the kernel acks the operation. A warm backup that includes the WAL captures every receipt the kernel has acknowledged.
  • Page count and page size are PRAGMA-readable. The retention path uses PRAGMA page_count and PRAGMA page_size to compute the database size without a filesystem stat (the result stays consistent in WAL mode). Useful for capacity-planning a backup volume.
  • Retention archives. The evidence retention path detaches archived receipts and runs PRAGMA wal_checkpoint(TRUNCATE) after the detach. If you back up immediately after a retention pass, you capture a freshly checkpointed file.
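For capacity planning, the same two PRAGMAs can be read from the shell. The sketch below is an illustration only: both function names are hypothetical, and the sizing formula (snapshot count times database size, plus a headroom percentage) is an assumed rule of thumb rather than anything Chio prescribes.

```bash
# Hypothetical helper: logical database size in bytes from SQLite's own counters.
sqlite_db_bytes() {
    pages="$(sqlite3 "$1" 'PRAGMA page_count;')"
    page_size="$(sqlite3 "$1" 'PRAGMA page_size;')"
    echo $((pages * page_size))
}

# Hypothetical sizing rule: db_bytes * retained_snapshots, plus headroom_pct percent.
backup_volume_bytes() {
    db_bytes="$1"
    retained_snapshots="$2"
    headroom_pct="$3"
    echo $((db_bytes * retained_snapshots * (100 + headroom_pct) / 100))
}
```

For example, backup_volume_bytes "$(sqlite_db_bytes /var/lib/chio/receipts.sqlite)" 168 20 estimates the space for a week of uncompressed hourly snapshots with 20% headroom.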

Merkle Checkpoints as DR Artifacts

The kernel periodically batches receipts into a KernelCheckpointBody (schema chio.checkpoint_statement.v1) that commits to a Merkle root over the batch. Each checkpoint signs:

  • a monotonic checkpoint_seq;
  • the receipt sequence range it covers (batch_start_seq through batch_end_seq);
  • the Merkle tree_size and merkle_root;
  • the issued_at Unix timestamp;
  • the kernel public key in use; and
  • when this checkpoint extends a prior batch, the SHA-256 of the immediately preceding checkpoint body in previous_checkpoint_sha256.

That last field is the DR-critical one. Because every checkpoint after the first commits to its predecessor, the chain of checkpoints is itself an integrity proof. A relying party that holds an old anchored checkpoint can verify that any later checkpoint extends it, and that any individual receipt in the covered range was part of the batch (via ReceiptInclusionProof).

Figure: Each kernel checkpoint commits to a Merkle root over a batch of receipts and to the SHA-256 of the previous checkpoint. A restored receipt store re-stitches to the chain by emitting the next checkpoint after restore, with previous_checkpoint_sha256 referencing the last pre-failure checkpoint.

After a restore from snapshot, the kernel emits its next checkpoint with previous_checkpoint_sha256 referencing the last pre-failure checkpoint. Verifiers see one unbroken chain. Receipts produced between the last checkpoint and the snapshot point are covered by the next checkpoint that runs after restore.
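The hash-chaining check a verifier performs can be sketched in a few lines of shell. The on-disk layout here is entirely hypothetical (Chio's real export format and canonical body encoding are not shown on this page): one NNNN.body file per checkpoint holding the canonical body bytes, and one NNNN.prev file holding the hex previous_checkpoint_sha256 that the checkpoint claims. The first file is treated as the trusted, anchored checkpoint.

```bash
# Hypothetical layout: DIR holds NNNN.body and NNNN.prev per checkpoint,
# named by zero-padded checkpoint_seq so the glob sorts in chain order.
verify_checkpoint_chain() {
    dir="$1"
    expected_prev=""
    first=1
    for body in "$dir"/*.body; do
        [ -e "$body" ] || { echo "no checkpoints in $dir" >&2; return 1; }
        claimed_prev="$(cat "${body%.body}.prev")"
        # The first checkpoint is the trusted anchor; every later one must
        # commit to the SHA-256 of its immediate predecessor's body.
        if [ "$first" -eq 0 ] && [ "$claimed_prev" != "$expected_prev" ]; then
            echo "chain break at $body" >&2
            return 1
        fi
        first=0
        expected_prev="$(sha256sum "$body" | cut -d' ' -f1)"
    done
    echo "chain ok"
}
```

This checks only the predecessor links. A real verifier would also check each checkpoint's signature and the receipt inclusion proofs, which this sketch leaves out.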


Federation-Aware Recovery

A federated chio kernel pins peer public keys and accepts revocation deltas and checkpoint statements from those peers. After a restore, the federation has to re-establish three things:

  • Re-handshake. The restored node announces its current checkpoint head; peers announce theirs; both sides agree on a fresh exchange schedule.
  • Re-pin keys. The pinned peer key set is part of config. Confirm it matches the peer's current advertised public key before accepting new statements.
  • Catch up on revocation. Pull the peer's revocation log from the last-known-acknowledged offset forward. Until catch-up completes, the kernel runs in a bounded-staleness mode where newly issued capabilities still verify but recently revoked tokens may not yet be filtered.

Capability Authority Recovery

The trust-control authority store (auth.sqlite) holds issued capabilities, their lineage chain, and the active revocation list. After a restore the kernel runs in a bounded-staleness window for revocations: the restored revocation list is current as of the snapshot time, plus any deltas pulled during catch-up.

  • Revocation freshness budget. Plan for the staleness window when sizing your snapshot cadence. A 24-hour-old snapshot means up to 24 hours of revocation deltas need to be replayed from peers or from the control-plane log before the kernel is fully current.
  • Capability lineage is preserved. Issued capabilities and their delegation lineage are persisted in the same database, so restored capabilities continue to verify against the trusted issuer key set.
  • Re-issue is rare. Capability tokens that were valid at snapshot time stay valid after restore until they expire or get revoked. You almost never re-issue as part of DR.
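The freshness budget above is simple arithmetic: the replay window equals the age of the snapshot at restore time. A minimal sketch, with a hypothetical function name and a budget value you would choose yourself:

```bash
# Hypothetical check: does a snapshot's age fit the revocation replay budget?
# Arguments: snapshot time and current time as Unix timestamps, budget in seconds.
revocation_window_ok() {
    snapshot_epoch="$1"
    now_epoch="$2"
    budget_s="$3"
    window_s=$((now_epoch - snapshot_epoch))
    echo "revocation replay window: ${window_s}s (budget ${budget_s}s)"
    # Succeeds only when the deltas to replay fit inside the budget.
    [ "$window_s" -le "$budget_s" ]
}
```

With a 24-hour-old snapshot the window is 86400 seconds, so the check passes only if the budget is at least a full day.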

Recovery Objectives

Realistic numbers per tier. RPO is data loss budget; RTO is time-to-running.

| Tier | RPO target | RTO target |
| --- | --- | --- |
| Hot (HA follower promote) | Sub-second | Seconds (failover) |
| Warm (online snapshot) | Minutes (snapshot cadence) | Single-digit minutes |
| Cold (encrypted archive) | Hours to a day | Tens of minutes to hours |

These are starting points. Validate them with restore drills.
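Between drills, a cheap way to notice a missed RPO is to check that a sufficiently recent snapshot exists. The sketch below assumes snapshots land in one local directory under the receipts-*.sqlite naming used on this page; the function name is hypothetical, and -mmin is a GNU/BSD find extension rather than strict POSIX.

```bash
# Hypothetical check: is there a snapshot newer than the RPO, in minutes?
snapshot_within_rpo() {
    dir="$1"
    rpo_minutes="$2"
    # -mmin -N matches files modified less than N minutes ago.
    fresh="$(find "$dir" -name 'receipts-*.sqlite' -mmin "-$rpo_minutes" | head -n 1)"
    [ -n "$fresh" ]
}
```

For example, snapshot_within_rpo /backups 15 || echo "warm-tier RPO exceeded" fits a monitoring cron job.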


Runbook

Backup Verification

A snapshot you have not verified is a snapshot that does not exist. Run integrity checks on every snapshot at write time and a smoke restore on a sampled cadence.

bash
# Per-snapshot integrity check.
$ sqlite3 "$SNAPSHOT" "PRAGMA integrity_check;"
ok

# Record the highest receipt sequence; it must be non-decreasing across consecutive snapshots.
$ sqlite3 "$SNAPSHOT" "SELECT MAX(seq) FROM chio_tool_receipts;"
1834720

# Verify the latest checkpoint chain head.
$ sqlite3 "$SNAPSHOT" \
    "SELECT MAX(checkpoint_seq) FROM kernel_checkpoints;"
4928
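The non-decreasing check can be automated by comparing two consecutive snapshots. This is a sketch with a hypothetical function name; it relies only on the chio_tool_receipts table and seq column used in the query above.

```bash
# Hypothetical helper: fail if the newer snapshot has a lower max receipt seq.
# Arguments: path to the older snapshot, then path to the newer one.
assert_seq_non_decreasing() {
    older="$(sqlite3 "$1" 'SELECT COALESCE(MAX(seq), 0) FROM chio_tool_receipts;')"
    newer="$(sqlite3 "$2" 'SELECT COALESCE(MAX(seq), 0) FROM chio_tool_receipts;')"
    if [ "$newer" -lt "$older" ]; then
        echo "receipt seq regressed: $older -> $newer" >&2
        return 1
    fi
    echo "ok: $older -> $newer"
}
```

A regression usually means a snapshot was taken from the wrong node or restored from an older copy, so treat it as an alert rather than a warning.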

Restore Drill Cadence

Quarterly is the minimum sane cadence. Monthly is better for any deployment where receipts have audit weight.

  1. Pick a recent snapshot from the backup tier you are testing.
  2. Restore into an isolated environment and bring up a chio kernel against the restored receipt store.
  3. Issue one signed receipt against the restored kernel; confirm the next checkpoint extends the pre-failure chain via previous_checkpoint_sha256.
  4. Pull the peer revocation deltas from a federation partner if this is a federated deployment; confirm the staleness window closes within the documented budget.
  5. Tear down the drill environment and capture metrics: RTO, steps that surprised you, and any drift between runbook and reality.
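The storage half of the drill (steps 1 and 2, plus timing for step 5) can be scripted. The sketch below is illustrative: the function name is hypothetical, it covers only the restore and integrity check, and it does not start a kernel or issue a receipt, because those commands are deployment-specific.

```bash
# Hypothetical drill helper: restore a snapshot into a scratch directory,
# verify it, and report how long the restore took.
drill_restore() {
    snapshot="$1"
    scratch="$(mktemp -d)"
    start="$(date +%s)"
    cp "$snapshot" "$scratch/receipts.sqlite"
    result="$(sqlite3 "$scratch/receipts.sqlite" 'PRAGMA integrity_check;')"
    if [ "$result" != "ok" ]; then
        echo "integrity check failed for $snapshot" >&2
        return 1
    fi
    end="$(date +%s)"
    echo "restored to $scratch in $((end - start))s"
}
```

The elapsed time printed here is only the storage portion of RTO; add kernel start-up and federation catch-up when you record the drill.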

Snapshot Retention

A common starting point: hourly snapshots retained 7 days, daily snapshots retained 90 days, monthly snapshots retained 7 years. Tune to your audit obligation: the Chio receipt log is often the document of record for AI-mediated decisions, so external rules will usually govern retention.
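That schedule can be expressed as a small policy check for a pruning job. The sketch assumes each snapshot is already tagged as hourly, daily, or monthly and that its age is known in whole days; the function name is hypothetical and the limits simply mirror the starting point above.

```bash
# Hypothetical policy check mirroring the tiers above.
# Arguments: kind (hourly | daily | monthly) and age in whole days.
should_retain() {
    kind="$1"
    age_days="$2"
    case "$kind" in
        hourly)  limit_days=7 ;;
        daily)   limit_days=90 ;;
        monthly) limit_days=$((7 * 365)) ;;
        *) echo "unknown snapshot kind: $kind" >&2; return 2 ;;
    esac
    # Succeeds while the snapshot is still inside its retention window.
    [ "$age_days" -le "$limit_days" ]
}
```

A pruning job would delete a snapshot only when should_retain fails with status 1, and treat status 2 (unknown kind) as a reason to stop rather than delete.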


Encryption At Rest

Three layers of encryption are common in production:

  • Storage-layer encryption. KMS-backed encrypted disk volumes (EBS, Persistent Disk, Managed Disks) cover the receipt store at rest with no chio configuration.
  • Application-layer encryption. The encrypted_blob module in chio-store-sqlite is the helper for tenant-isolated encrypted blobs inside the store, with per-tenant key derivation. Useful for fields where the host database operator should not be in the trust envelope.
  • Backup-layer encryption. Encrypt snapshots at write time with a separate KMS key before shipping them off-host. This keeps the backup-storage provider out of the trust envelope.

For high-sensitivity tenants, double-encrypt: storage + backup with separate KMS keys, and (optionally) per-tenant application-layer encryption on top.


Multi-Region

Two patterns scale well:

  • Async cross-region replication. A regional follower node in trust-control replicates from the home-region leader on the cluster sync interval. Latency adds to the RPO; the follower is read-only outside its home region until promoted. Best when one region is clearly primary.
  • Per-region kernels with federation. Each region runs an independent kernel with its own receipt store, signing key, and capability authority. Federation links the regions so receipts and revocations cross-verify. Best when each region is fully independent and the unit of economic settlement is per-region.

The trade-off is cost vs RPO. Async replication is cheaper but leaves a window where the disaster region's most recent receipts have not reached the standby; per-region kernels with federation pay the full per-region storage cost but lose only cross-region traffic on a regional outage.


Worked Example

A realistic small-scale deployment: nightly encrypted snapshot to S3, weekly anchored Merkle checkpoint, 90-day retention, quarterly restore drill.

bash
#!/usr/bin/env bash
# /usr/local/bin/chio-backup.sh - run nightly via cron
set -euo pipefail

DB="/var/lib/chio/receipts.sqlite"
BUCKET="s3://chio-backups/prod"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
SNAP="/tmp/receipts-$STAMP.sqlite"

# Remove the unencrypted local copy on any exit, including a failed upload.
trap 'rm -f "$SNAP"' EXIT

# 1. Online snapshot with consistency check.
sqlite3 "$DB" ".backup '$SNAP'"
sqlite3 "$SNAP" "PRAGMA integrity_check;" | grep -qx "ok"

# 2. Upload with SSE-KMS: S3 encrypts the object server-side under this key.
aws s3 cp "$SNAP" "$BUCKET/receipts-$STAMP.sqlite" \
    --sse aws:kms \
    --sse-kms-key-id "$CHIO_BACKUP_KMS_KEY_ID"

# 3. Drop the local copy (the EXIT trap covers the failure paths).
rm -f "$SNAP"

# 4. Apply the 90-day retention rule via S3 lifecycle policy (configured
#    once per bucket; not part of the per-night script).
echo "ok: $STAMP"

bash
# /usr/local/bin/chio-checkpoint-anchor.sh - run weekly
set -euo pipefail

ANCHOR="/var/lib/chio/anchors/checkpoint-$(date -u +%Y%m%dT%H%M%SZ).json"

# 1. Force a fresh checkpoint at the kernel.
chio admin checkpoint create

# 2. Export the latest checkpoint statement for external anchoring.
chio admin checkpoint export --latest > "$ANCHOR"

# 3. Anchor the checkpoint root in the trust anchor of choice
#    (Sigstore Rekor, internal transparency log, or S3 with object lock).
#    Pass the exact file written in step 2: a quoted glob is never expanded
#    by the shell, and an unquoted one would match every earlier anchor.
chio anchor publish --target rekor "$ANCHOR"

Restore drill quarterly

Quarterly, take the most recent S3 snapshot, restore into a clean kernel, issue a single test receipt, and confirm the next checkpoint chains to the last pre-failure checkpoint. Record RTO and any deviations from the runbook.

Related

  • Secrets & Signing Keys for key custody patterns and rotation, including how the checkpoint chain crosses a key rotation boundary.
  • Observability for monitoring backup health: snapshot age, restore drill metrics, and federation peer freshness.
  • Trust Control Plane for the HA follower model that constitutes the hot tier.
  • Receipts for the receipt schema and how the canonical body is signed.
  • Bilateral Receipts for cross-organization receipt continuity, where the Merkle checkpoint chain is the cross-party integrity proof.