Observability

Chio emits three signal classes: signed receipts (the audit trail), tracing spans (the per-request narrative), and Prometheus metrics (the aggregate behavior). Each answers a different question, so each gets its own pipeline. This page is the deployment-side how-to: collector config, scrape rules, the five Grafana dashboards in deploy/dashboards/, and the alert rules an operator wants on day one.

What this page is not

For the metric family catalog (names, label vocabulary, log fields, cardinality limits) read /docs/guard-platform/observability. This page wires those signals into a stack; it does not re-document them.

The Signal Hierarchy

Three signals, three roles. Operators tend to want all three but for very different reasons.

| Signal | Question it answers | Backend |
|---|---|---|
| Receipts | What did Chio decide for this request, and what signed proof do I have? | Receipt store (SQLite) + SIEM |
| Traces | Where did this single request spend its time, and which guards ran? | Tempo or Jaeger |
| Metrics | Is the deny rate spiking? Are guards exhausting fuel? | Prometheus / OTel metrics |

Receipts are the primary audit artifact. Traces and metrics are operational telemetry. If you only deploy one, deploy receipts.


OTel Collector Wiring

Chio components emit OTLP gRPC traces and Prometheus metrics directly. A standard OpenTelemetry Collector is the right aggregation point for everything except receipts. Receipts go to the receipt store first; the SIEM exporter or the OTel receipt exporter pulls from there.

Chio components emit OTLP traces to an OTel collector and serve Prometheus metrics on /metrics. Receipts land in the SQLite receipt store first; OTel-shaped traces can also be ingested back as receipts via chio-otel-receipt-exporter.

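The collector config below is the minimum that powers the five shipped dashboards. The OTLP receiver accepts traces and logs over gRPC and HTTP, the batch processor groups them before export, and the exporters send traces to Tempo and logs to Loki. The attributes/strip-cardinality processor is defined for pipelines that feed Prometheus. Neither pipeline shown here does, so attach it when you add one, alongside the optional receipt exporter.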

yaml
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

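  # Strip high-cardinality attributes from anything bound for Prometheus.
  # These attributes are safe on spans and logs; they are not safe as
  # Prometheus label values. Not referenced by the pipelines below; add it
  # to the processors list of any pipeline that exports to Prometheus.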
  attributes/strip-cardinality:
    actions:
      - key: gen_ai.tool.call.id
        action: delete
      - key: chio.receipt.id
        action: delete
      - key: chio.replay.run_id
        action: delete

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Cardinality denylist is real

PROMETHEUS_DENIED_ATTRIBUTES is a fixed three-element array. Forwarding any of these attributes as a Prometheus label blows up time-series cardinality and causes metric registration to fail.
crates/chio-otel-receipt-exporter/src/denylist.rs
use chio_kernel::otel::{ATTR_CHIO_RECEIPT_ID, ATTR_GEN_AI_TOOL_CALL_ID};

pub const ATTR_CHIO_REPLAY_RUN_ID: &str = "chio.replay.run_id";

pub const PROMETHEUS_DENIED_ATTRIBUTES: [&str; 3] = [
    ATTR_GEN_AI_TOOL_CALL_ID,
    ATTR_CHIO_RECEIPT_ID,
    ATTR_CHIO_REPLAY_RUN_ID,
];

The exporter calls strip_denied_attributes on every span before signing the receipt body (denylist.rs:21-27), so the canonicalized form is stable across re-export. Ingest the same span twice and you get the same receipt body bytes.
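The stability claim can be illustrated with a small standalone sketch. It is not the crate's code: `strip_denied` and `canonical_bytes` are hypothetical stand-ins (the real `strip_denied_attributes` signature and canonical encoding are not shown on this page), and the attribute map is modeled as a `BTreeMap` purely so the key order is deterministic.

```rust
use std::collections::BTreeMap;

// Mirrors the three names in PROMETHEUS_DENIED_ATTRIBUTES from the listing above.
const DENIED: [&str; 3] = [
    "gen_ai.tool.call.id",
    "chio.receipt.id",
    "chio.replay.run_id",
];

/// Hypothetical stand-in for strip_denied_attributes: drops every denied key.
fn strip_denied(attrs: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    attrs
        .iter()
        .filter(|(key, _)| !DENIED.contains(&key.as_str()))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect()
}

/// Toy canonical form (an assumption): sorted `key=value` pairs joined by newlines.
fn canonical_bytes(attrs: &BTreeMap<String, String>) -> Vec<u8> {
    attrs
        .iter()
        .map(|(key, value)| format!("{key}={value}"))
        .collect::<Vec<_>>()
        .join("\n")
        .into_bytes()
}

fn main() {
    let mut span = BTreeMap::new();
    span.insert("chio.verdict".to_string(), "deny".to_string());
    span.insert("chio.receipt.id".to_string(), "rcpt-123".to_string());
    span.insert("gen_ai.tool.call.id".to_string(), "call-9".to_string());

    let first = canonical_bytes(&strip_denied(&span));
    // Stripping an already-stripped span changes nothing, so the bytes match.
    let second = canonical_bytes(&strip_denied(&strip_denied(&span)));
    assert_eq!(first, second);
    println!("{}", String::from_utf8_lossy(&first));
}
```

Because stripping is idempotent and the map is ordered, the second pass produces the same bytes as the first, which is the property the signed receipt body relies on.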


The Receipt Exporter Crate

chio-otel-receipt-exporter is the bridge from OTLP trace ingress to the Chio receipt store. Each accepted span is sanitized, canonicalized, signed, and appended to the configured ReceiptStore. Sanitizing strips the denied attributes listed above before signing, which is what keeps the canonical form stable across re-export.

rust
use chio_otel_receipt_exporter::{ReceiptStoreSink, ReceiptStoreSinkConfig};

let config = ReceiptStoreSinkConfig {
    signing_keypair,                       // Ed25519 keypair
    policy_hash: "otel-receipt-exporter".into(),
    default_capability_id: "otel-ingress".into(),
    default_tool_server: "otel-collector".into(),
    default_tool_name: "gen_ai.tool.call".into(),
    tenant_id: Some("acme-prod".into()),
};

let sink = ReceiptStoreSink::new(receipt_store, config);
let summary = sink.export_traces(&otlp_export)?;
println!("appended {} receipts", summary.appended_receipts);

The sink is single-purpose: it takes an OtlpGrpcTraceExport, produces one ChioReceipt per span (with ToolCallAction carrying the sanitized attributes), and appends each receipt through the same ReceiptStore trait the kernel uses. Span identity carries through: trace_id and span_id are validated before signing.
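As an illustration of what validating span identity can involve, here is a minimal sketch. The function names and error type are hypothetical and the crate's actual checks may differ. It relies only on the OpenTelemetry rules that a trace id is 16 bytes, a span id is 8 bytes, and an all-zero id is invalid.

```rust
/// Errors for the hypothetical validation step.
#[derive(Debug, PartialEq)]
enum IdError {
    WrongLength { expected: usize, actual: usize },
    AllZero,
}

/// Checks a raw identifier and renders it as lowercase hex.
fn validate_id(raw: &[u8], expected_len: usize) -> Result<String, IdError> {
    if raw.len() != expected_len {
        return Err(IdError::WrongLength {
            expected: expected_len,
            actual: raw.len(),
        });
    }
    // OpenTelemetry treats an id made only of zero bytes as invalid.
    if raw.iter().all(|byte| *byte == 0) {
        return Err(IdError::AllZero);
    }
    Ok(raw.iter().map(|byte| format!("{byte:02x}")).collect())
}

/// Trace ids are 16 bytes on the wire.
fn validate_trace_id(raw: &[u8]) -> Result<String, IdError> {
    validate_id(raw, 16)
}

/// Span ids are 8 bytes on the wire.
fn validate_span_id(raw: &[u8]) -> Result<String, IdError> {
    validate_id(raw, 8)
}

fn main() {
    let trace_id = [0xab_u8; 16];
    let span_id = [0u8, 0, 0, 0, 0, 0, 0, 1];
    println!("{:?}", validate_trace_id(&trace_id));
    println!("{:?}", validate_span_id(&span_id));
    // An all-zero span id is rejected rather than signed.
    println!("{:?}", validate_span_id(&[0u8; 8]));
}
```

Rejecting malformed ids before signing means a receipt never carries a trace reference that a dashboard cannot resolve.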


Trace Propagation

Chio respects W3C Trace Context. When the upstream caller passes a traceparent header, the kernel joins that trace and emits its spans as children. The receipt then carries provenance.otel.trace_id and provenance.otel.span_id as fields, which is what lets the Tempo and Jaeger dashboards pivot from a receipt to a trace.

That linkage is the M10 attribute lock. The required structured fields on every span and log line are:

  • chio.receipt.id
  • chio.tenant.id
  • chio.policy.ref
  • chio.verdict
  • chio.tee.mode
  • chio.deny.reason
  • chio.guard.outcome
  • provenance.otel.trace_id
  • provenance.otel.span_id
  • redaction_pass_id
  • redaction_elapsed_micros
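
To make the traceparent join at the start of this section concrete, the sketch below parses a version-00 W3C header into the two ids that end up in provenance.otel.trace_id and provenance.otel.span_id. It is an illustration only: `TraceParent` and `parse_traceparent` are hypothetical names, and how the kernel actually parses the header is not covered on this page.

```rust
/// Trace and parent span ids extracted from a W3C `traceparent` header.
#[derive(Debug, PartialEq)]
struct TraceParent {
    trace_id: String,
    parent_span_id: String,
    sampled: bool,
}

/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses a version-00 header: `00-<32 hex>-<16 hex>-<2 hex>`.
/// Returns None for malformed input, where a caller would start a new trace.
fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 || parts[0] != "00" {
        return None;
    }
    let (trace_id, span_id, flags) = (parts[1], parts[2], parts[3]);
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(span_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // The spec declares all-zero trace and parent ids invalid.
    if trace_id.bytes().all(|b| b == b'0') || span_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_span_id: span_id.to_string(),
        // Bit 0 of the flags byte is the sampled flag.
        sampled: (flag_bits & 0x01) == 1,
    })
}

fn main() {
    let header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
    match parse_traceparent(header) {
        Some(parent) => println!("join trace {} under span {}", parent.trace_id, parent.parent_span_id),
        None => println!("no valid traceparent; start a new trace"),
    }
}
```

The sketch deliberately accepts only version 00. The specification also defines forward-compatible handling of later versions, which is omitted here.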

Metrics Scrape Config

Chio exposes Prometheus metrics over an HTTP /metrics endpoint. The WASM guard runtime publishes a fixed set of families documented in crates/chio-wasm-guards/src/metrics.rs:

  • chio_guard_eval_duration_seconds histogram, labels guard_id, verdict.
  • chio_guard_fuel_consumed_total counter, label guard_id.
  • chio_guard_verdict_total counter, labels guard_id, verdict.
  • chio_guard_deny_total counter, labels guard_id, reason_class.
  • chio_guard_reload_total counter, labels guard_id, outcome.
  • chio_guard_host_call_duration_seconds histogram, labels guard_id, host_fn.
  • chio_guard_module_bytes gauge, labels guard_id, epoch.

Total label cardinality is capped at MAX_GUARD_METRIC_CARDINALITY = 1024; registering a guard that would push the total past that cap fails closed with E_GUARD_METRIC_CARDINALITY_EXCEEDED.
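
The fail-closed behavior can be pictured with a small sketch. Everything here except the constant name and value is an assumption: `GuardMetricRegistry`, its `register` method, and the way series are counted (one entry per distinct label pair) are hypothetical, and chio-wasm-guards may count cardinality differently.

```rust
use std::collections::HashSet;

const MAX_GUARD_METRIC_CARDINALITY: usize = 1024;

#[derive(Debug, PartialEq)]
enum RegistryError {
    /// Stands in for E_GUARD_METRIC_CARDINALITY_EXCEEDED.
    CardinalityExceeded,
}

/// Hypothetical registry that tracks distinct label sets.
struct GuardMetricRegistry {
    series: HashSet<(String, String)>, // (guard_id, second label value)
    cap: usize,
}

impl GuardMetricRegistry {
    fn new(cap: usize) -> Self {
        Self { series: HashSet::new(), cap }
    }

    /// Registers a label set, refusing new series once the cap is reached.
    fn register(&mut self, guard_id: &str, label: &str) -> Result<(), RegistryError> {
        let key = (guard_id.to_string(), label.to_string());
        if self.series.contains(&key) {
            return Ok(()); // A known series never counts twice.
        }
        if self.series.len() >= self.cap {
            return Err(RegistryError::CardinalityExceeded); // Fail closed.
        }
        self.series.insert(key);
        Ok(())
    }
}

fn main() {
    let mut registry = GuardMetricRegistry::new(MAX_GUARD_METRIC_CARDINALITY);
    for i in 0..MAX_GUARD_METRIC_CARDINALITY {
        registry.register(&format!("guard-{i}"), "allow").unwrap();
    }
    // Re-registering a known series is fine; a new one is rejected.
    assert_eq!(registry.register("guard-0", "allow"), Ok(()));
    assert_eq!(
        registry.register("guard-overflow", "allow"),
        Err(RegistryError::CardinalityExceeded)
    );
    println!("registered {} series", registry.series.len());
}
```

Failing at registration time, rather than silently dropping samples later, keeps the cap visible to whoever deploys the extra guard.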

yaml
# prometheus.yml
scrape_configs:
  - job_name: 'chio'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - chio-edge:9091
          - chio-tee:9091
          - chio-control:9091

The Five Shipped Dashboards

Five Grafana dashboard JSON files ship in deploy/dashboards/. All five expect the M10 attribute lock and use Grafana datasource variables (DS_LOKI, DS_TEMPO, DS_JAEGER) so import works across stacks. The one-command import script lives at the top of deploy/dashboards/README.md.

loki/chio-tee.json

Title: Chio TEE Loki receipt view. Targets Loki. Four panels: TEE records by verdict, TEE mode and verdict density heatmap, raw log stream, and a receipt-to-trace lookup table (top 50 by receipt id, with policy ref, verdict, deny reason). Filterable by receipt id, tenant id, trace id, verdict, and TEE mode.

loki/verdict-drift.json

Title: Chio replay verdict drift. Targets Loki. Filters chio-replay logs to event="verdict_drift" and renders a heatmap of from_verdict × to_verdict, a stacked time series of reason_delta, a top-100 drift records table, and a raw drift log stream.

jaeger/receipt-span-lookup.json

Title: Chio Jaeger receipt span lookup. Targets Jaeger. Pivots from receipt id, trace id, span id, tenant id, and tool name to traces. Four panels: trace search by tags, table of receipt and span lookups, traces-by-verdict time series, and trace-duration time series.

tempo/span-timeline.json

Title: Chio Tempo span timeline. Targets Tempo. The primary panel runs the TraceQL query { span.chio.receipt.id = "$receipt_id" } to fetch every trace that touched a given receipt. Additional panels: a span rows table, a GenAI tool-call rate time series, and a span-duration-by-verdict line chart.

tempo/redaction-latency.json

Title: Chio Tempo redaction latency. Targets Tempo. Tracks redaction-pass latency by redaction_pass_id and chio.guard.outcome. Four panels: p95 latency, latency distribution histogram, redaction pass spans table, and span rate.


Alert Rules

The following alert rules are the operator's minimum kit. All assume Prometheus, the metric families above, and the M10 attribute lock.

Deny Rate Spike

promql
# A 5-minute deny rate that doubles versus the prior hour.
(
  sum by (guard_id) (rate(chio_guard_deny_total[5m]))
)
> on (guard_id)
2 *
(
  sum by (guard_id) (rate(chio_guard_deny_total[1h] offset 5m))
)

Fuel Exhaustion Tail Latency

promql
# p99 guard evaluation duration at or above 250ms, the top of the histogram
# bucket range from EVAL_DURATION_BUCKETS_SECONDS. histogram_quantile never
# returns more than the highest finite bucket bound, so compare with >=.
histogram_quantile(
  0.99,
  sum by (guard_id, le) (
    rate(chio_guard_eval_duration_seconds_bucket[5m])
  )
) >= 0.25
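
One property of histogram_quantile matters when the threshold equals the top bucket bound: if the requested quantile falls in the +Inf bucket, Prometheus reports the highest finite bound rather than anything larger. The sketch below is a simplified re-implementation of that estimate for illustration. The bucket bounds are hypothetical, not the real EVAL_DURATION_BUCKETS_SECONDS values, and `Bucket` is an invented type.

```rust
/// One cumulative histogram bucket: upper bound `le` and cumulative count.
struct Bucket {
    le: f64, // f64::INFINITY for the +Inf bucket
    count: f64,
}

/// Simplified classic-histogram quantile estimate.
/// Assumes buckets are sorted by `le`, bounds are positive, and the last is +Inf.
fn histogram_quantile(q: f64, buckets: &[Bucket]) -> f64 {
    let total = buckets.last().map(|b| b.count).unwrap_or(0.0);
    if buckets.len() < 2 || total == 0.0 {
        return f64::NAN;
    }
    let rank = q * total;
    let last = buckets.len() - 1;
    let idx = buckets.iter().position(|b| b.count >= rank).unwrap_or(last);
    if idx == last {
        // Quantile lands in +Inf: report the highest finite bound.
        return buckets[last - 1].le;
    }
    let (lower, prev) = if idx == 0 {
        (0.0, 0.0)
    } else {
        (buckets[idx - 1].le, buckets[idx - 1].count)
    };
    let upper = buckets[idx].le;
    // Linear interpolation inside the bucket that contains the rank.
    lower + (upper - lower) * (rank - prev) / (buckets[idx].count - prev)
}

fn main() {
    // Hypothetical bounds; most observations are slower than 250ms.
    let slow = [
        Bucket { le: 0.005, count: 100.0 },
        Bucket { le: 0.05, count: 200.0 },
        Bucket { le: 0.25, count: 300.0 },
        Bucket { le: f64::INFINITY, count: 1000.0 },
    ];
    let p99 = histogram_quantile(0.99, &slow);
    println!("p99 = {p99}, > 0.25: {}, >= 0.25: {}", p99 > 0.25, p99 >= 0.25);
}
```

With these inputs the estimate is exactly 0.25 even though 70% of observations exceeded it, so a strict comparison against the top bound would stay silent while an inclusive one fires.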

Hot-Reload Canary Failed

promql
# Any non-zero canary-failure count over five minutes.
sum(
  increase(chio_guard_reload_total{outcome="canary_failed"}[5m])
) > 0

Deny Reason Class Surge

promql
# A surge in a single reason_class (pii, secret, prompt_injection, ...)
sum by (reason_class) (rate(chio_guard_deny_total[5m]))
> 5
unless on (reason_class)
sum by (reason_class) (rate(chio_guard_deny_total[1h] offset 5m)) > 5

Two more alerts come from log queries rather than metrics, so they live in your Loki ruler instead of Prometheus:

  • Verdict drift over threshold. The verdict-drift dashboard query, but as a rule: sum by (chio_policy_ref) (count_over_time({service_name="chio-replay"} | json | event="verdict_drift" [10m])) over a configured threshold per policy ref.
  • Federation peer staleness. Watch for absence: alert when a peer's last heartbeat log line is older than the configured peer freshness window.

SIEM Integration

For long-term retention and security workflow, receipts go to a SIEM rather than a metrics or trace backend. The exporter is independent of the kernel and uses a sequence cursor so it never loses data on retry. Read /docs/deployment/siem-export for the design, configuration, and OCSF field mapping.
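
The sequence-cursor idea can be sketched in a few lines. This is a generic at-least-once pattern under stated assumptions, not the exporter's implementation: `Receipt`, `export_from`, and the in-memory store are hypothetical, and the real design lives in /docs/deployment/siem-export.

```rust
/// Hypothetical receipt identified by a monotonically increasing sequence number.
#[derive(Clone, Debug)]
struct Receipt {
    seq: u64,
}

/// Sends every receipt after `cursor` (store assumed sorted by `seq`) and
/// returns the new cursor. The cursor only moves past a receipt once `send`
/// has succeeded for it, so a failed batch is retried from the same position.
fn export_from<F>(store: &[Receipt], cursor: u64, mut send: F) -> u64
where
    F: FnMut(&Receipt) -> Result<(), String>,
{
    let mut next = cursor;
    for receipt in store.iter().filter(|r| r.seq > cursor) {
        if send(receipt).is_err() {
            break; // Stop at the first failure; later receipts wait for the retry.
        }
        next = receipt.seq;
    }
    next
}

fn main() {
    let store: Vec<Receipt> = (1..=5).map(|seq| Receipt { seq }).collect();
    let mut delivered = Vec::new();
    let mut attempts = 0;

    // First pass: the simulated SIEM rejects the third send.
    let cursor = export_from(&store, 0, |r| {
        attempts += 1;
        if attempts == 3 {
            return Err("503 from SIEM".to_string());
        }
        delivered.push(r.seq);
        Ok(())
    });
    println!("cursor after failure: {cursor}");

    // Retry resumes after the last acknowledged receipt.
    let cursor = export_from(&store, cursor, |r| {
        delivered.push(r.seq);
        Ok(())
    });
    println!("cursor after retry: {cursor}, delivered: {delivered:?}");
}
```

The trade-off of this pattern is that a receipt can be sent twice if the acknowledgment is lost, so the receiving side should deduplicate on the sequence number or receipt id.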


Worked Example: docker-compose

A minimal stack that runs an OTel collector, Prometheus, Loki, Tempo, and Grafana, and imports the five dashboards on startup. Adapt the chio service definition to your deployment shape.

yaml
# docker-compose.yaml
services:
  chio-edge:
    image: chio-sidecar:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: chio-edge
    ports:
      - "9091:9091"      # /metrics

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel-collector.yaml:ro
    ports:
      - "4317:4317"      # OTLP gRPC
      - "4318:4318"      # OTLP HTTP

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml:ro

  grafana:
    image: grafana/grafana:latest
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    ports:
      - "3000:3000"
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml:ro

Once Grafana is up, the dashboard import is one command from the chio repo root:

bash
find deploy/dashboards -name '*.json' -print0 | \
    while IFS= read -r -d '' dashboard; do
      jq -n --argjson dashboard "$(cat "$dashboard")" \
        '{dashboard: $dashboard, overwrite: true}' \
      | curl -fsS -H 'Content-Type: application/json' \
          -X POST http://admin:admin@127.0.0.1:3000/api/dashboards/db -d @- \
          >/dev/null
    done

Confirm imports

After the loop completes, browse to http://localhost:3000 and confirm five dashboards appear under the chio tag. If a panel renders empty, the most likely cause is a missing M10 attribute on your spans or logs; cross-check against deploy/dashboards/README.md.

Related pages

  • Guard-platform Observability for the metric family catalog, label vocabulary, and log-line conventions every chio crate emits.
  • SIEM Export for receipt forwarding to Splunk HEC and Elasticsearch with dead-letter handling.
  • TEE Deployment for the chio-tee Loki dashboard and TEE-specific signals.
  • Receipt Dashboard for the receipt-first operator UI that is separate from Grafana.