Observability
Chio emits three signal classes: signed receipts (the audit trail), tracing spans (the per-request narrative), and Prometheus metrics (the aggregate behavior). Each answers a different question, so each gets its own pipeline. This page is the deployment-side how-to: collector config, scrape rules, the five Grafana dashboards in deploy/dashboards/, and the alert rules an operator wants on day one.
What this page is not
The Signal Hierarchy
Three signals, three roles. Operators tend to want all three but for very different reasons.
| Signal | Question it answers | Backend |
|---|---|---|
| Receipts | What did chio decide for this request, and what signed proof do I have? | Receipt store (SQLite) + SIEM |
| Traces | Where did this single request spend its time, and which guards ran? | Tempo or Jaeger |
| Metrics | Is the deny rate spiking? Are guards exhausting fuel? | Prometheus / OTel metrics |
Receipts are the primary audit artifact. Traces and metrics are operational telemetry. If you only deploy one, deploy receipts.
OTel Collector Wiring
Chio components emit OTLP gRPC traces and Prometheus metrics directly. A standard OpenTelemetry Collector is the right aggregation point for everything except receipts. Receipts go to the receipt store first; the SIEM exporter or the OTel receipt exporter pulls from there.
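The collector config below is the minimum that powers the five shipped dashboards. The OTLP receiver accepts traces and logs; the attributes/strip-cardinality processor defines the deletions to apply to anything bound for Prometheus; exporters fan out to Tempo and Loki, with the receipt exporter as an optional addition. The trace and log pipelines shown run only the batch processor, so add attributes/strip-cardinality to any pipeline you route to Prometheus.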
# otel-collector.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  # Strip high-cardinality attributes from anything bound for Prometheus.
  # These attributes are safe on spans and logs; they are not safe as
  # Prometheus label values.
  attributes/strip-cardinality:
    actions:
      - key: gen_ai.tool.call.id
        action: delete
      - key: chio.receipt.id
        action: delete
      - key: chio.replay.run_id
        action: delete
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Cardinality denylist is real
PROMETHEUS_DENIED_ATTRIBUTES is a fixed three-element array. Forwarding any of these as a Prometheus label blows up time-series cardinality, and metric registration fails.

use chio_kernel::otel::{ATTR_CHIO_RECEIPT_ID, ATTR_GEN_AI_TOOL_CALL_ID};
pub const ATTR_CHIO_REPLAY_RUN_ID: &str = "chio.replay.run_id";
pub const PROMETHEUS_DENIED_ATTRIBUTES: [&str; 3] = [
    ATTR_GEN_AI_TOOL_CALL_ID,
    ATTR_CHIO_RECEIPT_ID,
    ATTR_CHIO_REPLAY_RUN_ID,
];

The exporter calls strip_denied_attributes on every span before signing the receipt body (denylist.rs:21-27), so the canonicalized form is stable across re-export. Ingest the same span twice and you get the same receipt body bytes.
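To make the stability claim concrete, here is a minimal sketch of denylist stripping followed by canonicalization. It is not the code in denylist.rs: `strip_denied` and `canonical_bytes` are hypothetical stand-ins, and the sorted `key=value` encoding is an assumption chosen only to keep the example self-contained.

```rust
use std::collections::BTreeMap;

// The same three keys as PROMETHEUS_DENIED_ATTRIBUTES, inlined as literals.
const DENIED: [&str; 3] = [
    "gen_ai.tool.call.id",
    "chio.receipt.id",
    "chio.replay.run_id",
];

/// Hypothetical stand-in for the sanitizer: drop denied keys in place.
fn strip_denied(attrs: &mut BTreeMap<String, String>) {
    attrs.retain(|key, _| !DENIED.contains(&key.as_str()));
}

/// Assumed canonical form for this sketch: `key=value` lines in key order.
/// BTreeMap iteration is already sorted by key, so the output is deterministic.
fn canonical_bytes(attrs: &BTreeMap<String, String>) -> Vec<u8> {
    let mut out = Vec::new();
    for (key, value) in attrs {
        out.extend_from_slice(key.as_bytes());
        out.push(b'=');
        out.extend_from_slice(value.as_bytes());
        out.push(b'\n');
    }
    out
}
```

Under these assumptions, two attribute maps that differ only in denied keys canonicalize to identical bytes, and stripping an already-stripped map changes nothing.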
The Receipt Exporter Crate
chio-otel-receipt-exporter is the bridge from OTLP trace ingress to the chio receipt store. Each accepted span gets canonicalized, signed, and appended to the configured ReceiptStore. The sanitizer strips the same denied attributes before signing so the canonical form is stable across re-export.
use chio_otel_receipt_exporter::{ReceiptStoreSink, ReceiptStoreSinkConfig};

let config = ReceiptStoreSinkConfig {
    signing_keypair, // Ed25519 keypair
    policy_hash: "otel-receipt-exporter".into(),
    default_capability_id: "otel-ingress".into(),
    default_tool_server: "otel-collector".into(),
    default_tool_name: "gen_ai.tool.call".into(),
    tenant_id: Some("acme-prod".into()),
};
let sink = ReceiptStoreSink::new(receipt_store, config);
let summary = sink.export_traces(&otlp_export)?;
println!("appended {} receipts", summary.appended_receipts);

The sink is single-purpose: it takes an OtlpGrpcTraceExport, produces one ChioReceipt per span (with ToolCallAction carrying the sanitized attributes), and appends each receipt through the same ReceiptStore trait the kernel uses. Span identity carries through: trace_id and span_id are validated before signing.
Trace Propagation
Chio respects W3C Trace Context. When the upstream caller passes a traceparent header, the kernel joins that trace and emits its spans as children. The receipt then carries provenance.otel.trace_id and provenance.otel.span_id as fields, which is what makes the Tempo and Jaeger dashboards able to pivot from a receipt to a trace.
That linkage is the M10 attribute lock. The required structured fields on every span and log line are:
- chio.receipt.id
- chio.tenant.id
- chio.policy.ref
- chio.verdict
- chio.tee.mode
- chio.deny.reason
- chio.guard.outcome
- provenance.otel.trace_id
- provenance.otel.span_id
- redaction_pass_id
- redaction_elapsed_micros
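When a dashboard panel stays empty, a quick way to narrow it down is to diff a span's attribute keys against the list above. The sketch below does that. `M10_REQUIRED` and `missing_m10_attributes` are hypothetical names for illustration and are not part of any chio crate.

```rust
// The eleven required field names from the M10 attribute lock, in list order.
const M10_REQUIRED: [&str; 11] = [
    "chio.receipt.id",
    "chio.tenant.id",
    "chio.policy.ref",
    "chio.verdict",
    "chio.tee.mode",
    "chio.deny.reason",
    "chio.guard.outcome",
    "provenance.otel.trace_id",
    "provenance.otel.span_id",
    "redaction_pass_id",
    "redaction_elapsed_micros",
];

/// Return the required names that do not appear in `present`,
/// preserving the order of the list above.
fn missing_m10_attributes(present: &[&str]) -> Vec<&'static str> {
    M10_REQUIRED
        .iter()
        .copied()
        .filter(|name| !present.contains(name))
        .collect()
}
```

Feeding it the attribute keys of one exported span or log line gives the exact fields to add before re-checking the dashboards.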
Metrics Scrape Config
Chio exposes Prometheus metrics over an HTTP /metrics endpoint. The WASM guard runtime publishes a fixed set of families documented in crates/chio-wasm-guards/src/metrics.rs:
- chio_guard_eval_duration_seconds: histogram, labels guard_id, verdict.
- chio_guard_fuel_consumed_total: counter, label guard_id.
- chio_guard_verdict_total: counter, labels guard_id, verdict.
- chio_guard_deny_total: counter, labels guard_id, reason_class.
- chio_guard_reload_total: counter, labels guard_id, outcome.
- chio_guard_host_call_duration_seconds: histogram, labels guard_id, host_fn.
- chio_guard_module_bytes: gauge, labels guard_id, epoch.
Total label cardinality is capped at MAX_GUARD_METRIC_CARDINALITY = 1024; registering more guards than that fails closed with E_GUARD_METRIC_CARDINALITY_EXCEEDED.
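The fail-closed behavior can be pictured with a small sketch. It assumes the cap is counted in distinct guard ids, which this page does not spell out, and `GuardMetricRegistry` and `RegisterError` are hypothetical types, not the ones in crates/chio-wasm-guards/src/metrics.rs.

```rust
use std::collections::BTreeSet;

const MAX_GUARD_METRIC_CARDINALITY: usize = 1024;

#[derive(Debug, PartialEq)]
enum RegisterError {
    /// Stands in for E_GUARD_METRIC_CARDINALITY_EXCEEDED.
    CardinalityExceeded,
}

/// Hypothetical registry that tracks which guard ids already own label sets.
#[derive(Default)]
struct GuardMetricRegistry {
    guard_ids: BTreeSet<String>,
}

impl GuardMetricRegistry {
    /// Register a guard. Re-registering a known id is a no-op; a new id
    /// past the cap is rejected rather than silently dropped (fail closed).
    fn register(&mut self, guard_id: &str) -> Result<(), RegisterError> {
        if self.guard_ids.contains(guard_id) {
            return Ok(());
        }
        if self.guard_ids.len() >= MAX_GUARD_METRIC_CARDINALITY {
            return Err(RegisterError::CardinalityExceeded);
        }
        self.guard_ids.insert(guard_id.to_string());
        Ok(())
    }
}
```

Rejecting at registration time keeps the failure visible at deploy or reload, instead of surfacing later as missing series on a dashboard.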
# prometheus.yml
scrape_configs:
  - job_name: 'chio'
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - chio-edge:9091
          - chio-tee:9091
          - chio-control:9091

The Five Shipped Dashboards
Five Grafana dashboard JSON files ship in deploy/dashboards/. All five expect the M10 attribute lock and use Grafana datasource variables (DS_LOKI, DS_TEMPO, DS_JAEGER) so import works across stacks. The one-command import script lives at the top of deploy/dashboards/README.md.
loki/chio-tee.json
Title: Chio TEE Loki receipt view. Targets Loki. Four panels: TEE records by verdict, TEE mode and verdict density heatmap, raw log stream, and a receipt-to-trace lookup table (top 50 by receipt id, with policy ref, verdict, deny reason). Filterable by receipt id, tenant id, trace id, verdict, and TEE mode.
loki/verdict-drift.json
Title: Chio replay verdict drift. Targets Loki. Filters chio-replay logs to event="verdict_drift" and renders a heatmap of from_verdict × to_verdict, a stacked time series of reason_delta, a top-100 drift records table, and a raw drift log stream.
jaeger/receipt-span-lookup.json
Title: Chio Jaeger receipt span lookup. Targets Jaeger. Pivots from receipt id, trace id, span id, tenant id, and tool name to traces. Four panels: trace search by tags, table of receipt and span lookups, traces-by-verdict time series, and trace-duration time series.
tempo/span-timeline.json
Title: Chio Tempo span timeline. Targets Tempo. The primary panel runs the TraceQL query { span.chio.receipt.id = "$receipt_id" } to fetch every trace that touched a given receipt. Additional panels: a span rows table, a GenAI tool-call rate time series, and a span-duration-by-verdict line chart.
tempo/redaction-latency.json
Title: Chio Tempo redaction latency. Targets Tempo. Tracks redaction-pass latency by redaction_pass_id and chio.guard.outcome. Four panels: p95 latency, latency distribution histogram, redaction pass spans table, and span rate.
Alert Rules
The following alert rules are the operator's minimum kit. All assume Prometheus, the metric families above, and the M10 attribute lock.
Deny Rate Spike
# A 5-minute deny rate that doubles versus the prior hour.
(
  sum by (guard_id) (rate(chio_guard_deny_total[5m]))
)
> on (guard_id)
2 *
(
  sum by (guard_id) (rate(chio_guard_deny_total[1h] offset 5m))
)

Fuel Exhaustion Tail Latency
# p99 guard evaluation duration at or over 250ms (top of the histogram bucket
# range from EVAL_DURATION_BUCKETS_SECONDS). histogram_quantile never returns
# more than the highest finite bucket bound, so the comparison is >=, not >.
histogram_quantile(
  0.99,
  sum by (guard_id, le) (
    rate(chio_guard_eval_duration_seconds_bucket[5m])
  )
) >= 0.25

Hot-Reload Canary Failed
# Any non-zero canary-failure count over five minutes.
sum(
  increase(chio_guard_reload_total{outcome="canary_failed"}[5m])
) > 0

Deny Reason Class Surge
# A surge in a single reason_class (pii, secret, prompt_injection, ...)
sum by (reason_class) (rate(chio_guard_deny_total[5m]))
  > 5
unless on (reason_class)
sum by (reason_class) (rate(chio_guard_deny_total[1h] offset 5m)) > 5

Two more alerts come from log queries rather than metrics, so they live in your Loki ruler instead of Prometheus:
- Verdict drift over threshold. The verdict-drift dashboard query, but as a rule: sum by (chio_policy_ref) (count_over_time({service_name="chio-replay"} | json | event="verdict_drift" [10m])) over a configured threshold per policy ref.
- Federation peer staleness. Watch for absence: alert when a peer's last heartbeat log line is older than the configured peer freshness window.
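The peer staleness rule is an absence check, which is easy to get wrong because a peer that never logged a heartbeat produces no series at all. A minimal sketch of the decision logic follows. The function name, the expected-peer list, and the Unix-seconds timestamps are all assumptions for illustration; the real rule would live in the Loki ruler.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Return the expected peers that are stale, in the order given.
/// A peer is stale when its last heartbeat is older than `freshness`,
/// or when no heartbeat has been seen for it at all.
/// `last_heartbeat` maps peer name to the Unix-seconds timestamp of its
/// most recent heartbeat log line (an assumed shape for this sketch).
fn stale_peers(
    expected: &[&str],
    last_heartbeat: &BTreeMap<String, u64>,
    now_unix: u64,
    freshness: Duration,
) -> Vec<String> {
    expected
        .iter()
        .filter(|peer| match last_heartbeat.get(**peer) {
            Some(seen) => now_unix.saturating_sub(*seen) > freshness.as_secs(),
            None => true, // never heard from: treat as stale
        })
        .map(|peer| (*peer).to_string())
        .collect()
}
```

Driving the check from an explicit list of expected peers is the design choice that matters here: it is what lets a silent peer show up as stale instead of simply disappearing from the result.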
SIEM Integration
For long-term retention and security workflow, receipts go to a SIEM rather than a metrics or trace backend. The exporter is independent of the kernel and uses a sequence cursor so it never loses data on retry. Read /docs/deployment/siem-export for the design, configuration, and OCSF field mapping.
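The sequence-cursor idea can be sketched in a few lines: the exporter remembers the highest sequence number the SIEM has acknowledged and only moves that marker after a successful send. `Row`, `CursorExporter`, and the `send` callback below are hypothetical and do not reflect the actual exporter API; see the SIEM export page for the real design.

```rust
/// Hypothetical receipt row: a monotonically increasing sequence number
/// plus an opaque body.
#[derive(Clone, Debug, PartialEq)]
struct Row {
    seq: u64,
    body: String,
}

/// Hypothetical exporter state: the highest sequence number acknowledged so far.
struct CursorExporter {
    cursor: u64,
}

impl CursorExporter {
    /// Send every row with `seq > cursor`. The cursor advances only after
    /// `send` returns Ok, so a failed attempt is retried from the same
    /// position on the next call. Returns the number of rows delivered.
    fn export<F>(&mut self, store: &[Row], mut send: F) -> Result<usize, String>
    where
        F: FnMut(&[Row]) -> Result<(), String>,
    {
        let pending: Vec<Row> = store
            .iter()
            .filter(|row| row.seq > self.cursor)
            .cloned()
            .collect();
        if pending.is_empty() {
            return Ok(0);
        }
        send(&pending)?; // on Err the cursor is left untouched
        self.cursor = pending.iter().map(|row| row.seq).max().unwrap_or(self.cursor);
        Ok(pending.len())
    }
}
```

In this sketch a send that succeeds remotely but fails locally would be retried, so the receiving side should expect possible duplicates and deduplicate on the sequence number or receipt id.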
Worked Example: docker-compose
A minimal stack that runs an OTel collector, Prometheus, Loki, Tempo, and Grafana, and imports the five dashboards on startup. Adapt the chio service definition to your deployment shape.
# docker-compose.yaml
services:
  chio-edge:
    image: chio-sidecar:latest
    environment:
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
      OTEL_SERVICE_NAME: chio-edge
    ports:
      - "9091:9091" # /metrics
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel-collector.yaml:ro
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml:ro
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    ports:
      - "3000:3000"
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/ds.yaml:ro

Once Grafana is up, the dashboard import is one command from the chio repo root:
$ find deploy/dashboards -name '*.json' -print0 | \
    while IFS= read -r -d '' dashboard; do
      jq -n --argjson dashboard "$(cat "$dashboard")" \
        '{dashboard: $dashboard, overwrite: true}' \
      | curl -fsS -H 'Content-Type: application/json' \
          -X POST http://admin:admin@127.0.0.1:3000/api/dashboards/db -d @- \
          >/dev/null
    done

Confirm imports
Open http://localhost:3000 and confirm five dashboards appear under the chio tag. If a panel renders empty, the most likely cause is a missing M10 attribute on your spans or logs; cross-check against deploy/dashboards/README.md.

Related Reading
- Guard-platform Observability for the metric family catalog, label vocabulary, and log-line conventions every chio crate emits.
- SIEM Export for receipt forwarding to Splunk HEC and Elasticsearch with dead-letter handling.
- TEE Deployment for the chio-tee Loki dashboard and TEE-specific signals.
- Receipt Dashboard for the receipt-first operator UI that is separate from Grafana.