Jailbreak & Injection Guards

Not in the default pipeline

These guards are opt-in by design. GuardPipeline::default_pipeline() does not include them. Reasons: regex and feature-extraction overhead on cache-miss, the false-positive risk of LLM-shaped detectors, and the volatility of the heuristic table as attacker techniques shift. Register them via kernel.add_guard(Box::new(JailbreakGuard::default())) when you accept those tradeoffs.

Heuristic regex signals carry a weight of 0.70 in the blended score; statistical features (0.10) and the linear model (0.20) carry the remaining weight. The guard denies when the blended score reaches the default threshold of 0.75.

JailbreakGuard

Source: crates/guards/chio-guards/src/jailbreak.rs (guard wrapper) and jailbreak_detector.rs (detector engine). Guard name: jailbreak.

Struct

crates/guards/chio-guards/src/jailbreak.rs

#[derive(Clone, Debug)]
pub struct JailbreakGuardConfig {
    pub threshold: f32,
    pub layer_weights: LayerWeights,
    pub fingerprint_dedup_capacity: usize,
    pub detector: DetectorConfig,
}

pub struct JailbreakGuard {
    config: JailbreakGuardConfig,
    detector: JailbreakDetector,
    dedup: Mutex<LruCache<String, bool>>,
}

pub const DEFAULT_FINGERPRINT_CAPACITY: usize = 1024;

Defaults

Knob	Type	Default	Source constant
`threshold`	`f32`	`0.75`	`DEFAULT_DENY_THRESHOLD`
`layer_weights.heuristic`	`f32`	`0.70`	`LayerWeights::default()`
`layer_weights.statistical`	`f32`	`0.10`	`LayerWeights::default()`
`layer_weights.ml`	`f32`	`0.20`	`LayerWeights::default()`
`layer_weights.heuristic_divisor`	`f32`	`1.0`	`LayerWeights::default()`
`fingerprint_dedup_capacity`	`usize`	`1024`	`DEFAULT_FINGERPRINT_CAPACITY`
`detector.max_scan_bytes`	`usize`	`65536` (64 KiB)	`DEFAULT_MAX_SCAN_BYTES`

Heuristic signals (layer 1)

Patterns operate over canonicalized text (lowercased ASCII, zero-widths stripped, homoglyphs folded, separator runs collapsed). Each pattern fires a stable signal ID:

Signal ID	Category	Weight	Pattern intent
`jb_ignore_policy`	`AuthorityConfusion`	`0.9`	"ignore/disregard/bypass/override/disable" near "policy/rules/safety/guardrails/safeguards".
`jb_dan_unfiltered`	`RolePlay`	`0.9`	DAN, do-anything-now, evil-confidant, unfiltered, unrestricted, jailbreak.
`jb_system_prompt_extraction`	`InstructionExtraction`	`0.95`	"reveal/show/print/output/leak" near "system prompt", "developer instructions", "hidden instructions".
`jb_role_change`	`RolePlay`	`0.7`	"you are now", "act as", "pretend to be", "roleplay as", "from now on you are".
`jb_encoded_payload`	`EncodingAttack`	`0.6`	base64, rot13, url-encode, decode-this, decode-the-following.
`jb_developer_mode`	`AuthorityConfusion`	`0.8`	developer/debug/god/admin/sudo mode, enable developer/debug mode.
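
The canonicalization steps named above can be sketched as follows. This is a minimal illustration and not the crate's implementation: the zero-width set, the homoglyph table, and the choice of separator characters (whitespace, `-`, `_`, `.`) are all assumptions made for the example, and the function names are hypothetical.

```rust
/// Zero-width codepoints stripped by this sketch (assumed set).
fn is_zero_width(c: char) -> bool {
    matches!(c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}')
}

/// Folds a few look-alike characters to ASCII. Illustrative table only.
fn fold_homoglyph(c: char) -> char {
    match c {
        '\u{0430}' => 'a', // Cyrillic small a
        '\u{0435}' => 'e', // Cyrillic small ie
        '\u{043E}' => 'o', // Cyrillic small o
        '\u{0456}' => 'i', // Cyrillic small Byelorussian-Ukrainian i
        other => other,
    }
}

/// Strip zero-widths, fold homoglyphs, lowercase ASCII, collapse separator runs.
fn canonicalize(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut last_was_sep = false;
    for c in input.chars().filter(|c| !is_zero_width(*c)) {
        let c = fold_homoglyph(c).to_ascii_lowercase();
        // Assumed separator set: whitespace plus a few punctuation marks.
        let is_sep = c.is_whitespace() || matches!(c, '-' | '_' | '.');
        if is_sep {
            if !last_was_sep {
                out.push(' '); // a whole run of separators becomes one space
            }
            last_was_sep = true;
        } else {
            out.push(c);
            last_was_sep = false;
        }
    }
    out.trim().to_string()
}
```

Running the patterns over this normalized form is what lets a single regex match `IGNORE   the--Rules` and `ignore the rules` alike.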

Statistical signals (layer 2)

Each fires a stable ID under category AdversarialSuffix. Each fired signal adds a fixed 0.2 to the layer score, so with at most five signals the layer score stays within [0.0, 1.0].

Signal ID	Threshold default	Constant	Test
`stat_punctuation_ratio_high`	`0.35`	`DEFAULT_PUNCT_RATIO_THRESHOLD`	non-whitespace symbol fraction >= threshold
`stat_char_entropy_high`	`4.8`	`DEFAULT_ENTROPY_THRESHOLD`	Shannon bits/char of non-whitespace ASCII
`stat_long_symbol_run`	`12`	`DEFAULT_SYMBOL_RUN_MIN`	contiguous run of non-alnum non-whitespace chars of this length
`stat_low_shingle_uniqueness`	`0.35`	`DEFAULT_SHINGLE_UNIQUENESS_THRESHOLD`	char-3-gram uniqueness ratio < threshold
`stat_zero_width_obfuscation`	> 0	(no constant)	zero-width codepoints in original input (counted before canonicalization)

Default shingle size is DEFAULT_SHINGLE_N = 3 (char 3-grams).
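
A rough sketch of how the five statistical tests could be computed, using only the standard library. The thresholds are the defaults from the table; everything else is an assumption for illustration: the function names, the definition of "symbol" as non-alphanumeric non-whitespace, the `>=` comparator on entropy, and the zero-width codepoint set.

```rust
use std::collections::{HashMap, HashSet};

fn is_symbol(c: char) -> bool {
    !c.is_alphanumeric() && !c.is_whitespace()
}

/// Fraction of non-whitespace characters that are symbols.
fn punctuation_ratio(text: &str) -> f32 {
    let total = text.chars().filter(|c| !c.is_whitespace()).count();
    if total == 0 {
        return 0.0;
    }
    let symbols = text.chars().filter(|c| is_symbol(*c)).count();
    symbols as f32 / total as f32
}

/// Shannon entropy in bits per character over non-whitespace ASCII.
fn char_entropy(text: &str) -> f32 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars().filter(|c| c.is_ascii() && !c.is_whitespace()) {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f32 / total as f32;
            -p * p.log2()
        })
        .sum()
}

/// Length of the longest contiguous run of symbol characters.
fn longest_symbol_run(text: &str) -> usize {
    let (mut best, mut cur) = (0usize, 0usize);
    for c in text.chars() {
        if is_symbol(c) {
            cur += 1;
            best = best.max(cur);
        } else {
            cur = 0;
        }
    }
    best
}

/// Unique char n-grams divided by total char n-grams.
fn shingle_uniqueness(text: &str, n: usize) -> f32 {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return 1.0;
    }
    let total = chars.len() - n + 1;
    let unique: HashSet<&[char]> = chars.windows(n).collect();
    unique.len() as f32 / total as f32
}

/// 0.2 per fired signal; zero-widths are counted on the original input.
fn statistical_score(original: &str, canonical: &str) -> f32 {
    let zero_width = original
        .chars()
        .filter(|c| matches!(c, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{FEFF}'))
        .count();
    let fired = [
        punctuation_ratio(canonical) >= 0.35,
        char_entropy(canonical) >= 4.8, // comparator assumed
        longest_symbol_run(canonical) >= 12,
        shingle_uniqueness(canonical, 3) < 0.35,
        zero_width > 0,
    ];
    let n = fired.iter().filter(|&&f| f).count();
    (n as f32 * 0.2).clamp(0.0, 1.0)
}
```

Ordinary prose such as `hello world` fires none of the tests, while a short prefix followed by a long run of `!` plus an embedded zero-width character fires four of the five.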

ML layer (layer 3)

A small linear model with sigmoid activation. Inputs are 0/1 feature flags from layers 1 and 2 plus the (continuous) punctuation ratio. Default weights from LinearModel::default():

crates/guards/chio-guards/src/jailbreak_detector.rs

bias: -2.0,
w_ignore_policy: 2.5,
w_dan: 2.0,
w_role_change: 1.5,
w_prompt_extraction: 2.2,
w_encoded: 1.0,
w_developer_mode: 2.0,
w_punct: 2.0,
w_symbol_run: 1.5,
w_low_shingle_uniqueness: 1.2,
w_zero_width: 1.0,

Output is sigmoid(bias + sum(w_i * x_i)) clamped to [0.0, 1.0]. With the default bias of -2.0, input that sets no flags and has a negligible punctuation ratio produces an ML score of about 0.12 (sigmoid(-2.0) ≈ 0.119). A multi-flag attack saturates the output toward 1.0.
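
The formula can be checked with a small standalone sketch. The weights are the defaults listed above; the `Features` struct and the `ml_score` function are hypothetical names introduced here and do not reflect the detector's actual types.

```rust
/// Hypothetical feature vector: 0/1 flags plus the continuous punctuation ratio.
#[derive(Default)]
struct Features {
    ignore_policy: bool,
    dan: bool,
    role_change: bool,
    prompt_extraction: bool,
    encoded: bool,
    developer_mode: bool,
    punct_ratio: f32,
    symbol_run: bool,
    low_shingle_uniqueness: bool,
    zero_width: bool,
}

fn flag(b: bool) -> f32 {
    if b { 1.0 } else { 0.0 }
}

/// sigmoid(bias + sum(w_i * x_i)) with the default weights from the listing above.
fn ml_score(f: &Features) -> f32 {
    let z = -2.0
        + 2.5 * flag(f.ignore_policy)
        + 2.0 * flag(f.dan)
        + 1.5 * flag(f.role_change)
        + 2.2 * flag(f.prompt_extraction)
        + 1.0 * flag(f.encoded)
        + 2.0 * flag(f.developer_mode)
        + 2.0 * f.punct_ratio
        + 1.5 * flag(f.symbol_run)
        + 1.2 * flag(f.low_shingle_uniqueness)
        + 1.0 * flag(f.zero_width);
    (1.0 / (1.0 + (-z).exp())).clamp(0.0, 1.0)
}
```

With no flags set the logit is just the bias, giving roughly 0.12. Setting `ignore_policy`, `dan`, and `developer_mode` together raises the logit to 4.5, which maps to about 0.99.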

Layer blend

crates/guards/chio-guards/src/jailbreak_detector.rs

let h_div = weights.heuristic_divisor.max(f32::EPSILON);
let h_clamped = (heuristic_score / h_div).clamp(0.0, 1.0);
let s_clamped = statistical_score.clamp(0.0, 1.0);
let score = (h_clamped * weights.heuristic
    + s_clamped * weights.statistical
    + ml_score * weights.ml)
    .clamp(0.0, 1.0);

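With defaults, a single dominant heuristic at weight 0.95 contributes 0.95 * 0.70 = 0.665 on its own, so the other layers must supply at least 0.085 to reach the 0.75 deny threshold. The ML layer alone covers that gap: with only the prompt-extraction flag set, the default model outputs sigmoid(-2.0 + 2.2) ≈ 0.55, which adds about 0.11 at weight 0.20 for a total near 0.775. Multi-pattern attacks push the blended score toward 1.0.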
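
To check the arithmetic above, the blend can be reproduced outside the crate. The `LayerWeights` struct below is a local mirror of the documented defaults, written for this example only; the `blend` function name is hypothetical and wraps the same expression shown in the excerpt.

```rust
/// Local mirror of the documented default weights (not the crate's type).
#[derive(Clone, Copy)]
struct LayerWeights {
    heuristic: f32,
    statistical: f32,
    ml: f32,
    heuristic_divisor: f32,
}

impl Default for LayerWeights {
    fn default() -> Self {
        Self { heuristic: 0.70, statistical: 0.10, ml: 0.20, heuristic_divisor: 1.0 }
    }
}

/// Same expression as the excerpt: clamp each layer, weight it, clamp the sum.
fn blend(weights: &LayerWeights, heuristic_score: f32, statistical_score: f32, ml_score: f32) -> f32 {
    let h_div = weights.heuristic_divisor.max(f32::EPSILON);
    let h_clamped = (heuristic_score / h_div).clamp(0.0, 1.0);
    let s_clamped = statistical_score.clamp(0.0, 1.0);
    (h_clamped * weights.heuristic + s_clamped * weights.statistical + ml_score * weights.ml)
        .clamp(0.0, 1.0)
}
```

Calling `blend(&LayerWeights::default(), 0.95, 0.0, 0.0)` gives about 0.665, below the threshold; raising the ML input to 0.55 gives about 0.775, above it. The `max(f32::EPSILON)` guard keeps a zero divisor from producing a division by zero.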

Fingerprint dedup

The guard hashes the canonicalized text with SHA-256 and uses the first 8 bytes (16 hex chars) as a fingerprint, which keys a bounded LRU of prior verdicts. On a hit, if the prior verdict was deny, the guard returns deny without re-running detection. Capacity defaults to 1024; a capacity of 0 is internally rounded up to NonZeroUsize::MIN (that is, 1).
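
The dedup flow can be sketched with the standard library alone. Two substitutions are made here and should be read as assumptions: the fingerprint uses `DefaultHasher` as a stand-in because SHA-256 is not in `std` (it happens to yield 8 bytes, so the key is still 16 hex chars), and `VerdictCache` is a simple hand-rolled LRU rather than the `LruCache` type the guard holds. All names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, VecDeque};
use std::hash::{Hash, Hasher};
use std::num::NonZeroUsize;

/// Stand-in fingerprint: 16 hex chars from a 64-bit std hash. NOT SHA-256.
fn fingerprint(canonical: &str) -> String {
    let mut h = DefaultHasher::new();
    canonical.hash(&mut h);
    format!("{:016x}", h.finish())
}

/// Minimal bounded LRU of fingerprint -> denied.
struct VerdictCache {
    capacity: NonZeroUsize,
    verdicts: HashMap<String, bool>,
    order: VecDeque<String>, // front = least recently used
}

impl VerdictCache {
    fn new(capacity: usize) -> Self {
        // A requested capacity of 0 is rounded up to 1.
        let capacity = NonZeroUsize::new(capacity).unwrap_or(NonZeroUsize::MIN);
        Self { capacity, verdicts: HashMap::new(), order: VecDeque::new() }
    }

    fn touch(&mut self, key: &str) {
        self.order.retain(|k| k.as_str() != key);
        self.order.push_back(key.to_string());
    }

    fn get(&mut self, key: &str) -> Option<bool> {
        let hit = self.verdicts.get(key).copied();
        if hit.is_some() {
            self.touch(key);
        }
        hit
    }

    fn put(&mut self, key: String, denied: bool) {
        if !self.verdicts.contains_key(&key) && self.verdicts.len() >= self.capacity.get() {
            if let Some(oldest) = self.order.pop_front() {
                self.verdicts.remove(&oldest);
            }
        }
        self.touch(&key);
        self.verdicts.insert(key, denied);
    }
}

/// Returns true for deny. A cached deny skips the detector entirely.
fn evaluate(cache: &mut VerdictCache, canonical: &str, detect: impl Fn(&str) -> bool) -> bool {
    let key = fingerprint(canonical);
    if cache.get(&key) == Some(true) {
        return true;
    }
    let denied = detect(canonical);
    cache.put(key, denied);
    denied
}
```

Only a cached deny short-circuits in this sketch, matching the behavior described above; what the guard does on a cached allow is not stated in this section, so the sketch simply re-runs detection in that case.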

Failure modes

Empty / whitespace-only text :: allow.
Heuristic regex compile failure :: tracing::error! and the signal is dropped from the pattern table for the process lifetime. Detection continues without it.
Mutex poisoning :: deny (Verdict::Deny returned directly from evaluate_text).

The detector port preserves the upstream signal IDs (jb_ignore_policy, etc.) intentionally so log-analysis tooling that knows the upstream taxonomy continues to work.

PromptInjectionGuard

Source: crates/guards/chio-guards/src/prompt_injection.rs. Guard name: prompt-injection. Six regex signals over canonicalized input. Score is the sum of fired signal weights; deny when total >= threshold.

Struct

crates/guards/chio-guards/src/prompt_injection.rs

#[derive(Clone, Debug)]
pub struct PromptInjectionConfig {
    pub score_threshold: f32,
    pub max_scan_bytes: usize,
    pub fingerprint_capacity: usize,
}

pub struct PromptInjectionGuard {
    config: PromptInjectionConfig,
    patterns: Patterns,
    dedup: Mutex<LruCache<String, bool>>,
}

Defaults

Knob	Default	Constant
`score_threshold`	`0.8`	`DEFAULT_SCORE_THRESHOLD`
`max_scan_bytes`	`65536` (64 KiB)	`DEFAULT_MAX_SCAN_BYTES`
`fingerprint_capacity`	`1024`	`DEFAULT_FINGERPRINT_CAPACITY`

Signals

Variant	Stable ID	Default weight	Catches
`InstructionOverride`	`instruction_override`	`0.9`	"ignore previous instructions"-style override.
`RoleInjection`	`role_injection`	`0.4`	"you are now", `<|assistant|>`.
`DelimiterInjection`	`delimiter_injection`	`0.3`	System-role delimiter tokens (`[system]`, `[/INST]`, etc.).
`OutputHijack`	`output_hijack`	`0.3`	"respond with exactly", "output only".
`ToolChainHijack`	`tool_chain_hijack`	`0.3`	"call tool X with", "use function X to".
`ExfiltrationFraming`	`exfiltration_framing`	`0.5`	URLs / email / POST language near data tokens.

At the default 0.8 threshold, InstructionOverride alone clears the bar. Other signals require corroboration; e.g., role injection plus exfiltration framing reaches 0.9.
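
The additive scoring is simple enough to sketch directly. The variant names, IDs, and weights come from the table above; the `Signal` enum layout and the `score` and `denies` helpers are hypothetical, and the sketch assumes each signal fires at most once per input.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Signal {
    InstructionOverride,
    RoleInjection,
    DelimiterInjection,
    OutputHijack,
    ToolChainHijack,
    ExfiltrationFraming,
}

impl Signal {
    /// Stable ID as listed in the signals table.
    fn id(self) -> &'static str {
        match self {
            Signal::InstructionOverride => "instruction_override",
            Signal::RoleInjection => "role_injection",
            Signal::DelimiterInjection => "delimiter_injection",
            Signal::OutputHijack => "output_hijack",
            Signal::ToolChainHijack => "tool_chain_hijack",
            Signal::ExfiltrationFraming => "exfiltration_framing",
        }
    }

    /// Default weight as listed in the signals table.
    fn weight(self) -> f32 {
        match self {
            Signal::InstructionOverride => 0.9,
            Signal::RoleInjection => 0.4,
            Signal::DelimiterInjection => 0.3,
            Signal::OutputHijack => 0.3,
            Signal::ToolChainHijack => 0.3,
            Signal::ExfiltrationFraming => 0.5,
        }
    }
}

/// Sum of the weights of all fired signals.
fn score(fired: &[Signal]) -> f32 {
    fired.iter().map(|s| s.weight()).sum()
}

/// Deny when the total reaches the threshold.
fn denies(fired: &[Signal], threshold: f32) -> bool {
    score(fired) >= threshold
}
```

Under these weights, two of the 0.3 signals together reach only 0.6, so a delimiter token plus an "output only" phrase still passes at the default threshold.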

Failure modes

Empty input :: allow.
Mutex poisoning :: Verdict::Deny (fail-closed).
Unrecognized ToolAction variant :: allow (the guard does not apply).
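
The fail-closed handling of a poisoned mutex follows a common Rust pattern, sketched below. The `Verdict` enum and the `Vec<String>` cache are local stand-ins for this example, and `evaluate_with_cache` is a hypothetical name; only the shape of the `lock()` match is the point.

```rust
use std::sync::Mutex;

/// Local stand-in for the guard verdict type.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Deny,
}

/// Empty input is allowed; a poisoned lock is treated as deny (fail-closed).
fn evaluate_with_cache(cache: &Mutex<Vec<String>>, text: &str) -> Verdict {
    if text.trim().is_empty() {
        return Verdict::Allow;
    }
    let mut guard = match cache.lock() {
        Ok(g) => g,
        // Another thread panicked while holding the lock. The cache state
        // can no longer be trusted, so refuse rather than guess.
        Err(_poisoned) => return Verdict::Deny,
    };
    guard.push(text.to_string()); // placeholder for the real dedup and detection work
    Verdict::Allow
}
```

Denying on poison trades availability for safety: after a panic inside the critical section, every later request on that guard instance is refused until the process is restarted or the guard is rebuilt.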

Text candidate extraction

Both guards share an extraction routine:

Pull strings from the action variant: code body for CodeExecution, query for DatabaseQuery, endpoint for ExternalApiCall.
Recurse into arguments JSON and append every string leaf.
Drop empty / whitespace-only strings.

Each candidate runs through the dedup cache, then through detection. First deny short-circuits the candidate loop.
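
The extraction steps can be sketched as a recursive walk. To stay self-contained, the example defines its own small `Json` enum instead of using a JSON library, and its `ToolAction` enum lists only the three variants named above with assumed field names; both are hypothetical stand-ins, as are the function names.

```rust
/// Minimal JSON value for illustration (a real build would use a JSON library).
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Stand-in for the action type; field names are assumed.
enum ToolAction {
    CodeExecution { code: String },
    DatabaseQuery { query: String },
    ExternalApiCall { endpoint: String },
}

/// Depth-first walk that appends every string leaf (object keys are skipped).
fn collect_string_leaves(value: &Json, out: &mut Vec<String>) {
    match value {
        Json::Str(s) => out.push(s.clone()),
        Json::Arr(items) => {
            for item in items {
                collect_string_leaves(item, out);
            }
        }
        Json::Obj(fields) => {
            for (_key, item) in fields {
                collect_string_leaves(item, out);
            }
        }
        Json::Null | Json::Bool(_) | Json::Num(_) => {}
    }
}

/// Variant string first, then argument leaves, then drop blank candidates.
fn text_candidates(action: &ToolAction, arguments: &Json) -> Vec<String> {
    let mut out = Vec::new();
    match action {
        ToolAction::CodeExecution { code } => out.push(code.clone()),
        ToolAction::DatabaseQuery { query } => out.push(query.clone()),
        ToolAction::ExternalApiCall { endpoint } => out.push(endpoint.clone()),
    }
    collect_string_leaves(arguments, &mut out);
    out.retain(|s| !s.trim().is_empty());
    out
}

/// `any` stops at the first candidate the detector denies.
fn any_denied(candidates: &[String], detect: impl Fn(&str) -> bool) -> bool {
    candidates.iter().any(|c| detect(c))
}
```

Because `any` is short-circuiting, candidates after the first deny are never passed to the detector, which mirrors the loop behavior described above.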

Composition

rust

use chio_guards::{JailbreakGuard, PromptInjectionGuard};

let mut pipeline = chio_guards::GuardPipeline::default_pipeline();

// Cheap signal first.
pipeline.add(Box::new(PromptInjectionGuard::default()));

// Heavier multi-layer detector second.
pipeline.add(Box::new(JailbreakGuard::default()));

Both guards share the same canonicalization pass and the same fingerprint shape (first 8 bytes of SHA-256), so running them back-to-back on the same input uses the same cache-key layout. The guards still maintain separate LRU caches.

When to enable jailbreak vs injection

PromptInjectionGuard is the cheaper, narrower detector aimed at attacks against your tool wiring (override + exfil). JailbreakGuard is broader and catches role-play and policy-override framing against the model. Use injection before LLM calls; add jailbreak on paths where model output is consumed by downstream tools.

Next Steps

Response Sanitization :: complement: scan output for secrets and PII.
Shell & Code Guards :: enforce structural restrictions on submitted code in addition to textual scanning.
External Guards :: route the same text to cloud content-safety APIs.