Jailbreak & Injection Guards
Two opt-in guards scan free-form text for adversarial framing. JailbreakGuard wraps a three-layer detector (heuristic regex, statistical features, linear ML scorer) and denies when the blended score crosses a threshold. PromptInjectionGuard runs a smaller six-signal regex over the same canonicalized input. Both share a SHA-256 fingerprint LRU so retried payloads short-circuit. Neither is in the default pipeline.
Not in the default pipeline
GuardPipeline::default_pipeline() does not include them. Reasons: regex and feature-extraction overhead on cache-miss, the false-positive risk of LLM-shaped detectors, and the volatility of the heuristic table as attacker techniques shift. Register them via kernel.add_guard(Box::new(JailbreakGuard::default())) when you accept those tradeoffs.
JailbreakGuard
Source: crates/chio-guards/src/jailbreak.rs (guard wrapper) and jailbreak_detector.rs (detector engine). Guard name: jailbreak.
Struct
#[derive(Clone, Debug)]
pub struct JailbreakGuardConfig {
    pub threshold: f32,
    pub layer_weights: LayerWeights,
    pub fingerprint_dedup_capacity: usize,
    pub detector: DetectorConfig,
}
pub struct JailbreakGuard {
    config: JailbreakGuardConfig,
    detector: JailbreakDetector,
    dedup: Mutex<LruCache<String, bool>>,
}
pub const DEFAULT_FINGERPRINT_CAPACITY: usize = 1024;
Defaults
| Knob | Type | Default | Source constant |
|---|---|---|---|
| threshold | f32 | 0.75 | DEFAULT_DENY_THRESHOLD |
| layer_weights.heuristic | f32 | 0.70 | LayerWeights::default() |
| layer_weights.statistical | f32 | 0.10 | LayerWeights::default() |
| layer_weights.ml | f32 | 0.20 | LayerWeights::default() |
| layer_weights.heuristic_divisor | f32 | 1.0 | LayerWeights::default() |
| fingerprint_dedup_capacity | usize | 1024 | DEFAULT_FINGERPRINT_CAPACITY |
| detector.max_scan_bytes | usize | 65536 (64 KiB) | DEFAULT_MAX_SCAN_BYTES |
Heuristic signals (layer 1)
Patterns operate over canonicalized text (lowercased ASCII, zero-widths stripped, homoglyphs folded, separator runs collapsed). Each pattern fires a stable signal ID:
| Signal ID | Category | Weight | Pattern intent |
|---|---|---|---|
| jb_ignore_policy | AuthorityConfusion | 0.9 | "ignore/disregard/bypass/override/disable" near "policy/rules/safety/guardrails/safeguards". |
| jb_dan_unfiltered | RolePlay | 0.9 | DAN, do-anything-now, evil-confidant, unfiltered, unrestricted, jailbreak. |
| jb_system_prompt_extraction | InstructionExtraction | 0.95 | "reveal/show/print/output/leak" near "system prompt", "developer instructions", "hidden instructions". |
| jb_role_change | RolePlay | 0.7 | "you are now", "act as", "pretend to be", "roleplay as", "from now on you are". |
| jb_encoded_payload | EncodingAttack | 0.6 | base64, rot13, url-encode, decode-this, decode-the-following. |
| jb_developer_mode | AuthorityConfusion | 0.8 | developer/debug/god/admin/sudo mode, enable developer/debug mode. |
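The canonicalization steps named above (lowercasing, zero-width stripping, homoglyph folding, separator collapsing) can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: the function name canonicalize, the zero-width set, the three-entry homoglyph map, and the choice of separators are all assumptions made for the example.

```rust
/// Hypothetical canonicalizer following the four steps named above.
/// The zero-width set, homoglyph map, and separator set are illustrative
/// assumptions, not the crate's actual tables.
fn canonicalize(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut last_was_sep = false;
    for ch in input.chars() {
        // Strip common zero-width codepoints.
        if matches!(ch, '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}') {
            continue;
        }
        // Fold a few Cyrillic look-alikes to ASCII (tiny sample map).
        let folded = match ch {
            '\u{0430}' => 'a',
            '\u{0435}' => 'e',
            '\u{043E}' => 'o',
            other => other,
        };
        // Lowercase ASCII letters; other characters pass through unchanged.
        let lowered = folded.to_ascii_lowercase();
        // Collapse any run of separators into a single space.
        if lowered.is_whitespace() || matches!(lowered, '_' | '-') {
            if !last_was_sep && !out.is_empty() {
                out.push(' ');
            }
            last_was_sep = true;
        } else {
            out.push(lowered);
            last_was_sep = false;
        }
    }
    out.trim_end().to_string()
}
```

Under these assumptions, an input that hides "ignore the rules" behind case changes, a zero-width space, and doubled separators reduces to the plain phrase, which is what lets one regex per signal cover many surface variants.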
Statistical signals (layer 2)
Each fires a stable ID under category AdversarialSuffix and contributes a fixed 0.2 to the layer score, so with at most five signals the layer stays within [0.0, 1.0].
| Signal ID | Threshold default | Constant | Test |
|---|---|---|---|
| stat_punctuation_ratio_high | 0.35 | DEFAULT_PUNCT_RATIO_THRESHOLD | non-whitespace symbol fraction >= threshold |
| stat_char_entropy_high | 4.8 | DEFAULT_ENTROPY_THRESHOLD | Shannon bits/char of non-whitespace ASCII |
| stat_long_symbol_run | 12 | DEFAULT_SYMBOL_RUN_MIN | contiguous run of non-alnum non-whitespace chars of this length |
| stat_low_shingle_uniqueness | 0.35 | DEFAULT_SHINGLE_UNIQUENESS_THRESHOLD | char-3-gram uniqueness ratio < threshold |
| stat_zero_width_obfuscation | > 0 | (no constant) | zero-width codepoints in original input (counted before canonicalization) |
Default shingle size is DEFAULT_SHINGLE_N = 3 (char 3-grams).
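The statistical tests in the table can be approximated as below. This is a sketch under stated assumptions: all function names are hypothetical, "symbol" is taken to mean any non-alphanumeric, non-whitespace character, the entropy test is assumed to fire at >= the threshold (the table does not state the comparison), and text too short to form a shingle is treated as fully unique.

```rust
use std::collections::{HashMap, HashSet};

/// Fraction of non-whitespace characters that are not alphanumeric.
fn punctuation_ratio(text: &str) -> f32 {
    let non_ws: Vec<char> = text.chars().filter(|c| !c.is_whitespace()).collect();
    if non_ws.is_empty() {
        return 0.0;
    }
    let symbols = non_ws.iter().filter(|c| !c.is_alphanumeric()).count();
    symbols as f32 / non_ws.len() as f32
}

/// Shannon entropy in bits per character over non-whitespace ASCII.
fn char_entropy(text: &str) -> f32 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars().filter(|c| c.is_ascii() && !c.is_whitespace()) {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p: f32 = n as f32 / total as f32;
            -p * p.log2()
        })
        .sum()
}

/// Longest contiguous run of non-alphanumeric, non-whitespace characters.
fn longest_symbol_run(text: &str) -> usize {
    let mut best = 0usize;
    let mut run = 0usize;
    for c in text.chars() {
        if !c.is_alphanumeric() && !c.is_whitespace() {
            run += 1;
            best = best.max(run);
        } else {
            run = 0;
        }
    }
    best
}

/// Unique char n-grams divided by total char n-grams.
fn shingle_uniqueness(text: &str, n: usize) -> f32 {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return 1.0; // assumption: no shingles means nothing repeats
    }
    let total = chars.len() - n + 1;
    let unique: HashSet<&[char]> = chars.windows(n).collect();
    unique.len() as f32 / total as f32
}

/// Layer score: 0.2 per fired signal, using the default thresholds.
/// `zero_width_count` is counted on the original, pre-canonicalization input.
fn statistical_score(text: &str, zero_width_count: usize) -> f32 {
    let fired = [
        punctuation_ratio(text) >= 0.35,
        char_entropy(text) >= 4.8, // comparison direction is an assumption
        longest_symbol_run(text) >= 12,
        shingle_uniqueness(text, 3) < 0.35,
        zero_width_count > 0,
    ];
    let n = fired.iter().filter(|&&f| f).count();
    (n as f32 * 0.2).clamp(0.0, 1.0)
}
```

A highly repetitive string such as ten copies of the same letter fires only the shingle-uniqueness signal, so the layer contributes 0.2 before weighting.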
ML layer (layer 3)
A small linear model with sigmoid activation. Inputs are 0/1 feature flags from layers 1 and 2 plus the (continuous) punctuation ratio. Default weights from LinearModel::default():
bias: -2.0,
w_ignore_policy: 2.5,
w_dan: 2.0,
w_role_change: 1.5,
w_prompt_extraction: 2.2,
w_encoded: 1.0,
w_developer_mode: 2.0,
w_punct: 2.0,
w_symbol_run: 1.5,
w_low_shingle_uniqueness: 1.2,
w_zero_width: 1.0,
Output is sigmoid(bias + sum(w_i * x_i)) clamped to [0.0, 1.0]. With the default bias of -2.0, input that fires no signals scores about 0.12 (sigmoid(-2.0) ≈ 0.119). A multi-flag attack saturates the output toward 1.0.
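The scorer described above is small enough to write out in full. The sketch below hard-codes the default weights; the Features struct and its field names are hypothetical stand-ins, and it assumes w_punct multiplies the continuous punctuation ratio while every other weight multiplies a 0/1 flag.

```rust
/// Inputs to the linear scorer. Field names are hypothetical; they mirror
/// the ten weights listed above.
#[derive(Default)]
struct Features {
    ignore_policy: bool,
    dan: bool,
    role_change: bool,
    prompt_extraction: bool,
    encoded: bool,
    developer_mode: bool,
    punct_ratio: f32, // continuous input, assumed to be multiplied by w_punct
    symbol_run: bool,
    low_shingle_uniqueness: bool,
    zero_width: bool,
}

/// Map a boolean flag to the 0/1 input the model expects.
fn flag(b: bool) -> f32 {
    if b { 1.0 } else { 0.0 }
}

/// sigmoid(bias + sum(w_i * x_i)) with the default weights, clamped to [0, 1].
fn ml_score(x: &Features) -> f32 {
    let z: f32 = -2.0
        + 2.5 * flag(x.ignore_policy)
        + 2.0 * flag(x.dan)
        + 1.5 * flag(x.role_change)
        + 2.2 * flag(x.prompt_extraction)
        + 1.0 * flag(x.encoded)
        + 2.0 * flag(x.developer_mode)
        + 2.0 * x.punct_ratio
        + 1.5 * flag(x.symbol_run)
        + 1.2 * flag(x.low_shingle_uniqueness)
        + 1.0 * flag(x.zero_width);
    let sigmoid: f32 = 1.0 / (1.0 + (-z).exp());
    sigmoid.clamp(0.0, 1.0)
}
```

With no inputs set the pre-activation is just the bias, giving roughly 0.12. Setting only the prompt-extraction flag moves it to 0.2 and the score to about 0.55, and three strong flags together (ignore-policy, DAN, developer-mode) push it to about 0.99.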
Layer blend
let h_div = weights.heuristic_divisor.max(f32::EPSILON);
let h_clamped = (heuristic_score / h_div).clamp(0.0, 1.0);
let s_clamped = statistical_score.clamp(0.0, 1.0);
let score = (h_clamped * weights.heuristic
    + s_clamped * weights.statistical
    + ml_score * weights.ml)
    .clamp(0.0, 1.0);
With defaults, a single dominant heuristic at weight 0.95 contributes 0.95 * 0.70 = 0.665 on its own. Clearing the 0.75 deny threshold takes another 0.085, which the ML layer supplies once its score reaches 0.425 (0.425 * 0.20 = 0.085). Multi-pattern attacks reach saturation cleanly.
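To check the arithmetic, the blend can be wrapped in a function and fed concrete layer scores. The LayerWeights struct here is a local mirror built from the defaults table, not the crate's type, and the function name blend is hypothetical.

```rust
/// Local mirror of the blend weights, using the defaults from the table above.
struct LayerWeights {
    heuristic: f32,
    statistical: f32,
    ml: f32,
    heuristic_divisor: f32,
}

impl Default for LayerWeights {
    fn default() -> Self {
        Self { heuristic: 0.70, statistical: 0.10, ml: 0.20, heuristic_divisor: 1.0 }
    }
}

/// Blend the three layer scores the same way as the snippet above.
fn blend(heuristic_score: f32, statistical_score: f32, ml_score: f32, weights: &LayerWeights) -> f32 {
    // Guard against a zero divisor, then normalize and clamp each layer.
    let h_div = weights.heuristic_divisor.max(f32::EPSILON);
    let h_clamped = (heuristic_score / h_div).clamp(0.0, 1.0);
    let s_clamped = statistical_score.clamp(0.0, 1.0);
    (h_clamped * weights.heuristic + s_clamped * weights.statistical + ml_score * weights.ml)
        .clamp(0.0, 1.0)
}
```

With a heuristic score of 0.95 and no statistical signals, an ML score of 0.55 blends to 0.775 and clears the default threshold, while an ML score of 0.12 blends to 0.689 and does not. The 0.55 figure is roughly what the default linear model yields when only the prompt-extraction flag is set (sigmoid(-2.0 + 2.2)).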
Fingerprint dedup
The guard hashes the canonicalized text with SHA-256 and keeps the first 8 bytes (16 hex chars) as a fingerprint. A bounded LRU maps each fingerprint to its prior verdict. On a hit, if the prior verdict was deny, the guard returns deny without re-running detection. Capacity defaults to 1024; a configured capacity of 0 is rounded up internally to NonZeroUsize::MIN (1).
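The fingerprint shape and the capacity rounding can be illustrated without pulling in a hashing crate. In this sketch the 32-byte digest is taken as an input (computing SHA-256 is left to whichever implementation is in use), a plain HashMap stands in for the bounded LRU, and a cached value of true is assumed to mean deny. All three function names are hypothetical.

```rust
use std::collections::HashMap;
use std::num::NonZeroUsize;

/// Hex-encode the first 8 bytes of a 32-byte SHA-256 digest (16 hex chars).
fn fingerprint(digest: &[u8; 32]) -> String {
    digest[..8].iter().map(|b| format!("{:02x}", b)).collect()
}

/// A configured capacity of 0 is rounded up to the smallest non-zero value.
fn effective_capacity(configured: usize) -> NonZeroUsize {
    NonZeroUsize::new(configured).unwrap_or(NonZeroUsize::MIN)
}

/// True only when the fingerprint is cached with a deny verdict.
/// Assumption: `true` in the cache means the prior verdict was deny.
fn cached_deny(cache: &HashMap<String, bool>, fp: &str) -> bool {
    cache.get(fp).copied().unwrap_or(false)
}
```

Truncating to 8 bytes keeps keys short at the cost of a small collision probability, which is acceptable for a cache whose only shortcut is repeating a deny.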
Failure modes
- Empty / whitespace-only text :: allow.
- Heuristic regex compile failure :: tracing::error! and the signal is dropped from the pattern table for the process lifetime. Detection continues without it.
- Mutex poisoning :: deny (Verdict::Deny returned directly from evaluate_text).
- The detector port preserves the upstream signal IDs (jb_ignore_policy, etc.) intentionally so log-analysis tooling that knows the upstream taxonomy continues to work.
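The fail-closed handling of a poisoned mutex listed above follows a standard pattern with std::sync::Mutex. The sketch below is illustrative only: the Verdict enum is a local stand-in (only Verdict::Deny is named in the source; Allow is assumed), and evaluate_with_cache with its Vec-backed cache is hypothetical.

```rust
use std::sync::Mutex;

/// Local stand-in for the guard verdict type.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Deny,
}

/// Hypothetical evaluation step: if the cache lock is poisoned because
/// another thread panicked while holding it, deny instead of proceeding.
fn evaluate_with_cache(cache: &Mutex<Vec<String>>, text: &str) -> Verdict {
    let mut guard = match cache.lock() {
        Ok(g) => g,
        Err(_poisoned) => return Verdict::Deny, // fail closed
    };
    // Stand-in for real detection plus cache bookkeeping.
    guard.push(text.to_string());
    Verdict::Allow
}
```

Returning deny on a poisoned lock trades availability for safety: after a panic the cache state is unknown, so the guard refuses rather than risk acting on it.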
PromptInjectionGuard
Source: crates/chio-guards/src/prompt_injection.rs. Guard name: prompt-injection. Six regex signals over canonicalized input. Score is the sum of fired signal weights; deny when total >= threshold.
Struct
#[derive(Clone, Debug)]
pub struct PromptInjectionConfig {
    pub score_threshold: f32,
    pub max_scan_bytes: usize,
    pub fingerprint_capacity: usize,
}
pub struct PromptInjectionGuard {
    config: PromptInjectionConfig,
    patterns: Patterns,
    dedup: Mutex<LruCache<String, bool>>,
}
Defaults
| Knob | Default | Constant |
|---|---|---|
| score_threshold | 0.8 | DEFAULT_SCORE_THRESHOLD |
| max_scan_bytes | 65536 (64 KiB) | DEFAULT_MAX_SCAN_BYTES |
| fingerprint_capacity | 1024 | DEFAULT_FINGERPRINT_CAPACITY |
Signals
| Variant | Stable ID | Default weight | Catches |
|---|---|---|---|
| InstructionOverride | instruction_override | 0.9 | "ignore previous instructions"-style override. |
| RoleInjection | role_injection | 0.4 | "you are now", <\|assistant\|>. |
| DelimiterInjection | delimiter_injection | 0.3 | System-role delimiter tokens ([system], [/INST], etc.). |
| OutputHijack | output_hijack | 0.3 | "respond with exactly", "output only". |
| ToolChainHijack | tool_chain_hijack | 0.3 | "call tool X with", "use function X to". |
| ExfiltrationFraming | exfiltration_framing | 0.5 | URLs / email / POST language near data tokens. |
At the default 0.8 threshold, InstructionOverride alone clears the bar. Other signals require corroboration; e.g., role injection plus exfiltration framing reaches 0.9.
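The additive scoring rule is easy to check with the default weights. The sketch below uses the variant names from the table; the weight, total_score, and should_deny helpers are hypothetical, and it assumes each signal counts at most once per input.

```rust
/// The six signals from the table above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Signal {
    InstructionOverride,
    RoleInjection,
    DelimiterInjection,
    OutputHijack,
    ToolChainHijack,
    ExfiltrationFraming,
}

impl Signal {
    /// Default weight for each signal.
    fn weight(self) -> f32 {
        match self {
            Signal::InstructionOverride => 0.9,
            Signal::RoleInjection => 0.4,
            Signal::DelimiterInjection => 0.3,
            Signal::OutputHijack => 0.3,
            Signal::ToolChainHijack => 0.3,
            Signal::ExfiltrationFraming => 0.5,
        }
    }
}

/// Sum of the weights of all fired signals.
fn total_score(fired: &[Signal]) -> f32 {
    fired.iter().map(|s| s.weight()).sum()
}

/// Deny when the total reaches the threshold.
fn should_deny(fired: &[Signal], threshold: f32) -> bool {
    total_score(fired) >= threshold
}
```

Under these weights a delimiter token plus an output-hijack phrase totals 0.6 and is allowed at the default threshold, so the low-weight signals only matter in combinations of three or alongside a stronger signal.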
Failure modes
- Empty input :: allow.
- Mutex poisoning :: Verdict::Deny (fail-closed).
- Unrecognized ToolAction variant :: allow (the guard does not apply).
Text candidate extraction
Both guards share an extraction routine:
- Pull strings from the action variant: code body for CodeExecution, query for DatabaseQuery, endpoint for ExternalApiCall.
- Recurse into arguments JSON and append every string leaf.
- Drop empty / whitespace-only strings.
Each candidate runs through the dedup cache, then through detection. First deny short-circuits the candidate loop.
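The three extraction steps above can be sketched as follows. To stay self-contained, the example defines a minimal Json enum in place of a real JSON value type, and a ToolAction enum whose variant names come from the text but whose field names (code, query, endpoint) are assumptions. The function names are hypothetical as well.

```rust
/// Minimal JSON value, standing in for a real JSON library type.
enum Json {
    Null,
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Action variants named in the text; field names are assumptions.
enum ToolAction {
    CodeExecution { code: String },
    DatabaseQuery { query: String },
    ExternalApiCall { endpoint: String },
}

/// Depth-first walk that appends every string leaf (values only, not keys).
fn collect_string_leaves(value: &Json, out: &mut Vec<String>) {
    match value {
        Json::String(s) => out.push(s.clone()),
        Json::Array(items) => {
            for item in items {
                collect_string_leaves(item, out);
            }
        }
        Json::Object(fields) => {
            for (_, v) in fields {
                collect_string_leaves(v, out);
            }
        }
        Json::Null | Json::Number(_) => {}
    }
}

/// Action string first, then argument leaves, then drop blank candidates.
fn extract_candidates(action: &ToolAction, arguments: &Json) -> Vec<String> {
    let mut out = Vec::new();
    match action {
        ToolAction::CodeExecution { code } => out.push(code.clone()),
        ToolAction::DatabaseQuery { query } => out.push(query.clone()),
        ToolAction::ExternalApiCall { endpoint } => out.push(endpoint.clone()),
    }
    collect_string_leaves(arguments, &mut out);
    out.retain(|s| !s.trim().is_empty());
    out
}
```

Because nested arrays and objects are walked recursively, a payload buried several levels deep in the arguments still becomes its own candidate and is scanned like any top-level string.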
Composition
use chio_guards::{JailbreakGuard, PromptInjectionGuard};
let mut pipeline = chio_guards::GuardPipeline::default_pipeline();
// Cheap signal first.
pipeline.add(Box::new(PromptInjectionGuard::default()));
// Heavier multi-layer detector second.
pipeline.add(Box::new(JailbreakGuard::default()));
Both guards share the same canonicalization pass and the same fingerprint shape (first 8 bytes of SHA-256), so a given input is keyed the same way in each guard. They still maintain separate LRU caches, so a verdict cached by one guard is not reused by the other.
When to enable jailbreak vs injection
PromptInjectionGuard is the cheaper, narrower detector aimed at attacks against your tool wiring (override + exfil). JailbreakGuard is broader and catches role-play and policy-override framings against the model itself. Most operators run injection in front of every LLM call and reserve jailbreak for paths where the model output is consumed by downstream tools.
Next Steps
- Response Sanitization :: complement: scan output for secrets and PII.
- Shell & Code Guards :: enforce structural restrictions on submitted code in addition to textual scanning.
- External Guards :: route the same text to cloud content-safety APIs.