
Jailbreak & Injection Guards

Two opt-in guards scan free-form text for adversarial framing. JailbreakGuard wraps a three-layer detector (heuristic regex, statistical features, linear ML scorer) and denies when the blended score reaches a threshold. PromptInjectionGuard runs a smaller six-signal regex set over the same canonicalized input. Each guard keeps its own LRU cache keyed by a SHA-256 fingerprint so that retried payloads can short-circuit. Neither guard is in the default pipeline.

Not in the default pipeline

These guards are opt-in by design. GuardPipeline::default_pipeline() does not include them. Reasons: regex and feature-extraction overhead on cache-miss, the false-positive risk of LLM-shaped detectors, and the volatility of the heuristic table as attacker techniques shift. Register them via kernel.add_guard(Box::new(JailbreakGuard::default())) when you accept those tradeoffs.

Figure: Three-layer jailbreak detector. Heuristic regex signals dominate (layer weight 0.70); statistical features (0.10) and the ML linear model (0.20) contribute smaller weights. Each layer is clamped to [0, 1], the final score is the clamped weighted sum of the three layers, and the default deny threshold is 0.75.

JailbreakGuard

Source: crates/chio-guards/src/jailbreak.rs (guard wrapper) and jailbreak_detector.rs (detector engine). Guard name: jailbreak.

Struct

crates/chio-guards/src/jailbreak.rs
#[derive(Clone, Debug)]
pub struct JailbreakGuardConfig {
    pub threshold: f32,
    pub layer_weights: LayerWeights,
    pub fingerprint_dedup_capacity: usize,
    pub detector: DetectorConfig,
}

pub struct JailbreakGuard {
    config: JailbreakGuardConfig,
    detector: JailbreakDetector,
    dedup: Mutex<LruCache<String, bool>>,
}

pub const DEFAULT_FINGERPRINT_CAPACITY: usize = 1024;

Defaults

| Knob | Type | Default | Source constant |
| --- | --- | --- | --- |
| threshold | f32 | 0.75 | DEFAULT_DENY_THRESHOLD |
| layer_weights.heuristic | f32 | 0.70 | LayerWeights::default() |
| layer_weights.statistical | f32 | 0.10 | LayerWeights::default() |
| layer_weights.ml | f32 | 0.20 | LayerWeights::default() |
| layer_weights.heuristic_divisor | f32 | 1.0 | LayerWeights::default() |
| fingerprint_dedup_capacity | usize | 1024 | DEFAULT_FINGERPRINT_CAPACITY |
| detector.max_scan_bytes | usize | 65536 (64 KiB) | DEFAULT_MAX_SCAN_BYTES |

Heuristic signals (layer 1)

Patterns operate over canonicalized text (lowercased ASCII, zero-widths stripped, homoglyphs folded, separator runs collapsed). Each pattern fires a stable signal ID:

| Signal ID | Category | Weight | Pattern intent |
| --- | --- | --- | --- |
| jb_ignore_policy | AuthorityConfusion | 0.9 | "ignore/disregard/bypass/override/disable" near "policy/rules/safety/guardrails/safeguards". |
| jb_dan_unfiltered | RolePlay | 0.9 | DAN, do-anything-now, evil-confidant, unfiltered, unrestricted, jailbreak. |
| jb_system_prompt_extraction | InstructionExtraction | 0.95 | "reveal/show/print/output/leak" near "system prompt", "developer instructions", "hidden instructions". |
| jb_role_change | RolePlay | 0.7 | "you are now", "act as", "pretend to be", "roleplay as", "from now on you are". |
| jb_encoded_payload | EncodingAttack | 0.6 | base64, rot13, url-encode, decode-this, decode-the-following. |
| jb_developer_mode | AuthorityConfusion | 0.8 | developer/debug/god/admin/sudo mode, enable developer/debug mode. |

Statistical signals (layer 2)

Each fires a stable ID under category AdversarialSuffix and contributes a fixed 0.2 to the layer score. With at most five signals, the layer score stays within [0.0, 1.0].

| Signal ID | Default threshold | Constant | Test |
| --- | --- | --- | --- |
| stat_punctuation_ratio_high | 0.35 | DEFAULT_PUNCT_RATIO_THRESHOLD | non-whitespace symbol fraction >= threshold |
| stat_char_entropy_high | 4.8 | DEFAULT_ENTROPY_THRESHOLD | Shannon bits/char of non-whitespace ASCII >= threshold |
| stat_long_symbol_run | 12 | DEFAULT_SYMBOL_RUN_MIN | contiguous run of non-alnum, non-whitespace chars of at least this length |
| stat_low_shingle_uniqueness | 0.35 | DEFAULT_SHINGLE_UNIQUENESS_THRESHOLD | char 3-gram uniqueness ratio < threshold |
| stat_zero_width_obfuscation | > 0 | (no constant) | zero-width codepoints in original input (counted before canonicalization) |

Default shingle size is DEFAULT_SHINGLE_N = 3 (char 3-grams).
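
The four text-derived tests in the table can be sketched with the standard library alone. This is a minimal illustration, not the code in jailbreak_detector.rs: the function names are hypothetical, "symbol" is assumed to mean any character that is neither alphanumeric nor whitespace, and the zero-width count is passed in because it is measured on the original input before canonicalization.

```rust
use std::collections::{HashMap, HashSet};

/// Fraction of non-whitespace chars that are not alphanumeric (assumed definition).
fn punctuation_ratio(text: &str) -> f32 {
    let mut total = 0usize;
    let mut symbols = 0usize;
    for c in text.chars().filter(|c| !c.is_whitespace()) {
        total += 1;
        if !c.is_alphanumeric() {
            symbols += 1;
        }
    }
    if total == 0 { 0.0 } else { symbols as f32 / total as f32 }
}

/// Shannon entropy in bits per character over non-whitespace ASCII.
fn char_entropy(text: &str) -> f32 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in text.chars().filter(|c| c.is_ascii() && !c.is_whitespace()) {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f32 / total as f32;
            -p * p.log2()
        })
        .sum()
}

/// Longest contiguous run of chars that are neither alphanumeric nor whitespace.
fn longest_symbol_run(text: &str) -> usize {
    let (mut best, mut current) = (0usize, 0usize);
    for c in text.chars() {
        if !c.is_alphanumeric() && !c.is_whitespace() {
            current += 1;
            best = best.max(current);
        } else {
            current = 0;
        }
    }
    best
}

/// Unique char n-grams divided by total char n-grams; 1.0 for text shorter than n.
fn shingle_uniqueness(text: &str, n: usize) -> f32 {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return 1.0;
    }
    let total = chars.len() - n + 1;
    let unique: HashSet<&[char]> = chars.windows(n).collect();
    unique.len() as f32 / total as f32
}

/// Layer score: a fixed 0.2 per fired signal, clamped to [0, 1].
fn statistical_score(text: &str, zero_width_count: usize) -> f32 {
    let fired = [
        punctuation_ratio(text) >= 0.35,
        char_entropy(text) >= 4.8,
        longest_symbol_run(text) >= 12,
        shingle_uniqueness(text, 3) < 0.35,
        zero_width_count > 0,
    ];
    let count = fired.iter().filter(|&&f| f).count();
    (count as f32 * 0.2).clamp(0.0, 1.0)
}

fn main() {
    let text = format!("describe this {}", "!".repeat(16));
    println!("statistical layer score: {}", statistical_score(&text, 0));
}
```

How the real detector treats very short inputs (fewer characters than the shingle size) is not stated in this page; returning 1.0 there is a choice made for the sketch.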

ML layer (layer 3)

A small linear model with sigmoid activation. Inputs are 0/1 feature flags from layers 1 and 2 plus the (continuous) punctuation ratio. Default weights from LinearModel::default():

crates/chio-guards/src/jailbreak_detector.rs
bias: -2.0,
w_ignore_policy: 2.5,
w_dan: 2.0,
w_role_change: 1.5,
w_prompt_extraction: 2.2,
w_encoded: 1.0,
w_developer_mode: 2.0,
w_punct: 2.0,
w_symbol_run: 1.5,
w_low_shingle_uniqueness: 1.2,
w_zero_width: 1.0,

Output is sigmoid(bias + sum(w_i * x_i)) clamped to [0.0, 1.0]. With default bias -2.0, benign input produces ML score around 0.12. A multi-flag attack saturates the output toward 1.0.
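
To make the numbers concrete, the sketch below evaluates the formula with the default weights listed above. The Features struct and the ml_score function are hypothetical names introduced for illustration; only the bias and weight values come from this page.

```rust
/// Hypothetical feature vector: 0/1 flags plus the continuous punctuation ratio.
#[derive(Default)]
struct Features {
    ignore_policy: bool,
    dan: bool,
    role_change: bool,
    prompt_extraction: bool,
    encoded: bool,
    developer_mode: bool,
    punct_ratio: f32,
    symbol_run: bool,
    low_shingle_uniqueness: bool,
    zero_width: bool,
}

/// Converts a flag into the 0.0 / 1.0 input the linear model expects.
fn flag(b: bool) -> f32 {
    if b { 1.0 } else { 0.0 }
}

/// sigmoid(bias + sum(w_i * x_i)) using the default weights from this page.
fn ml_score(f: &Features) -> f32 {
    let z = -2.0
        + 2.5 * flag(f.ignore_policy)
        + 2.0 * flag(f.dan)
        + 1.5 * flag(f.role_change)
        + 2.2 * flag(f.prompt_extraction)
        + 1.0 * flag(f.encoded)
        + 2.0 * flag(f.developer_mode)
        + 2.0 * f.punct_ratio
        + 1.5 * flag(f.symbol_run)
        + 1.2 * flag(f.low_shingle_uniqueness)
        + 1.0 * flag(f.zero_width);
    (1.0 / (1.0 + (-z).exp())).clamp(0.0, 1.0)
}

fn main() {
    // No flags set and zero punctuation ratio: only the bias contributes.
    let benign = Features::default();
    println!("benign: {:.3}", ml_score(&benign));

    // Three strong flags push z to 4.5, close to saturation.
    let attack = Features {
        ignore_policy: true,
        dan: true,
        developer_mode: true,
        ..Features::default()
    };
    println!("multi-flag: {:.3}", ml_score(&attack));
}
```

With every input at zero the score is sigmoid(-2.0), which is where the "around 0.12" figure comes from.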

Layer blend

crates/chio-guards/src/jailbreak_detector.rs
let h_div = weights.heuristic_divisor.max(f32::EPSILON);
let h_clamped = (heuristic_score / h_div).clamp(0.0, 1.0);
let s_clamped = statistical_score.clamp(0.0, 1.0);
let score = (h_clamped * weights.heuristic
    + s_clamped * weights.statistical
    + ml_score * weights.ml)
    .clamp(0.0, 1.0);

With defaults, a single dominant heuristic at weight 0.95 contributes 0.95 * 0.70 = 0.665 on its own; layered with even a weak ML reinforcement the request clears the 0.75 deny threshold. Multi-pattern attacks reach saturation cleanly.
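
The single-signal case can be worked through numerically. The sketch below repeats the blend from the snippet above inside a hypothetical blend function and assumes that, when only jb_system_prompt_extraction fires, the heuristic layer score equals that signal's weight (0.95) and the ML input is just the bias plus w_prompt_extraction. How the detector aggregates several heuristic signals is not covered here.

```rust
/// Hypothetical mirror of the documented default layer weights.
struct LayerWeights {
    heuristic: f32,
    statistical: f32,
    ml: f32,
    heuristic_divisor: f32,
}

impl Default for LayerWeights {
    fn default() -> Self {
        Self { heuristic: 0.70, statistical: 0.10, ml: 0.20, heuristic_divisor: 1.0 }
    }
}

/// Logistic function used by the ML layer.
fn sigmoid(z: f32) -> f32 {
    1.0 / (1.0 + (-z).exp())
}

/// Clamped weighted sum of the three layer scores.
fn blend(heuristic_score: f32, statistical_score: f32, ml_score: f32, weights: &LayerWeights) -> f32 {
    let h_div = weights.heuristic_divisor.max(f32::EPSILON);
    let h_clamped = (heuristic_score / h_div).clamp(0.0, 1.0);
    let s_clamped = statistical_score.clamp(0.0, 1.0);
    (h_clamped * weights.heuristic + s_clamped * weights.statistical + ml_score * weights.ml)
        .clamp(0.0, 1.0)
}

fn main() {
    let weights = LayerWeights::default();

    // Only jb_system_prompt_extraction fired; no statistical signals.
    let ml = sigmoid(-2.0 + 2.2); // bias + w_prompt_extraction
    let score = blend(0.95, 0.0, ml, &weights);

    // 0.95 * 0.70 from the heuristic layer plus roughly 0.55 * 0.20 from the ML layer.
    println!("ml = {:.3}, blended = {:.3}, deny = {}", ml, score, score >= 0.75);
}
```

Under these assumptions the heuristic layer contributes 0.665 and the ML layer about 0.11, which is enough to reach the 0.75 threshold without any statistical signal.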

Fingerprint dedup

The guard hashes the canonicalized text with SHA-256 and uses the first 8 bytes of the digest (16 hex characters) as the key into a bounded LRU that stores the prior verdict. On a hit whose prior verdict was deny, the guard returns deny without re-running detection. Capacity defaults to 1024; a configured capacity of 0 is rounded up internally to NonZeroUsize::MIN (1).
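
The cache behaviour can be sketched without external crates. This is an illustration only: the guard itself uses an LruCache behind a Mutex, while VerdictCache and evaluate below are hypothetical stand-ins built on the standard library. The fingerprint is taken as an opaque string because SHA-256 is not part of std, and because this page does not say what happens on a cached allow, the sketch simply re-runs detection in that case.

```rust
use std::collections::{HashMap, VecDeque};
use std::num::NonZeroUsize;

/// Hypothetical bounded LRU mapping fingerprint -> "was denied".
struct VerdictCache {
    capacity: NonZeroUsize,
    verdicts: HashMap<String, bool>,
    order: VecDeque<String>, // front = least recently used
}

impl VerdictCache {
    fn new(capacity: usize) -> Self {
        // A capacity of 0 is rounded up to the smallest non-zero value.
        let capacity = NonZeroUsize::new(capacity).unwrap_or(NonZeroUsize::MIN);
        Self { capacity, verdicts: HashMap::new(), order: VecDeque::new() }
    }

    /// Moves `key` to the most-recently-used position.
    fn touch(&mut self, key: &str) {
        self.order.retain(|k| k.as_str() != key);
        self.order.push_back(key.to_owned());
    }

    /// Returns the cached verdict, if any, and refreshes its recency.
    fn get(&mut self, key: &str) -> Option<bool> {
        let hit = self.verdicts.get(key).copied();
        if hit.is_some() {
            self.touch(key);
        }
        hit
    }

    /// Stores a verdict and evicts least-recently-used entries over capacity.
    fn put(&mut self, key: &str, denied: bool) {
        self.verdicts.insert(key.to_owned(), denied);
        self.touch(key);
        while self.order.len() > self.capacity.get() {
            if let Some(oldest) = self.order.pop_front() {
                self.verdicts.remove(&oldest);
            }
        }
    }
}

/// Returns true for deny. A cached deny skips `detect` entirely.
fn evaluate(cache: &mut VerdictCache, fingerprint: &str, detect: impl FnOnce() -> bool) -> bool {
    if cache.get(fingerprint) == Some(true) {
        return true;
    }
    let denied = detect();
    cache.put(fingerprint, denied);
    denied
}

fn main() {
    let mut cache = VerdictCache::new(1024);
    let first = evaluate(&mut cache, "0123456789abcdef", || true);
    let retry = evaluate(&mut cache, "0123456789abcdef", || unreachable!("served from cache"));
    println!("first = {first}, retry = {retry}");
}
```

The VecDeque scan makes each touch linear in the cache size, which is acceptable for a sketch but is one reason a dedicated LRU structure is the better fit in practice.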

Failure modes

  • Empty / whitespace-only text :: allow.
  • Heuristic regex compile failure :: tracing::error! and the signal is dropped from the pattern table for the process lifetime. Detection continues without it.
  • Mutex poisoning :: deny (Verdict::Deny returned directly from evaluate_text).

Note: the detector port intentionally preserves the upstream signal IDs (jb_ignore_policy, etc.) so that log-analysis tooling built around the upstream taxonomy continues to work.

PromptInjectionGuard

Source: crates/chio-guards/src/prompt_injection.rs. Guard name: prompt-injection. Six regex signals over canonicalized input. Score is the sum of fired signal weights; deny when total >= threshold.

Struct

crates/chio-guards/src/prompt_injection.rs
#[derive(Clone, Debug)]
pub struct PromptInjectionConfig {
    pub score_threshold: f32,
    pub max_scan_bytes: usize,
    pub fingerprint_capacity: usize,
}

pub struct PromptInjectionGuard {
    config: PromptInjectionConfig,
    patterns: Patterns,
    dedup: Mutex<LruCache<String, bool>>,
}

Defaults

| Knob | Default | Constant |
| --- | --- | --- |
| score_threshold | 0.8 | DEFAULT_SCORE_THRESHOLD |
| max_scan_bytes | 65536 (64 KiB) | DEFAULT_MAX_SCAN_BYTES |
| fingerprint_capacity | 1024 | DEFAULT_FINGERPRINT_CAPACITY |

Signals

| Variant | Stable ID | Default weight | Catches |
| --- | --- | --- | --- |
| InstructionOverride | instruction_override | 0.9 | "ignore previous instructions"-style override. |
| RoleInjection | role_injection | 0.4 | "you are now", <\|assistant\|>. |
| DelimiterInjection | delimiter_injection | 0.3 | System-role delimiter tokens ([system], [/INST], etc.). |
| OutputHijack | output_hijack | 0.3 | "respond with exactly", "output only". |
| ToolChainHijack | tool_chain_hijack | 0.3 | "call tool X with", "use function X to". |
| ExfiltrationFraming | exfiltration_framing | 0.5 | URLs / email / POST language near data tokens. |

At the default 0.8 threshold, InstructionOverride alone clears the bar. Other signals require corroboration; e.g., role injection plus exfiltration framing reaches 0.9.
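
The additive scoring rule is small enough to sketch directly. The Signal enum and denies function below are hypothetical; the weights and the 0.8 threshold are the defaults from the tables above, and each signal is assumed to count at most once per input.

```rust
/// Hypothetical mirror of the six signal variants.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Signal {
    InstructionOverride,
    RoleInjection,
    DelimiterInjection,
    OutputHijack,
    ToolChainHijack,
    ExfiltrationFraming,
}

impl Signal {
    /// Default weight for each signal, as listed in the table.
    fn weight(self) -> f32 {
        match self {
            Signal::InstructionOverride => 0.9,
            Signal::RoleInjection => 0.4,
            Signal::DelimiterInjection => 0.3,
            Signal::OutputHijack => 0.3,
            Signal::ToolChainHijack => 0.3,
            Signal::ExfiltrationFraming => 0.5,
        }
    }
}

/// Deny when the summed weights of the fired signals reach the threshold.
fn denies(fired: &[Signal], threshold: f32) -> bool {
    let total: f32 = fired.iter().map(|s| s.weight()).sum();
    total >= threshold
}

fn main() {
    let threshold = 0.8;
    let cases: [&[Signal]; 3] = [
        &[Signal::InstructionOverride],
        &[Signal::RoleInjection, Signal::ExfiltrationFraming],
        &[Signal::RoleInjection, Signal::DelimiterInjection],
    ];
    for fired in cases {
        println!("{:?} -> deny = {}", fired, denies(fired, threshold));
    }
}
```

The third case sums to 0.7 and is allowed, which shows why the lower-weight signals only matter in combination.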

Failure modes

  • Empty input :: allow.
  • Mutex poisoning :: Verdict::Deny (fail-closed).
  • Unrecognized ToolAction variant :: allow (the guard does not apply).

Text candidate extraction

Both guards share an extraction routine:

  1. Pull strings from the action variant: code body for CodeExecution, query for DatabaseQuery, endpoint for ExternalApiCall.
  2. Recurse into arguments JSON and append every string leaf.
  3. Drop empty / whitespace-only strings.

Each candidate runs through the dedup cache, then through detection. First deny short-circuits the candidate loop.
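
The three extraction steps can be sketched as follows. Everything here is illustrative: the Json enum stands in for whatever JSON value type the crate uses, and the ToolAction field names (code, query, endpoint) are assumptions based on the description above rather than the real definitions.

```rust
/// Hypothetical minimal JSON value, standing in for the real arguments type.
#[allow(dead_code)]
enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Hypothetical subset of action variants; field names are assumed.
#[allow(dead_code)]
enum ToolAction {
    CodeExecution { code: String },
    DatabaseQuery { query: String },
    ExternalApiCall { endpoint: String },
    Other,
}

/// Step 2: depth-first walk that appends every string leaf.
fn collect_string_leaves(value: &Json, out: &mut Vec<String>) {
    match value {
        Json::String(s) => out.push(s.clone()),
        Json::Array(items) => {
            for item in items {
                collect_string_leaves(item, out);
            }
        }
        Json::Object(fields) => {
            for (_, field) in fields {
                collect_string_leaves(field, out);
            }
        }
        Json::Null | Json::Bool(_) | Json::Number(_) => {}
    }
}

/// Steps 1 to 3: variant text, then argument leaves, then drop blank strings.
fn text_candidates(action: &ToolAction, arguments: &Json) -> Vec<String> {
    let mut out = Vec::new();
    match action {
        ToolAction::CodeExecution { code } => out.push(code.clone()),
        ToolAction::DatabaseQuery { query } => out.push(query.clone()),
        ToolAction::ExternalApiCall { endpoint } => out.push(endpoint.clone()),
        ToolAction::Other => {}
    }
    collect_string_leaves(arguments, &mut out);
    out.retain(|s| !s.trim().is_empty());
    out
}

fn main() {
    let action = ToolAction::DatabaseQuery { query: "select 1".to_string() };
    let arguments = Json::Object(vec![
        ("note".to_string(), Json::String("ignore previous instructions".to_string())),
        ("limit".to_string(), Json::Number(10.0)),
    ]);
    for candidate in text_candidates(&action, &arguments) {
        println!("{candidate}");
    }
}
```

Object keys are not treated as candidates in this sketch; only string values are collected, matching the "string leaf" wording above.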


Composition

rust
use chio_guards::{JailbreakGuard, PromptInjectionGuard};

let mut pipeline = chio_guards::GuardPipeline::default_pipeline();

// Cheap signal first.
pipeline.add(Box::new(PromptInjectionGuard::default()));

// Heavier multi-layer detector second.
pipeline.add(Box::new(JailbreakGuard::default()));

Both guards use the same canonicalization pass and the same fingerprint shape (first 8 bytes of SHA-256), so a given input produces the same cache key in either guard. They still maintain separate LRU caches, so a verdict cached by one guard is not reused by the other.

When to enable jailbreak vs injection

PromptInjectionGuard is the cheaper, narrower detector aimed at attacks against your tool wiring (override + exfil). JailbreakGuard is broader and catches role-play and policy-override framings against the model itself. Most operators run injection in front of every LLM call and reserve jailbreak for paths where the model output is consumed by downstream tools.
