Advanced Direct Injection
Modern direct injection attacks use sophisticated techniques to bypass safety training, content filters, and system prompt instructions. Understanding these techniques is essential for building effective defenses.
Encoding and Obfuscation Attacks
Attackers use various encoding schemes to disguise malicious instructions so they bypass text-based filters but are still understood by the LLM:
| Technique | How It Works | Defense |
|---|---|---|
| Base64 Encoding | Encode malicious prompt in Base64, ask model to decode and follow | Detect and block Base64 patterns in input |
| Character Substitution | Use Unicode homoglyphs (Cyrillic 'a' for Latin 'a') to bypass keyword filters | Unicode normalization before filtering |
| Leetspeak / Pig Latin | Transform text to bypass word-level filters while remaining comprehensible to LLM | Semantic-level analysis rather than keyword matching |
| Token Boundary Exploitation | Insert zero-width characters to split tokens and bypass tokenizer-level filters | Strip zero-width and control characters |
| Markdown/Code Abuse | Embed instructions in code blocks, markdown formatting, or HTML comments | Parse and analyze structured content separately |
Advanced Jailbreaking Techniques
-
Many-shot Jailbreaking
Providing dozens or hundreds of examples of the desired (harmful) output in the prompt context, exploiting in-context learning to override safety training. The sheer volume of examples overwhelms alignment.
-
Persona-based Attacks
Creating elaborate fictional scenarios where the model adopts a persona that is not bound by safety guidelines. Advanced versions use nested personas or "model simulation" framings.
-
Logic Exploitation
Framing harmful requests as logical puzzles, hypotheticals, or academic exercises that bypass content-level refusal while eliciting equivalent information.
-
Crescendo Attacks
Gradually escalating requests across many turns, starting with completely benign queries and slowly shifting toward harmful territory. Each step is small enough to not trigger refusal.
Multi-modal Injection
As LLMs gain the ability to process images, audio, and video, new injection surfaces emerge:
# Multi-modal attack vectors:
# 1. Text-in-image injection
# Embed instructions as text within an image that the
# vision model reads and follows as instructions
# 2. Steganographic injection
# Hide instructions in image pixel data that is invisible
# to humans but detected by the vision encoder
# 3. Audio injection
# Embed ultrasonic or whispered instructions in audio
# files processed by speech-to-text pipelines
# 4. Structured data injection
# Hide instructions in JSON metadata, CSV fields, or
# XML attributes that are processed by the model
Defending Against Advanced Direct Injection
Input Normalization
Normalize Unicode, strip control characters, decode encoded content, and standardize formatting before any security analysis is performed.
Semantic Analysis
Use ML classifiers that understand the semantic intent of input rather than relying on keyword matching. Train on diverse injection datasets.
Multi-turn Context Tracking
Analyze the full conversation trajectory, not just individual messages. Detect gradual escalation patterns and context manipulation across turns.
Cross-modal Scanning
For multi-modal models, scan images for embedded text, analyze audio transcriptions, and parse structured data before including in context.
Lilly Tech Systems