Advanced Direct Injection

Modern direct injection attacks use sophisticated techniques to bypass safety training, content filters, and system prompt instructions. Understanding these techniques is essential for building effective defenses.

Encoding and Obfuscation Attacks

Attackers use various encoding schemes to disguise malicious instructions so that they slip past text-based filters while remaining intelligible to the LLM:

| Technique | How It Works | Defense |
| --- | --- | --- |
| Base64 Encoding | Encode malicious prompt in Base64, ask model to decode and follow | Detect and block Base64 patterns in input |
| Character Substitution | Use Unicode homoglyphs (Cyrillic 'а' for Latin 'a') to bypass keyword filters | Unicode normalization before filtering |
| Leetspeak / Pig Latin | Transform text to bypass word-level filters while remaining comprehensible to the LLM | Semantic-level analysis rather than keyword matching |
| Token Boundary Exploitation | Insert zero-width characters to split tokens and bypass tokenizer-level filters | Strip zero-width and control characters |
| Markdown/Code Abuse | Embed instructions in code blocks, markdown formatting, or HTML comments | Parse and analyze structured content separately |
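As a concrete illustration of the Base64 row, a minimal detector might locate Base64-looking runs, decode them, and flag any that decode to readable text. This is a sketch: the regex, the 16-character cutoff, and the printability check are illustrative choices, not a standard detection rule.

```python
import base64
import re

# Runs of Base64 alphabet characters long enough to carry a payload
# (the 16-char minimum is an arbitrary illustrative threshold)
B64_RUN = re.compile(r'[A-Za-z0-9+/]{16,}={0,2}')

def find_base64_payloads(text: str) -> list[str]:
    """Return candidate hidden instructions decoded from Base64 runs."""
    payloads = []
    for match in B64_RUN.finditer(text):
        try:
            decoded = base64.b64decode(match.group(), validate=True)
        except Exception:
            continue  # not valid Base64 (e.g. wrong length)
        # Keep only decodes that look like natural-language text
        if decoded.isascii() and decoded.decode().isprintable():
            payloads.append(decoded.decode())
    return payloads
```

A filter built this way would then either block the input or re-run its semantic checks on the decoded text, closing the "decode and follow" loophole.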
Adversarial Suffixes: Research by Zou et al. demonstrated that automatically generated adversarial suffixes — seemingly random strings of characters — can reliably jailbreak aligned language models. These suffixes are transferable across different models, making them especially dangerous.

Advanced Jailbreaking Techniques

  1. Many-shot Jailbreaking

    Providing dozens or hundreds of examples of the desired (harmful) output in the prompt context, exploiting in-context learning to override safety training. The sheer volume of examples overwhelms alignment.

  2. Persona-based Attacks

    Creating elaborate fictional scenarios where the model adopts a persona that is not bound by safety guidelines. Advanced versions use nested personas or "model simulation" framings.

  3. Logic Exploitation

    Framing harmful requests as logical puzzles, hypotheticals, or academic exercises that bypass content-level refusal while eliciting equivalent information.

  4. Crescendo Attacks

    Gradually escalating requests across many turns, starting with completely benign queries and slowly shifting toward harmful territory. Each step is small enough not to trigger a refusal.

Multi-modal Injection

As LLMs gain the ability to process images, audio, and video, new injection surfaces emerge:

# Multi-modal attack vectors:

# 1. Text-in-image injection
# Embed instructions as text within an image that the
# vision model reads and follows as instructions

# 2. Steganographic injection
# Hide instructions in image pixel data that is invisible
# to humans but detected by the vision encoder

# 3. Audio injection
# Embed ultrasonic or whispered instructions in audio
# files processed by speech-to-text pipelines

# 4. Structured data injection
# Hide instructions in JSON metadata, CSV fields, or
# XML attributes that are processed by the model
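The structured-data case (vector 4) can be sketched as a pre-processing pass: before parsed JSON is placed in the model's context, walk its string fields and flag instruction-like phrasing. The regex here is a toy stand-in for the kind of semantic classifier a production system would use.

```python
import json
import re

# Toy pattern for imperative, instruction-like phrasing; illustrative only,
# a real system would use a trained classifier rather than a regex.
SUSPECT = re.compile(
    r'\b(ignore|disregard|override)\b.*\b(instructions?|rules?|prompt)\b',
    re.IGNORECASE,
)

def scan_json_fields(payload: str) -> list[str]:
    """Return string values in a JSON document that look like injected instructions."""
    flagged = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str) and SUSPECT.search(node):
            flagged.append(node)

    walk(json.loads(payload))
    return flagged
```

The same walk-and-flag structure applies to CSV fields and XML attributes; only the parsing step changes.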

Defending Against Advanced Direct Injection

Input Normalization

Normalize Unicode, strip control characters, decode encoded content, and standardize formatting before any security analysis is performed.
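A minimal normalization pass, assuming Python's standard `unicodedata` module, might look like the following. Note that NFKC folds compatibility forms such as fullwidth letters and ligatures, but cross-script homoglyphs (e.g. Cyrillic 'а' for Latin 'a') need a separate confusables mapping, which is omitted here.

```python
import unicodedata

# Common zero-width characters used to split tokens; all fall in the
# Unicode "Cf" (format) category, listed explicitly for clarity.
ZERO_WIDTH = {'\u200b', '\u200c', '\u200d', '\u2060', '\ufeff'}

def normalize_input(text: str) -> str:
    """Canonicalize input before any filtering or classification runs."""
    # Fold compatibility forms (fullwidth letters, ligatures) to canonical ones
    text = unicodedata.normalize('NFKC', text)
    # Drop zero-width/format characters and non-printing control codes,
    # keeping ordinary newlines and tabs
    return ''.join(
        ch for ch in text
        if ch not in ZERO_WIDTH
        and unicodedata.category(ch) != 'Cf'
        and (ch.isprintable() or ch in '\n\t')
    )
```

Running this before any filter ensures that later stages analyze the text the model will actually see, not an obfuscated surface form.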

Semantic Analysis

Use ML classifiers that understand the semantic intent of input rather than relying on keyword matching. Train on diverse injection datasets.

Multi-turn Context Tracking

Analyze the full conversation trajectory, not just individual messages. Detect gradual escalation patterns and context manipulation across turns.
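One way to sketch escalation detection: score each message with an upstream risk classifier (assumed to exist; any scorer producing values in [0, 1] works) and watch the trend rather than individual scores, since a crescendo attack keeps every single step below the refusal threshold. The window size and drift threshold below are illustrative.

```python
from collections import deque

def detect_escalation(risk_scores, window=5, threshold=0.15):
    """Flag a conversation whose per-turn risk drifts steadily upward.

    risk_scores: per-message risk values in [0, 1] from an upstream
    classifier. Returns True as soon as the average rise per turn
    across a sliding window exceeds the threshold.
    """
    recent = deque(maxlen=window)
    for score in risk_scores:
        recent.append(score)
        if len(recent) == window:
            # Average rise per turn across the window
            drift = (recent[-1] - recent[0]) / (window - 1)
            if drift > threshold:
                return True
    return False
```

Because the signal is the slope rather than any single score, this catches conversations whose individual messages would each pass a per-message filter.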

Cross-modal Scanning

For multi-modal models, scan images for embedded text, analyze audio transcriptions, and parse structured data before including any of it in the model's context.

💡 Looking Ahead: In the next lesson, we will explore advanced indirect injection: sophisticated attacks that arrive through external data sources and are far harder to defend against.