Intermediate

Prompt Attacks

Prompt attacks are the most common and well-studied class of LLM vulnerabilities. Understanding the full taxonomy of attacks is essential for building effective defenses.

Direct Injection

In direct injection, the user explicitly provides instructions designed to override the system prompt or bypass safety training:

Instruction override: "Ignore all previous instructions and..." — the simplest form, often caught by basic filters
Role-playing attacks: "Pretend you are DAN (Do Anything Now)..." — exploiting the model's role-play capabilities
Encoding attacks: Using Base64, ROT13, pig latin, or other encodings to bypass text-based filters
Language switching: Providing instructions in a low-resource language where safety training is weaker
Token smuggling: Using Unicode homoglyphs or zero-width characters to evade pattern matching

Indirect Injection

Indirect injection is more dangerous because the malicious instructions come from external data sources rather than the user:

Vector	Attack Method	Impact
Web Pages	Hidden text on pages the LLM browses	Data exfiltration, action manipulation
Emails	Instructions embedded in email content processed by AI assistants	Unauthorized forwarding, calendar manipulation
Documents	Injections in PDFs, spreadsheets, or docs uploaded for analysis	System prompt extraction, data leakage
RAG Knowledge Base	Poisoned entries in vector databases or document stores	Persistent injection affecting all users
API Responses	Malicious data returned by external APIs consumed by the LLM	Tool chain compromise

Multi-turn Attack Strategies

Crescendo Attacks

Gradually escalating requests across multiple turns, starting with innocent questions and incrementally pushing toward harmful territory. Each turn builds on the established context.
Context Manipulation

Using earlier conversation turns to establish a false context that makes later malicious requests appear legitimate. For example, creating a fictional "security audit" scenario.
Payload Splitting

Splitting a malicious payload across multiple messages so that no single message triggers safety filters, but the combined context achieves the attack goal.
Many-shot Jailbreaking

Providing many examples of the desired (harmful) behavior in the prompt, exploiting in-context learning to override safety training through sheer volume of examples.

Input and Output Security

# Defense-in-depth for prompt attacks
class LLMSecurityPipeline:
    def process_request(self, user_input):
        # Layer 1: Input validation
        if self.detect_injection(user_input):
            return self.blocked_response()

        # Layer 2: System prompt hardening
        messages = self.build_hardened_prompt(user_input)

        # Layer 3: Model inference
        response = self.model.generate(messages)

        # Layer 4: Output validation
        if self.detect_unsafe_output(response):
            return self.safe_fallback()

        # Layer 5: Output sanitization
        return self.sanitize_output(response)

💡

Looking Ahead: In the next lesson, we will explore data leakage — how LLMs can expose training data, PII, system prompts, and other sensitive information through carefully crafted queries.

← PreviousAttack Surface Next →Data Leakage

Prompt Attacks

Direct Injection

Indirect Injection

Multi-turn Attack Strategies

Crescendo Attacks

Context Manipulation

Payload Splitting

Many-shot Jailbreaking

Input and Output Security