Intermediate

Prompt Attacks

Prompt attacks are the most common and well-studied class of LLM vulnerabilities. Understanding the full taxonomy of attacks is essential for building effective defenses.

Direct Injection

In direct injection, the user explicitly provides instructions designed to override the system prompt or bypass safety training:

  • Instruction override: "Ignore all previous instructions and..." — the simplest form, often caught by basic filters
  • Role-playing attacks: "Pretend you are DAN (Do Anything Now)..." — exploiting the model's role-play capabilities
  • Encoding attacks: Using Base64, ROT13, pig latin, or other encodings to bypass text-based filters
  • Language switching: Providing instructions in a low-resource language where safety training is weaker
  • Token smuggling: Using Unicode homoglyphs or zero-width characters to evade pattern matching

Indirect Injection

Indirect injection is more dangerous because the malicious instructions come from external data sources rather than the user:

Vector Attack Method Impact
Web Pages Hidden text on pages the LLM browses Data exfiltration, action manipulation
Emails Instructions embedded in email content processed by AI assistants Unauthorized forwarding, calendar manipulation
Documents Injections in PDFs, spreadsheets, or docs uploaded for analysis System prompt extraction, data leakage
RAG Knowledge Base Poisoned entries in vector databases or document stores Persistent injection affecting all users
API Responses Malicious data returned by external APIs consumed by the LLM Tool chain compromise

Multi-turn Attack Strategies

  1. Crescendo Attacks

    Gradually escalating requests across multiple turns, starting with innocent questions and incrementally pushing toward harmful territory. Each turn builds on the established context.

  2. Context Manipulation

    Using earlier conversation turns to establish a false context that makes later malicious requests appear legitimate. For example, creating a fictional "security audit" scenario.

  3. Payload Splitting

    Splitting a malicious payload across multiple messages so that no single message triggers safety filters, but the combined context achieves the attack goal.

  4. Many-shot Jailbreaking

    Providing many examples of the desired (harmful) behavior in the prompt, exploiting in-context learning to override safety training through sheer volume of examples.

Input and Output Security

# Defense-in-depth for prompt attacks
class LLMSecurityPipeline:
    def process_request(self, user_input):
        # Layer 1: Input validation
        if self.detect_injection(user_input):
            return self.blocked_response()

        # Layer 2: System prompt hardening
        messages = self.build_hardened_prompt(user_input)

        # Layer 3: Model inference
        response = self.model.generate(messages)

        # Layer 4: Output validation
        if self.detect_unsafe_output(response):
            return self.safe_fallback()

        # Layer 5: Output sanitization
        return self.sanitize_output(response)
💡
Looking Ahead: In the next lesson, we will explore data leakage — how LLMs can expose training data, PII, system prompts, and other sensitive information through carefully crafted queries.