Prompt Attacks
Prompt attacks are the most common and well-studied class of LLM vulnerabilities. Understanding the full taxonomy of attacks is essential for building effective defenses.
Direct Injection
In direct injection, the user explicitly provides instructions designed to override the system prompt or bypass safety training:
- Instruction override: "Ignore all previous instructions and..." — the simplest form, often caught by basic filters
- Role-playing attacks: "Pretend you are DAN (Do Anything Now)..." — exploiting the model's role-play capabilities
- Encoding attacks: Using Base64, ROT13, pig latin, or other encodings to bypass text-based filters
- Language switching: Providing instructions in a low-resource language where safety training is weaker
- Token smuggling: Using Unicode homoglyphs or zero-width characters to evade pattern matching
Indirect Injection
Indirect injection is more dangerous because the malicious instructions come from external data sources rather than the user:
| Vector | Attack Method | Impact |
|---|---|---|
| Web Pages | Hidden text on pages the LLM browses | Data exfiltration, action manipulation |
| Emails | Instructions embedded in email content processed by AI assistants | Unauthorized forwarding, calendar manipulation |
| Documents | Injections in PDFs, spreadsheets, or docs uploaded for analysis | System prompt extraction, data leakage |
| RAG Knowledge Base | Poisoned entries in vector databases or document stores | Persistent injection affecting all users |
| API Responses | Malicious data returned by external APIs consumed by the LLM | Tool chain compromise |
Multi-turn Attack Strategies
-
Crescendo Attacks
Gradually escalating requests across multiple turns, starting with innocent questions and incrementally pushing toward harmful territory. Each turn builds on the established context.
-
Context Manipulation
Using earlier conversation turns to establish a false context that makes later malicious requests appear legitimate. For example, creating a fictional "security audit" scenario.
-
Payload Splitting
Splitting a malicious payload across multiple messages so that no single message triggers safety filters, but the combined context achieves the attack goal.
-
Many-shot Jailbreaking
Providing many examples of the desired (harmful) behavior in the prompt, exploiting in-context learning to override safety training through sheer volume of examples.
Input and Output Security
# Defense-in-depth for prompt attacks
class LLMSecurityPipeline:
def process_request(self, user_input):
# Layer 1: Input validation
if self.detect_injection(user_input):
return self.blocked_response()
# Layer 2: System prompt hardening
messages = self.build_hardened_prompt(user_input)
# Layer 3: Model inference
response = self.model.generate(messages)
# Layer 4: Output validation
if self.detect_unsafe_output(response):
return self.safe_fallback()
# Layer 5: Output sanitization
return self.sanitize_output(response)
Lilly Tech Systems