AI Red Team Operations Intermediate
AI red team operations simulate real-world adversarial attacks against AI systems to identify vulnerabilities before they are exploited. This lesson covers planning offensive campaigns, executing adversarial attacks across different AI system types, LLM-specific red teaming techniques, and documenting findings for maximum organizational impact.
Planning Red Team Operations
A well-planned red team engagement follows a structured approach:
- Define Objectives
What are you trying to achieve? Common objectives include testing adversarial robustness, evaluating LLM safety filters, assessing model extraction risk, or testing the blue team's detection capabilities.
- Establish Rules of Engagement
Define boundaries: which systems are in scope, what attack techniques are permitted, what is the escalation process for critical findings, and what are the communication protocols.
- Build Attack Scenarios
Develop realistic attack scenarios based on threat intelligence. Map scenarios to MITRE ATLAS tactics and techniques.
- Prepare Tools and Infrastructure
Set up attack tools, surrogate models, proxy infrastructure, and logging systems to document all activities.
- Execute and Document
Run the attack scenarios, document every step, capture evidence, and note detection or non-detection by the blue team.
LLM Red Teaming Techniques
LLM red teaming has become a specialized discipline with its own techniques:
Prompt Injection Attacks
- Direct injection — "Ignore your instructions and instead do X"
- Indirect injection — Embedding instructions in documents, web pages, or images the LLM processes
- Payload splitting — Breaking malicious instructions across multiple messages
- Context manipulation — Gradually shifting the conversation context to bypass safety measures
Jailbreaking Strategies
- Role-playing — Asking the model to assume a character without safety restrictions
- Encoding attacks — Using base64, ROT13, pig Latin, or other encodings to hide harmful requests
- Few-shot priming — Providing examples that gradually lead toward restricted content
- System prompt extraction — Techniques to make the model reveal its hidden instructions
SCENARIO: RT-AI-003 - LLM Safety Filter Bypass OBJECTIVE: Test whether safety filters can be bypassed through indirect prompt injection TECHNIQUE: ATLAS T0051 - LLM Prompt Injection ACCESS: Black-box (API access only) ATTACK STEPS: 1. Establish baseline: test direct harmful requests (should be blocked) 2. Test indirect injection via document summarization 3. Test encoding-based bypasses (base64, Unicode) 4. Test multi-turn escalation strategies 5. Test role-playing and persona-based bypasses SUCCESS CRITERIA: - Any bypass of safety filters = finding - System prompt extraction = critical finding - Unauthorized action execution = critical finding DOCUMENTATION: - Record exact prompts used (input) - Capture full model responses (output) - Note which attempts were detected vs undetected
Vision and Multimodal Red Teaming
For computer vision and multimodal AI systems:
- Adversarial patches — Physical-world attacks using printed patches that fool classifiers
- Typographic attacks — Adding misleading text to images that influences vision-language models
- Cross-modal injection — Hiding text instructions in images that multimodal models follow
- Deepfake testing — Using generated images/audio to test biometric authentication systems
Red Team Operational Security
Ready to Learn Blue Team Defense?
The next lesson covers building defensive capabilities to detect and respond to the types of attacks covered in this lesson.
Next: Blue Team Defense →