AI Red Team Operations Intermediate

AI red team operations simulate real-world adversarial attacks against AI systems to identify vulnerabilities before they are exploited. This lesson covers planning offensive campaigns, executing adversarial attacks across different AI system types, LLM-specific red teaming techniques, and documenting findings for maximum organizational impact.

Planning Red Team Operations

A well-planned red team engagement follows a structured approach:

Define Objectives
What are you trying to achieve? Common objectives include testing adversarial robustness, evaluating LLM safety filters, assessing model extraction risk, or testing the blue team's detection capabilities.
Establish Rules of Engagement
Define boundaries: which systems are in scope, what attack techniques are permitted, what is the escalation process for critical findings, and what are the communication protocols.
Build Attack Scenarios
Develop realistic attack scenarios based on threat intelligence. Map scenarios to MITRE ATLAS tactics and techniques.
Prepare Tools and Infrastructure
Set up attack tools, surrogate models, proxy infrastructure, and logging systems to document all activities.
Execute and Document
Run the attack scenarios, document every step, capture evidence, and note detection or non-detection by the blue team.

LLM Red Teaming Techniques

LLM red teaming has become a specialized discipline with its own techniques:

Prompt Injection Attacks

Direct injection — "Ignore your instructions and instead do X"
Indirect injection — Embedding instructions in documents, web pages, or images the LLM processes
Payload splitting — Breaking malicious instructions across multiple messages
Context manipulation — Gradually shifting the conversation context to bypass safety measures

Jailbreaking Strategies

Role-playing — Asking the model to assume a character without safety restrictions
Encoding attacks — Using base64, ROT13, pig Latin, or other encodings to hide harmful requests
Few-shot priming — Providing examples that gradually lead toward restricted content
System prompt extraction — Techniques to make the model reveal its hidden instructions

Red Team Scenario Template

SCENARIO:     RT-AI-003 - LLM Safety Filter Bypass
OBJECTIVE:    Test whether safety filters can be bypassed
              through indirect prompt injection
TECHNIQUE:   ATLAS T0051 - LLM Prompt Injection
ACCESS:       Black-box (API access only)

ATTACK STEPS:
  1. Establish baseline: test direct harmful requests (should be blocked)
  2. Test indirect injection via document summarization
  3. Test encoding-based bypasses (base64, Unicode)
  4. Test multi-turn escalation strategies
  5. Test role-playing and persona-based bypasses

SUCCESS CRITERIA:
  - Any bypass of safety filters = finding
  - System prompt extraction = critical finding
  - Unauthorized action execution = critical finding

DOCUMENTATION:
  - Record exact prompts used (input)
  - Capture full model responses (output)
  - Note which attempts were detected vs undetected

Vision and Multimodal Red Teaming

For computer vision and multimodal AI systems:

Adversarial patches — Physical-world attacks using printed patches that fool classifiers
Typographic attacks — Adding misleading text to images that influences vision-language models
Cross-modal injection — Hiding text instructions in images that multimodal models follow
Deepfake testing — Using generated images/audio to test biometric authentication systems

Red Team Operational Security

OPSEC for Red Teams: Document all attack activities with timestamps. Use isolated infrastructure for testing. Never use production credentials or data outside the approved scope. Immediately report any unintended impact on production systems. Securely destroy all attack artifacts after the engagement.

Ready to Learn Blue Team Defense?

The next lesson covers building defensive capabilities to detect and respond to the types of attacks covered in this lesson.

Next: Blue Team Defense →

← Introduction Blue Team Defense →