Advanced

Incident Response & Recovery

Despite all guardrails, incidents will happen. When an AI agent causes infrastructure damage, a structured response is critical. This lesson provides the playbook for containing, recovering from, and learning from AI agent incidents.

Immediate Response: The First 5 Minutes

When you realize an AI agent has caused damage, speed matters. Follow these steps in order:

  1. Stop the Agent Immediately

    Kill the agent process. Close the terminal. If using Claude Code, press Ctrl+C immediately. Don't let it continue executing commands.

  2. Revoke Agent Credentials

    Revoke or rotate the credentials the agent was using. This limits further damage in case a session is somehow still active.

  3. Assess What Happened

    Check the agent's command history, terminal output, and audit logs to understand exactly what commands were executed.

  4. Alert the Team

    Notify the infrastructure team, on-call engineer, and your manager. Don't try to fix everything alone.
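Before killing anything blindly, it helps to see exactly which agent processes are running. A minimal read-only check (the process names below are assumptions; substitute whatever tools your team actually runs):

```shell
# List candidate agent processes before killing them (pgrep is read-only).
# "claude" and "aider" are assumed process names; adjust the pattern.
pgrep -af "claude|aider" || echo "no agent processes found"
```

Once you have the PIDs, `kill -INT <pid>` sends the same interrupt as Ctrl+C.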

Bash - Emergency Credential Revocation
#!/bin/bash
# emergency-revoke.sh - Run immediately when agent incident detected

echo "🚨 EMERGENCY: Revoking agent credentials..."

# AWS: Deactivate the access key used by the agent
aws iam update-access-key \
  --user-name agent-user \
  --access-key-id AKIA... \
  --status Inactive

# AWS: Attach a deny-all inline policy to the role; unlike key deactivation,
# this also neutralizes sessions that are already active
aws iam put-role-policy \
  --role-name AgentRole \
  --policy-name EmergencyDeny \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*"
    }]
  }'

# Kubernetes: Delete the agent's service account token
kubectl delete secret agent-sa-token -n default

# Git: Revoke the agent's deploy key or PAT
gh api -X DELETE /repos/org/repo/keys/KEY_ID

echo "✅ Agent credentials revoked."
echo "   Review CloudTrail/audit logs for full damage assessment."
Don't panic-fix: Resist the urge to immediately run recovery commands. First, fully understand the blast radius. Running more commands in a panic can make things worse.

Assessment: Determining Blast Radius

Before attempting recovery, you need to understand the full scope of damage. Use audit logs to trace every action the agent took:

Bash - AWS CloudTrail: Trace Agent Actions
# Find all API calls made by the agent's credentials in the last hour
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA... \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[].{Time:EventTime,Name:EventName,Resource:Resources[0].ResourceName}' \
  --output table

# For more detail, search CloudTrail logs in S3
aws s3 ls s3://cloudtrail-bucket/AWSLogs/ --recursive \
  | grep $(date +%Y/%m/%d)

# Azure: Query Activity Log
az monitor activity-log list \
  --caller agent@company.com \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --query '[].{Time:eventTimestamp,Op:operationName.value,Status:status.value}'

# GCP: Query Audit Logs
gcloud logging read \
  'protoPayload.authenticationInfo.principalEmail="agent-sa@project.iam.gserviceaccount.com"' \
  --freshness=1h \
  --format='table(timestamp,protoPayload.methodName,protoPayload.resourceName)'
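Raw audit output is noisy; during triage what matters first is which calls mutated state. A quick filter over CloudTrail-style output (a sketch: the sample lines stand in for real `lookup-events` results, and the verb list is an assumption you should extend for your environment):

```shell
# Sample lines standing in for `aws cloudtrail lookup-events` table output.
events="2026-03-20T10:31:02Z  TerminateInstances  i-0abc123
2026-03-20T10:31:40Z  DescribeInstances   -
2026-03-20T10:32:15Z  DeleteBucket        my-bucket"

# Keep only mutating calls; read-only Describe*/List*/Get* calls are noise
# during triage. Extend the verb list as needed.
echo "$events" | grep -E '(Delete|Terminate|Put|Update|Modify|Stop)[A-Za-z]*'
```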

Recovery from Backups and Snapshots

Once you understand the blast radius, begin recovery using your backup and snapshot strategy:

| Resource Type | Recovery Method | Typical RTO |
| --- | --- | --- |
| RDS Database | Point-in-time restore from automated backups | 15-30 minutes |
| S3 Objects | Restore from versioning or cross-region replication | Minutes to hours |
| EC2 Instances | Launch from AMI backups or EBS snapshots | 5-15 minutes |
| DynamoDB Tables | Point-in-time recovery (PITR) or on-demand backup | Minutes |
| Kubernetes Resources | Re-apply from git (GitOps) or Velero backup | 5-10 minutes |
| Terraform State | Restore from state backup in S3 | Minutes |
| Git Repository | Recover from reflog, restore force-pushed commits | Minutes |
Bash - Common Recovery Commands
# Restore RDS from point-in-time
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier production-db \
  --target-db-instance-identifier production-db-restored \
  --restore-time "2026-03-20T10:30:00Z" \
  --db-instance-class db.r6g.large

# Restore S3 object from versioning
aws s3api list-object-versions \
  --bucket my-bucket \
  --prefix important-data/ \
  --query 'Versions[?IsLatest==`false`]'

# Undelete an object by removing its delete marker
# (the previous version becomes current again)
aws s3api delete-object \
  --bucket my-bucket \
  --key important-data/file.json \
  --version-id "DELETE_MARKER_VERSION_ID"

# Restore Kubernetes from Velero backup
velero restore create --from-backup daily-backup-20260320

# Git: Recover force-pushed branch
git reflog show origin/main
git push origin RECOVERED_SHA:main
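The table above also lists Terraform state restore, which the commands don't cover. One hedged sketch, assuming versioning is enabled on the state bucket (the bucket, key, and version ID are placeholders, and the `DRY_RUN` guard is ours so nothing executes until you set `DRY_RUN=0`):

```shell
#!/bin/bash
# restore-tf-state.sh - roll Terraform state back to a pre-incident version.
# BUCKET/KEY are placeholders; DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
BUCKET="terraform-state-bucket"
KEY="env/production/terraform.tfstate"

run() { if [ "$DRY_RUN" = "1" ]; then echo "DRY RUN: $*"; else "$@"; fi; }

# 1. List state file versions to pick the last pre-incident one
run aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" \
  --query 'Versions[].{Id:VersionId,Time:LastModified}' --output table

# 2. Copy the chosen version back over the current state
#    (PRIOR_VERSION_ID comes from the listing above)
run aws s3api copy-object --bucket "$BUCKET" --key "$KEY" \
  --copy-source "$BUCKET/$KEY?versionId=PRIOR_VERSION_ID"
```

After the copy, run `terraform plan` to confirm the restored state matches the recovered infrastructure before making any further changes.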

Post-Incident Review

After recovery, conduct a blameless post-incident review. The goal is to understand what happened and prevent recurrence:

Markdown - Post-Incident Review Template
# AI Agent Incident Report

## Incident Summary
- **Date/Time:** 2026-03-20 10:30 UTC
- **Duration:** 45 minutes (detection to recovery)
- **Severity:** SEV-2
- **Agent:** Claude Code v1.x
- **Developer:** [name]

## What Happened
1. Developer asked agent to "clean up unused resources in staging"
2. Agent identified resources based on tags and last-activity timestamps
3. Agent ran `terraform destroy` targeting a module that included
   shared infrastructure used by both staging and production
4. Production API gateway was destroyed along with staging resources

## Impact
- Production API was down for 30 minutes
- ~2,000 API requests failed during the outage
- No data loss (databases were on separate infrastructure)

## Root Cause
- The Terraform module bundled staging and production resources together
- Agent had production-level credentials (not least-privilege)
- No guardrail script was in place to block `terraform destroy`

## What Went Well
- CloudTrail alert fired within 2 minutes
- RDS was unaffected due to deletion protection
- Recovery from Terraform state took only 15 minutes

## Action Items
- [ ] Separate staging and production into distinct Terraform modules
- [ ] Implement agent-specific IAM role with no delete permissions
- [ ] Add guardrail script to block `terraform destroy` commands
- [ ] Configure deletion protection on all production resources
- [ ] Add CLAUDE.md rules to forbid destroy commands

Building Runbooks for AI Agent Incidents

Pre-built runbooks ensure consistent, fast responses. Create runbooks for common agent incident scenarios:

Runbook structure: Each runbook should answer: (1) How do I detect this? (2) How do I contain it? (3) How do I recover? (4) Who do I notify? (5) What's the follow-up?
| Runbook | Trigger | Key Steps |
| --- | --- | --- |
| Agent Deleted Cloud Resources | CloudTrail alert for Delete* API calls | Revoke creds, assess blast radius, restore from backup/snapshot |
| Agent Force-Pushed to Main | Git webhook for force-push events | Recover from reflog, notify team, revert |
| Agent Exposed Secrets | Secret scanning alert, commit with credentials | Rotate all exposed credentials, git filter-branch to remove |
| Agent Caused Cost Spike | AWS Budget alert threshold exceeded | Identify resources, terminate/scale down, review bill |
| Agent Modified Prod Config | Config drift detection alert | Revert config, restart services, verify health |
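For the exposed-secrets runbook, the history-rewrite step is easy to get wrong under pressure, so rehearse it. A sketch that runs entirely in a throwaway repo (in a real incident you would rotate the credentials first and force-push the rewritten history afterwards):

```shell
#!/bin/bash
set -e
# Rehearsal in a throwaway repo: purge a committed secret from git history.
tmp=$(mktemp -d) && cd "$tmp"
git init -q . && git config user.email ir@example.com && git config user.name ir
echo "AWS_SECRET=AKIAEXAMPLE" > .env
git add .env && git commit -qm "oops: committed credentials"

# Remove .env from every commit (filter-branch as in the runbook table;
# git-filter-repo or BFG are faster on large histories)
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
  --index-filter 'git rm --cached --ignore-unmatch .env' -- --all

# filter-branch keeps the old commits under refs/original; drop them too
rm -rf .git/refs/original

# Verify: no commit in any ref touches .env anymore
test -z "$(git log --all --oneline -- .env)" && echo "secret purged"
```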

Communication Templates

Have pre-written templates ready for stakeholder communication during an incident:

Text - Incident Communication Templates
--- INITIAL NOTIFICATION (within 5 minutes) ---

Subject: [INCIDENT] Production impact from AI agent action

Team,

We've identified a production incident caused by an AI coding agent.
- **Impact:** [describe user-facing impact]
- **Status:** Investigating and containing
- **ETA for update:** 15 minutes

We have revoked the agent's credentials and are assessing blast radius.

--- STATUS UPDATE (every 15-30 minutes) ---

Subject: [INCIDENT UPDATE] [service] - Recovery in progress

- **Current status:** Recovery in progress
- **Actions taken:** [list actions]
- **ETA for resolution:** [time estimate]
- **Workarounds:** [if any]

--- RESOLUTION ---

Subject: [RESOLVED] Production incident from AI agent action

The incident has been resolved.
- **Duration:** [total time]
- **Root cause:** AI agent executed [command] which [effect]
- **Resolution:** Restored from [backup/snapshot/revert]
- **Follow-up:** Post-incident review scheduled for [date]

Lessons Learned from Real Incidents

Common patterns seen across AI agent incidents:

  • Ambiguous instructions are dangerous: "Clean up old resources" is interpreted differently by every agent. Be specific: "Delete EC2 instances tagged env=test that have been stopped for 30+ days."
  • Agents don't understand blast radius: An agent doesn't know that deleting "that one S3 bucket" means losing 5TB of customer data. Always require explicit confirmation for delete operations.
  • Shared infrastructure is the biggest risk: Resources shared between environments (staging/production) are the most common source of agent-caused outages.
  • Recovery time depends on preparation: Teams with backups, snapshots, and deletion protection recovered in minutes. Teams without took hours or days.
  • The cost of guardrails is always less than the cost of an outage: Even a 30-minute production outage costs more than setting up proper agent safety.