Advanced

Incident Response & Recovery

Despite all guardrails, incidents will happen. When an AI agent causes infrastructure damage, a structured response is critical. This lesson provides the playbook for containing, recovering from, and learning from AI agent incidents.

Immediate Response: The First 5 Minutes

When you realize an AI agent has caused damage, speed matters. Follow these steps in order:

  1. Stop the Agent Immediately

    Kill the agent process. Close the terminal. If using Claude Code, press Ctrl+C immediately. Don't let it continue executing commands.

  2. Revoke Agent Credentials

    Revoke or rotate the credentials the agent was using. This limits further damage in case a session is somehow still active.

  3. Assess What Happened

    Check the agent's command history, terminal output, and audit logs to understand exactly what commands were executed.

  4. Alert the Team

    Notify the infrastructure team, on-call engineer, and your manager. Don't try to fix everything alone.
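Before killing anything blindly, it helps to see exactly which agent processes are running. A minimal read-only check (the process names below are assumptions; substitute whatever tools your team actually runs):

```shell
# List candidate agent processes before killing them (pgrep is read-only).
# "claude" and "aider" are assumed process names; adjust the pattern.
pgrep -af "claude|aider" || echo "no agent processes found"
```

Once you have the PIDs, `kill -INT <pid>` sends the same interrupt as Ctrl+C.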

Bash - Emergency Credential Revocation
#!/bin/bash
# emergency-revoke.sh - Run immediately when agent incident detected

echo "🚨 EMERGENCY: Revoking agent credentials..."

# AWS: Deactivate the access key used by the agent
aws iam update-access-key \
  --user-name agent-user \
  --access-key-id AKIA... \
  --status Inactive

# AWS: Attach a deny-all inline policy to the role; unlike key deactivation,
# this also neutralizes sessions that are already active
aws iam put-role-policy \
  --role-name AgentRole \
  --policy-name EmergencyDeny \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*"
    }]
  }'

# Kubernetes: Delete the agent's service account token
kubectl delete secret agent-sa-token -n default

# Git: Revoke the agent's deploy key or PAT
gh api -X DELETE /repos/org/repo/keys/KEY_ID

echo "✅ Agent credentials revoked."
echo "   Review CloudTrail/audit logs for full damage assessment."
Don't panic-fix: Resist the urge to immediately run recovery commands. First, fully understand the blast radius. Running more commands in a panic can make things worse.

Assessment: Determining Blast Radius

Before attempting recovery, you need to understand the full scope of damage. Use audit logs to trace every action the agent took:

Bash - AWS CloudTrail: Trace Agent Actions
# Find all API calls made by the agent's credentials in the last hour
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA... \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[].{Time:EventTime,Name:EventName,Resource:Resources[0].ResourceName}' \
  --output table

# For more detail, search CloudTrail logs in S3
aws s3 ls s3://cloudtrail-bucket/AWSLogs/ --recursive \
  | grep $(date +%Y/%m/%d)

# Azure: Query Activity Log
az monitor activity-log list \
  --caller agent@company.com \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --query '[].{Time:eventTimestamp,Op:operationName.value,Status:status.value}'

# GCP: Query Audit Logs
gcloud logging read \
  'protoPayload.authenticationInfo.principalEmail="agent-sa@project.iam.gserviceaccount.com"' \
  --freshness=1h \
  --format='table(timestamp,protoPayload.methodName,protoPayload.resourceName)'
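Raw audit output is noisy; during triage what matters first is which calls mutated state. A quick filter over CloudTrail-style output (a sketch: the sample lines stand in for real `lookup-events` results, and the verb list is an assumption you should extend for your environment):

```shell
# Sample lines standing in for `aws cloudtrail lookup-events` table output.
events="2026-03-20T10:31:02Z  TerminateInstances  i-0abc123
2026-03-20T10:31:40Z  DescribeInstances   -
2026-03-20T10:32:15Z  DeleteBucket        my-bucket"

# Keep only mutating calls; read-only Describe*/List*/Get* calls are noise
# during triage. Extend the verb list as needed.
echo "$events" | grep -E '(Delete|Terminate|Put|Update|Modify|Stop)[A-Za-z]*'
```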

Recovery from Backups and Snapshots

Once you understand the blast radius, begin recovery using your backup and snapshot strategy:

| Resource Type | Recovery Method | Typical RTO |
| --- | --- | --- |
| RDS Database | Point-in-time restore from automated backups | 15-30 minutes |
| S3 Objects | Restore from versioning or cross-region replication | Minutes to hours |
| EC2 Instances | Launch from AMI backups or EBS snapshots | 5-15 minutes |
| DynamoDB Tables | Point-in-time recovery (PITR) or on-demand backup | Minutes |
| Kubernetes Resources | Re-apply from git (GitOps) or Velero backup | 5-10 minutes |
| Terraform State | Restore from state backup in S3 | Minutes |
| Git Repository | Recover from reflog, restore force-pushed commits | Minutes |
Bash - Common Recovery Commands
# Restore RDS from point-in-time
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier production-db \
  --target-db-instance-identifier production-db-restored \
  --restore-time "2026-03-20T10:30:00Z" \
  --db-instance-class db.r6g.large

# Restore S3 object from versioning
aws s3api list-object-versions \
  --bucket my-bucket \
  --prefix important-data/ \
  --query 'Versions[?IsLatest==`false`]'

# Undelete an object by removing its delete marker
# (the previous version becomes current again)
aws s3api delete-object \
  --bucket my-bucket \
  --key important-data/file.json \
  --version-id "DELETE_MARKER_VERSION_ID"

# Restore Kubernetes from Velero backup
velero restore create --from-backup daily-backup-20260320

# Git: Recover force-pushed branch
git reflog show origin/main
git push origin RECOVERED_SHA:main
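The table above also lists Terraform state restore, which the commands don't cover. One hedged sketch, assuming versioning is enabled on the state bucket (the bucket, key, and version ID are placeholders, and the `DRY_RUN` guard is ours so nothing executes until you set `DRY_RUN=0`):

```shell
#!/bin/bash
# restore-tf-state.sh - roll Terraform state back to a pre-incident version.
# BUCKET/KEY are placeholders; DRY_RUN=1 (the default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"
BUCKET="terraform-state-bucket"
KEY="env/production/terraform.tfstate"

run() { if [ "$DRY_RUN" = "1" ]; then echo "DRY RUN: $*"; else "$@"; fi; }

# 1. List state file versions to pick the last pre-incident one
run aws s3api list-object-versions --bucket "$BUCKET" --prefix "$KEY" \
  --query 'Versions[].{Id:VersionId,Time:LastModified}' --output table

# 2. Copy the chosen version back over the current state
#    (PRIOR_VERSION_ID comes from the listing above)
run aws s3api copy-object --bucket "$BUCKET" --key "$KEY" \
  --copy-source "$BUCKET/$KEY?versionId=PRIOR_VERSION_ID"
```

After the copy, run `terraform plan` to confirm the restored state matches the recovered infrastructure before making any further changes.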

Post-Incident Review

After recovery, conduct a blameless post-incident review. The goal is to understand what happened and prevent recurrence:

Markdown - Post-Incident Review Template
# AI Agent Incident Report

## Incident Summary
- **Date/Time:** 2026-03-20 10:30 UTC
- **Duration:** 45 minutes (detection to recovery)
- **Severity:** SEV-2
- **Agent:** Claude Code v1.x
- **Developer:** [name]

## What Happened
1. Developer asked agent to "clean up unused resources in staging"
2. Agent identified resources based on tags and last-activity timestamps
3. Agent ran `terraform destroy` targeting a module that included
   shared infrastructure used by both staging and production
4. Production API gateway was destroyed along with staging resources

## Impact
- Production API was down for 30 minutes
- ~2,000 API requests failed during the outage
- No data loss (databases were on separate infrastructure)

## Root Cause
- The Terraform module bundled staging and production resources together
- Agent had production-level credentials (not least-privilege)
- No guardrail script was in place to block `terraform destroy`

## What Went Well
- CloudTrail alert fired within 2 minutes
- RDS was unaffected due to deletion protection
- Recovery from Terraform state took only 15 minutes

## Action Items
- [ ] Separate staging and production into distinct Terraform modules
- [ ] Implement agent-specific IAM role with no delete permissions
- [ ] Add guardrail script to block `terraform destroy` commands
- [ ] Configure deletion protection on all production resources
- [ ] Add CLAUDE.md rules to forbid destroy commands

Building Runbooks for AI Agent Incidents

Pre-built runbooks ensure consistent, fast responses. Create runbooks for common agent incident scenarios:

Runbook structure: Each runbook should answer: (1) How do I detect this? (2) How do I contain it? (3) How do I recover? (4) Who do I notify? (5) What's the follow-up?
| Runbook | Trigger | Key Steps |
| --- | --- | --- |
| Agent Deleted Cloud Resources | CloudTrail alert for Delete* API calls | Revoke creds, assess blast radius, restore from backup/snapshot |
| Agent Force-Pushed to Main | Git webhook for force-push events | Recover from reflog, notify team, revert |
| Agent Exposed Secrets | Secret scanning alert, commit with credentials | Rotate all exposed credentials, git filter-branch to remove |
| Agent Caused Cost Spike | AWS Budget alert threshold exceeded | Identify resources, terminate/scale down, review bill |
| Agent Modified Prod Config | Config drift detection alert | Revert config, restart services, verify health |
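For the exposed-secrets runbook, the history-rewrite step is easy to get wrong under pressure, so rehearse it. A sketch that runs entirely in a throwaway repo (in a real incident you would rotate the credentials first and force-push the rewritten history afterwards):

```shell
#!/bin/bash
set -e
# Rehearsal in a throwaway repo: purge a committed secret from git history.
tmp=$(mktemp -d) && cd "$tmp"
git init -q . && git config user.email ir@example.com && git config user.name ir
echo "AWS_SECRET=AKIAEXAMPLE" > .env
git add .env && git commit -qm "oops: committed credentials"

# Remove .env from every commit (filter-branch as in the runbook table;
# git-filter-repo or BFG are faster on large histories)
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
  --index-filter 'git rm --cached --ignore-unmatch .env' -- --all

# filter-branch keeps the old commits under refs/original; drop them too
rm -rf .git/refs/original

# Verify: no commit in any ref touches .env anymore
test -z "$(git log --all --oneline -- .env)" && echo "secret purged"
```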

Communication Templates

Have pre-written templates ready for stakeholder communication during an incident:

Text - Incident Communication Templates
--- INITIAL NOTIFICATION (within 5 minutes) ---

Subject: [INCIDENT] Production impact from AI agent action

Team,

We've identified a production incident caused by an AI coding agent.
- **Impact:** [describe user-facing impact]
- **Status:** Investigating and containing
- **ETA for update:** 15 minutes

We have revoked the agent's credentials and are assessing blast radius.

--- STATUS UPDATE (every 15-30 minutes) ---

Subject: [INCIDENT UPDATE] [service] - Recovery in progress

- **Current status:** Recovery in progress
- **Actions taken:** [list actions]
- **ETA for resolution:** [time estimate]
- **Workarounds:** [if any]

--- RESOLUTION ---

Subject: [RESOLVED] Production incident from AI agent action

The incident has been resolved.
- **Duration:** [total time]
- **Root cause:** AI agent executed [command] which [effect]
- **Resolution:** Restored from [backup/snapshot/revert]
- **Follow-up:** Post-incident review scheduled for [date]

Lessons Learned from Real Incidents

Common patterns seen across AI agent incidents:

  • Ambiguous instructions are dangerous: "Clean up old resources" is interpreted differently by every agent. Be specific: "Delete EC2 instances tagged env=test that have been stopped for 30+ days."
  • Agents don't understand blast radius: An agent doesn't know that deleting "that one S3 bucket" means losing 5TB of customer data. Always require explicit confirmation for delete operations.
  • Shared infrastructure is the biggest risk: Resources shared between environments (staging/production) are the most common source of agent-caused outages.
  • Recovery time depends on preparation: Teams with backups, snapshots, and deletion protection recovered in minutes. Teams without took hours or days.
  • The cost of guardrails is always less than the cost of an outage: Even a 30-minute production outage costs more than setting up proper agent safety.