Best Practices & Checklist
This final lesson consolidates everything into actionable checklists, organization-level policies, and answers to the most common questions about safely using AI agents with cloud infrastructure.
Comprehensive Checklist for Safe AI Agent Cloud Usage
- AI agents use dedicated service accounts, never personal or admin credentials
- All agent credentials are time-limited (1 hour max session duration)
- Explicit Deny policies block all delete/terminate/destroy operations
- Permission boundaries are applied to all agent IAM roles
- Human approval is required for any infrastructure modification in production
- CI/CD pipelines with approval gates are used instead of direct agent execution
- Terraform `prevent_destroy` is set on all critical resources
- Cloud-native deletion protection is enabled on all production resources
- S3 Object Lock / Azure Immutable Blob / GCS Retention is enabled for critical data
- Real-time alerts are configured for all destructive API calls by agent accounts
- Agent activity dashboards are monitored by the operations team
- Incident response playbook is documented and tested quarterly
- Agent auto-approve/YOLO mode is disabled for all cloud CLI operations
- State files (Terraform/Pulumi) are stored remotely with versioning and locking
- Emergency kill switch procedures are documented and accessible
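Several checklist items above can be expressed as one explicit-Deny policy. A minimal sketch, assuming AWS and a hypothetical `ai-agent-deployer` role; the action list is illustrative, not exhaustive:

```shell
#!/bin/bash
# Sketch: explicit Deny on destructive actions for the agent role.
# Role name and action list are assumptions for illustration.
DENY_POLICY='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDestructiveActions",
      "Effect": "Deny",
      "Action": [
        "ec2:TerminateInstances",
        "rds:DeleteDBInstance",
        "s3:DeleteBucket",
        "s3:DeleteObject",
        "dynamodb:DeleteTable"
      ],
      "Resource": "*"
    }
  ]
}'

# Validate the JSON locally before touching the cloud
echo "$DENY_POLICY" | python3 -m json.tool > /dev/null && POLICY_OK=1

# Attach to the agent role when ready (role name is an assumption):
# aws iam put-role-policy --role-name ai-agent-deployer \
#   --policy-name deny-destructive --policy-document "$DENY_POLICY"
```

Because the Deny is explicit, it wins over any Allow the agent's other policies might grant.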
Organization-Level Policies
| Policy Area | Requirement | Enforcement |
|---|---|---|
| Agent Onboarding | All AI agent tools must be approved by security team before use | Software allow-list, procurement controls |
| Credential Management | No long-lived credentials for agent accounts; 1-hour max sessions | AWS SCP, Azure Policy, GCP Organization Policy |
| Environment Isolation | Agent credentials are scoped to a single environment (dev/staging/prod) | Separate AWS accounts, Azure subscriptions, GCP projects per env |
| Audit Requirements | All agent actions must be logged and retained for 90 days minimum | CloudTrail, Activity Logs, Audit Logs with retention policies |
| Incident Reporting | Any agent-caused incident must be reported within 1 hour | PagerDuty/OpsGenie integration, Slack alerting |
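On AWS, the one-hour cap from the Credential Management row maps onto the role's maximum session duration. A sketch, assuming a role named `ai-agent-deployer`; the update command is shown commented out since it needs live credentials:

```shell
#!/bin/bash
# Cap the agent role's STS sessions at 1 hour (3600 seconds)
MAX_SESSION=3600

# aws iam update-role --role-name ai-agent-deployer \
#   --max-session-duration "$MAX_SESSION"
echo "Session cap: $((MAX_SESSION / 60)) minutes"
```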
Team Training and Awareness
- Mandatory Onboarding Training: Every developer who uses AI coding agents must complete this course before being granted agent-compatible cloud credentials. Include a practical exercise where they configure least-privilege policies.
- Monthly Safety Reviews: Review agent activity logs monthly. Identify patterns of risky behavior, near-misses, and successful safety interventions. Share findings in team retrospectives.
- Incident Simulations: Run quarterly tabletop exercises where the team simulates an AI agent accidentally deleting production resources. Practice the full incident response workflow from detection to recovery.
Regular Access Reviews
AI agent permissions should be reviewed more frequently than human permissions because the risk profile is different:
- Weekly: Review failed authorization attempts from agent accounts (indicates the agent tried something it should not)
- Monthly: Audit all permissions granted to agent service accounts against actual usage
- Quarterly: Full access review with security team sign-off on all agent IAM policies
- After every incident: Immediate review and tightening of affected agent permissions
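The monthly right-sizing pass can lean on AWS's service-last-accessed data to find permissions the agent never uses. A sketch, assuming a hypothetical role ARN; the CLI calls are commented out because they require live credentials:

```shell
#!/bin/bash
# Monthly audit sketch: which services can the agent role reach but never touch?
ROLE_ARN="arn:aws:iam::123456789012:role/ai-agent-deployer"  # assumed ARN

# JOB_ID=$(aws iam generate-service-last-accessed-details \
#   --arn "$ROLE_ARN" --query JobId --output text)
# aws iam get-service-last-accessed-details --job-id "$JOB_ID" \
#   --query 'ServicesLastAccessed[?TotalAuthenticatedEntities==`0`].ServiceNamespace'
echo "Reviewing unused service permissions for $ROLE_ARN"
```

Services that come back with zero authenticated entities are candidates for removal from the agent's policy.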
Testing Agents in Sandboxed Environments
Before allowing an AI agent to interact with any real environment, test it in an isolated sandbox:
```shell
# AWS: Create a dedicated sandbox account via AWS Organizations.
# This account has no connectivity to production accounts.
aws organizations create-account \
  --email ai-agent-sandbox@company.com \
  --account-name "AI Agent Sandbox"
```

Apply an SCP that prevents any cross-account access. (Pair it with an AWS Budget alert for the $100/month cap, since SCPs cannot limit spending by themselves.)

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCrossAccountAccess",
      "Effect": "Deny",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::*:role/*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalAccount": "${aws:ResourceAccount}"
        }
      }
    }
  ]
}
```

Let the agent do whatever it wants in the sandbox. Monitor its behavior, then create appropriate policies for staging and production based on its observed needs.
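The spending cap itself comes from AWS Budgets, not the SCP. A sketch, assuming a placeholder sandbox account ID and the sandbox email above; the create-budget call is commented out:

```shell
#!/bin/bash
# Sandbox cost guardrail: $100/month budget with an 80% actual-spend alert
BUDGET='{
  "BudgetName": "ai-agent-sandbox",
  "BudgetLimit": {"Amount": "100", "Unit": "USD"},
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}'
echo "$BUDGET" | python3 -m json.tool > /dev/null && BUDGET_OK=1

# aws budgets create-budget --account-id 123456789012 \
#   --budget "$BUDGET" \
#   --notifications-with-subscribers '[{"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80},"Subscribers":[{"SubscriptionType":"EMAIL","Address":"ai-agent-sandbox@company.com"}]}]'
```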
Emergency Kill Switches
- Immediate: Disable the agent's IAM access key or service account
- Quick: Apply an emergency SCP/Policy that denies all actions for agent principals
- Terminal: Kill the agent's terminal session or container
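The "Quick" option works best when the deny policy already exists and only needs attaching during an incident. A sketch, assuming agent roles share an `ai-agent-*` naming prefix; policy and target IDs are placeholders:

```shell
#!/bin/bash
# Pre-created "freeze" SCP: denies everything, but only for agent principals,
# matched here by an assumed ai-agent-* role-name convention.
FREEZE_SCP='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FreezeAgents",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "ArnLike": {"aws:PrincipalArn": "arn:aws:iam::*:role/ai-agent-*"}
      }
    }
  ]
}'
echo "$FREEZE_SCP" | python3 -m json.tool > /dev/null && SCP_OK=1

# Create once ahead of time, attach in seconds during an incident:
# aws organizations create-policy --name agent-freeze \
#   --type SERVICE_CONTROL_POLICY --content "$FREEZE_SCP"
# aws organizations attach-policy --policy-id p-placeholder \
#   --target-id ou-placeholder
```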
```shell
#!/bin/bash
# save as: kill-agent-access.sh
# Run this immediately when an agent is behaving dangerously

# AWS: Deactivate the agent's access key
aws iam update-access-key \
  --user-name ai-agent-deployer \
  --access-key-id AKIA... \
  --status Inactive

# Azure: Disable the service principal
az ad sp update \
  --id "ai-agent-sp-object-id" \
  --set accountEnabled=false

# GCP: Disable the service account
gcloud iam service-accounts disable \
  ai-agent@project.iam.gserviceaccount.com

echo "Agent access has been revoked across all clouds."
echo "Proceed to assess damage and begin recovery."
```
Frequently Asked Questions
Can I trust AI agents to manage production infrastructure?
AI agents can be valuable for infrastructure tasks, but they should never have unsupervised access to production. Use them for generating plans and code, but always require human approval before applying changes to production. Think of AI agents as powerful assistants that need guardrails, not autonomous operators.
What if my AI agent needs delete permissions for legitimate tasks?
Some tasks genuinely require delete permissions (cleaning up development resources, rotating secrets). For these cases, use time-limited credential elevation: the agent requests elevated permissions, a human approves, and the permissions automatically expire after the task window (15-60 minutes). Never grant permanent delete permissions to agent accounts.
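The elevation window maps directly onto the STS session duration. A sketch, assuming a dedicated `ai-agent-cleanup` role and a placeholder account ID; the assume-role call is commented out:

```shell
#!/bin/bash
# After human approval, the agent assumes a cleanup role whose
# credentials expire on their own after the task window.
DURATION=900  # 15 minutes: the bottom of the 15-60 minute window

# CREDS=$(aws sts assume-role \
#   --role-arn arn:aws:iam::123456789012:role/ai-agent-cleanup \
#   --role-session-name approved-cleanup \
#   --duration-seconds "$DURATION")
echo "Elevation window: $((DURATION / 60)) minutes"
```

No revocation step is needed afterward: the temporary credentials simply stop working.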
How do I handle AI agents in CI/CD pipelines?
AI agents in CI/CD should generate the infrastructure code or commands, but the actual execution should happen through the pipeline's own service account with appropriate environment-specific gates. Use GitHub Environments, GitLab Protected Environments, or similar features to require manual approval for production deployments.
Is it safe to use AI agents with Terraform?
Yes, with proper controls. AI agents are excellent at writing Terraform code. The key rules: (1) Never let agents run `terraform apply` or `terraform destroy` directly, (2) Set `prevent_destroy` on all critical resources, (3) Use remote state with locking, (4) Require PR reviews for all Terraform changes, and (5) Use `terraform plan` output as a review step before any apply.
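Rule (2) can be checked mechanically in CI. A sketch that greps for `prevent_destroy`, using a self-created sample file for illustration; point the grep at your real module paths instead:

```shell
#!/bin/bash
# CI check sketch: do the Terraform files declare prevent_destroy?
# A sample file is created here purely for demonstration.
TMP_DIR=$(mktemp -d)
cat > "$TMP_DIR/main.tf" <<'EOF'
resource "aws_db_instance" "prod" {
  identifier = "prod-db"
  lifecycle {
    prevent_destroy = true
  }
}
EOF

# Count .tf files that carry the lifecycle guard
PROTECTED=$(grep -rl "prevent_destroy" "$TMP_DIR" | wc -l | tr -d ' ')
echo "Files with prevent_destroy: $PROTECTED"
rm -rf "$TMP_DIR"
```

A real pipeline would fail the build when a file matching your critical-resource naming convention lacks the guard.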
What about costs? Can AI agents accidentally run up cloud bills?
Yes, cost is a real risk. AI agents might create expensive resources (GPU instances, large databases) without understanding the cost implications. Set up billing alerts, use AWS Budgets / Azure Cost Management / GCP Budget Alerts, and add cost guardrails to your agent policy (e.g., deny creation of instance types above a certain size).
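The instance-size guardrail can be written as a Deny with a condition on `ec2:InstanceType`. A sketch, assuming burstable `t3`/`t4g` families are the allowed ceiling; adjust the allow-list to your own cost tolerance:

```shell
#!/bin/bash
# Deny RunInstances for any instance type outside the cheap burstable families
COST_GUARD='{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotLike": {"ec2:InstanceType": ["t3.*", "t4g.*"]}
      }
    }
  ]
}'
echo "$COST_GUARD" | python3 -m json.tool > /dev/null && GUARD_OK=1
```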
How often should I review and update agent permissions?
Review weekly for anomalies, monthly for permission right-sizing, and quarterly for a full security audit. Additionally, review immediately after any incident or near-miss. Use cloud-native tools like AWS IAM Access Analyzer, Azure AD Access Reviews, and GCP IAM Recommender to identify unused permissions that should be removed.
What is the minimum viable safety setup for a small team?
At minimum: (1) Create a dedicated IAM role for agent use with no delete permissions, (2) Enable deletion protection on your production database and critical storage, (3) Set up a single email alert for any destructive API call in your account, and (4) Keep the agent's shell approval prompts enabled (never use auto-approve mode). These four steps take less than an hour and prevent the majority of agent-caused incidents.
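Step (3) can be sketched as an EventBridge rule matching destructive CloudTrail events. The event-name list is illustrative, and the put-rule call is commented out since it needs live credentials:

```shell
#!/bin/bash
# Single-alert sketch: match a few destructive management-plane calls
PATTERN='{
  "source": ["aws.ec2", "aws.rds", "aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["TerminateInstances", "DeleteDBInstance", "DeleteBucket"]
  }
}'
echo "$PATTERN" | python3 -m json.tool > /dev/null && PATTERN_OK=1

# aws events put-rule --name agent-destructive-calls \
#   --event-pattern "$PATTERN"
# Then add an SNS email target to the rule for the alert itself.
```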
Lilly Tech Systems