GCP Backup & Recovery
Guardrails can prevent most accidental deletions, but no defense is perfect. Backups ensure you can recover even when all other guardrails fail. This lesson covers GCP-native backup services and recovery procedures.
Google Cloud Backup and DR Service
Google Cloud Backup and DR is a managed service that centralizes backup management across GCP services:
```bash
# Enable the Backup and DR API
gcloud services enable backupdr.googleapis.com --project=my-project

# Create a backup vault (storage for backups)
gcloud backup-dr backup-vaults create prod-backup-vault \
    --location=us-central1 \
    --backup-minimum-enforced-retention-duration=604800s \
    --description="Production backup vault for AI agent-managed resources"

# Create a backup plan
gcloud backup-dr backup-plans create daily-backup-plan \
    --location=us-central1 \
    --backup-vault=prod-backup-vault \
    --resource-type=compute.googleapis.com/Instance \
    --backup-rule-id=daily-rule \
    --retention-days=30 \
    --recurrence="FREQ=DAILY;BYHOUR=2;BYMINUTE=0"
```
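The `--backup-minimum-enforced-retention-duration` flag takes a value in seconds (the `604800s` above is 7 days). A small helper, purely illustrative, makes that conversion explicit:

```bash
# Convert a retention period in days to the seconds-based duration string
# that --backup-minimum-enforced-retention-duration expects.
days_to_duration() {
  local days="$1"
  echo "$(( days * 24 * 60 * 60 ))s"
}

days_to_duration 7    # -> 604800s (the 7-day minimum used above)
days_to_duration 30   # -> 2592000s
```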
The `backup-minimum-enforced-retention-duration` setting prevents anyone (including agents) from deleting backups before the retention period expires. This is your guarantee that backups will be available when needed.

Persistent Disk Snapshots and Schedules
Persistent Disk snapshots are the foundation of Compute Engine backup. Schedule them to run automatically:
```bash
# Create a snapshot schedule policy
gcloud compute resource-policies create snapshot-schedule daily-snapshots \
    --region=us-central1 \
    --max-retention-days=30 \
    --on-source-disk-delete=apply-retention-policy \
    --daily-schedule \
    --start-time=02:00 \
    --storage-location=us

# Attach the schedule to a persistent disk
gcloud compute disks add-resource-policies prod-web-server \
    --zone=us-central1-a \
    --resource-policies=daily-snapshots

# Take a manual snapshot before risky operations
gcloud compute disks snapshot prod-web-server \
    --zone=us-central1-a \
    --snapshot-names=pre-agent-operation-$(date +%Y%m%d-%H%M%S) \
    --storage-location=us

# List snapshots
gcloud compute snapshots list \
    --filter="sourceDisk:prod-web-server" \
    --sort-by=~creationTimestamp
```
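The manual snapshot above embeds a timestamp in its name. A small wrapper, sketched here with hypothetical function names, keeps that naming convention consistent whenever a script or agent takes a pre-operation snapshot:

```bash
# Generate a timestamped snapshot name, matching the
# pre-agent-operation-YYYYMMDD-HHMMSS convention used above.
snapshot_name() {
  local prefix="$1"
  echo "${prefix}-$(date -u +%Y%m%d-%H%M%S)"
}

# Sketch of a pre-operation snapshot wrapper
# (assumes gcloud is installed and authenticated).
pre_op_snapshot() {
  local disk="$1" zone="$2"
  local name
  name="$(snapshot_name pre-agent-operation)"
  gcloud compute disks snapshot "$disk" \
    --zone="$zone" \
    --snapshot-names="$name" \
    --storage-location=us
  echo "$name"
}
```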
To recover, create a new disk from the snapshot and attach it to a replacement instance:

```bash
# Create a new disk from a snapshot
gcloud compute disks create restored-prod-disk \
    --zone=us-central1-a \
    --source-snapshot=pre-agent-operation-20260320-140000

# Create a new instance using the restored disk
gcloud compute instances create restored-prod-server \
    --zone=us-central1-a \
    --machine-type=e2-medium \
    --disk=name=restored-prod-disk,boot=yes \
    --deletion-protection
```
Cloud SQL Automated Backups and PITR
Cloud SQL offers automated daily backups and point-in-time recovery (PITR) for precise restoration:
```bash
# Configure automated backups with PITR
# (--enable-point-in-time-recovery is for PostgreSQL; on MySQL, enable
# binary logging with --enable-bin-log instead)
gcloud sql instances patch prod-database \
    --backup-start-time=02:00 \
    --enable-point-in-time-recovery \
    --retained-backups-count=30 \
    --retained-transaction-log-days=7

# Take an on-demand backup before risky operations
gcloud sql backups create \
    --instance=prod-database \
    --description="Pre-agent-operation backup"

# List available backups
gcloud sql backups list --instance=prod-database

# Restore to a specific point in time
gcloud sql instances clone prod-database prod-database-restored \
    --point-in-time="2026-03-20T14:00:00.000Z"

# Restore from a specific backup
gcloud sql backups restore BACKUP_ID \
    --restore-instance=prod-database
```
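Before cloning to a point in time, it is worth checking that the target timestamp still falls inside the retained transaction log window. A sketch using GNU date; the 7-day default is an assumption that should match your `--retained-transaction-log-days` setting:

```bash
# Return success if the requested PITR timestamp is within the retained
# transaction log window (default 7 days). Requires GNU date.
pitr_in_window() {
  local target="$1" retained_days="${2:-7}"
  local now target_epoch cutoff
  now=$(date -u +%s)
  target_epoch=$(date -u -d "$target" +%s) || return 2
  cutoff=$(( now - retained_days * 86400 ))
  [ "$target_epoch" -ge "$cutoff" ] && [ "$target_epoch" -le "$now" ]
}

# Usage: pitr_in_window "2026-03-20T14:00:00Z" && gcloud sql instances clone ...
```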
The equivalent Terraform configuration, with deletion protection layered in:

```hcl
resource "google_sql_database_instance" "main" {
  name             = "prod-database"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier = "db-custom-4-16384"

    # API-level deletion protection
    deletion_protection_enabled = true

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "02:00"
      transaction_log_retention_days = 7

      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }
  }

  # Terraform-level deletion protection
  deletion_protection = true

  lifecycle {
    prevent_destroy = true
  }
}
```
GCS Versioning and Lifecycle Policies
GCS versioning keeps previous versions of objects, allowing recovery after accidental overwrites or deletions:
```bash
# Enable versioning
gcloud storage buckets update gs://company-critical-data --versioning

# Set a lifecycle rule: delete noncurrent versions after 90 days
# (keeps costs manageable while providing a recovery window)
gcloud storage buckets update gs://company-critical-data \
    --lifecycle-file=lifecycle.json
```
```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {
        "isLive": false,
        "daysSinceNoncurrentTime": 90
      }
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {
        "isLive": false,
        "daysSinceNoncurrentTime": 30
      }
    }
  ]
}
```
```bash
# List all versions of an object (including deleted)
gcloud storage objects list gs://company-critical-data/important.csv \
    --all-versions

# Restore a specific previous version
gcloud storage cp \
    gs://company-critical-data/important.csv#1234567890123456 \
    gs://company-critical-data/important.csv

# Recover from soft delete (if enabled)
gcloud storage objects restore gs://company-critical-data/deleted-file.csv \
    --generation=1234567890123456
```
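Version-pinned copies use the `gs://bucket/object#GENERATION` syntax shown above. A helper that builds that URL (function names are illustrative) keeps restore scripts readable:

```bash
# Build a generation-pinned GCS URL (gs://bucket/object#GENERATION).
versioned_url() {
  local bucket="$1" object="$2" generation="$3"
  echo "gs://${bucket}/${object}#${generation}"
}

# Sketch: restore a specific noncurrent generation over the live object
# (assumes gcloud is installed and authenticated).
restore_generation() {
  local bucket="$1" object="$2" generation="$3"
  gcloud storage cp \
    "$(versioned_url "$bucket" "$object" "$generation")" \
    "gs://${bucket}/${object}"
}
```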
Cross-Region Backup Strategies
For disaster recovery, replicate backups across regions:
| Service | Cross-Region Strategy | Configuration |
|---|---|---|
| Compute Engine | Multi-region snapshot storage | --storage-location=us (multi-region) |
| Cloud SQL | Cross-region read replicas + backups | Create replica in another region |
| Cloud Storage | Dual-region or multi-region buckets | --location=US or --location=NAM4 |
| BigQuery | Cross-region dataset copies | bq cp across regions |
| GKE | Backup for GKE (cross-region restore) | Backup plan with remote target |
```bash
# Create a cross-region read replica (also serves as backup)
gcloud sql instances create prod-database-replica \
    --master-instance-name=prod-database \
    --region=us-east1 \
    --tier=db-custom-4-16384 \
    --deletion-protection

# In case of primary failure, promote the replica
gcloud sql instances promote-replica prod-database-replica
```
Recovery Procedures After Agent-Caused Deletion
If an agent manages to delete a resource despite guardrails, follow these recovery steps:
Identify What Was Deleted
Check Cloud Audit Logs immediately. Filter for the agent's service account and the time window. The audit log entry contains the exact resource name, method, and result.
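This query can be scripted. A sketch that builds a Cloud Logging filter for delete calls made by a given service account since a given time (the service account name in the usage comment is illustrative):

```bash
# Build a Cloud Logging filter for delete operations performed by a
# specific service account since a given RFC 3339 timestamp.
build_delete_filter() {
  local sa="$1" since="$2"
  printf '%s' \
    "protoPayload.authenticationInfo.principalEmail=\"${sa}\"" \
    " AND protoPayload.methodName:\"delete\"" \
    " AND timestamp>=\"${since}\""
}

# Usage (assumes gcloud is authenticated):
# gcloud logging read \
#   "$(build_delete_filter agent@my-project.iam.gserviceaccount.com 2026-03-20T00:00:00Z)" \
#   --project=my-project --limit=50
```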
Check for Soft Delete / Grace Period
Many GCP resources have a soft-delete grace period. Projects have a 30-day recovery window. Cloud Storage with soft delete keeps objects for the configured duration. Act quickly.
Restore from Backup
If the resource is permanently deleted, restore from your most recent backup: disk snapshots for VMs, Cloud SQL backups for databases, GCS versioning for storage objects.
Verify and Validate
After restoration, verify that the recovered resource matches the expected state. Run application health checks, validate data integrity, and confirm connectivity.
Close the Gap
Investigate how the agent bypassed your guardrails. Tighten IAM permissions, add missing deny policies, enable deletion protection on the recovered resource, and update your monitoring.
```bash
# List projects pending deletion
gcloud projects list --filter="lifecycleState=DELETE_REQUESTED"

# Restore a project (within the 30-day grace period)
gcloud projects undelete prod-web-app

# Immediately add a lien to prevent re-deletion
gcloud alpha resource-manager liens create \
    --project=prod-web-app \
    --restrictions=resourcemanager.projects.delete \
    --reason="Recovered from accidental deletion - now protected"
```
Disaster Recovery Testing
Regularly test your backup and recovery procedures to ensure they work when needed:
- Test restoring a VM from a snapshot in a separate project
- Test Cloud SQL point-in-time recovery to a clone
- Test GCS object recovery from versioning
- Test project undelete in a sandbox environment
- Measure Recovery Time Objective (RTO) for each service
- Verify data integrity after each recovery test
- Document the procedure and keep it updated
- Run DR tests at least quarterly
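To make RTO measurement repeatable, wrap each drill in a timer. A minimal sketch; replace the restore command with the actual procedure being tested:

```bash
# Run a restore procedure (passed as arguments) and print the elapsed
# wall-clock seconds -- a rough RTO measurement for the drill.
measure_rto() {
  local start end
  start=$(date +%s)
  "$@"
  end=$(date +%s)
  echo $(( end - start ))
}

# Example drill:
# measure_rto gcloud compute disks create restored-prod-disk \
#   --zone=us-central1-a --source-snapshot=pre-agent-operation-20260320-140000
```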
Lilly Tech Systems