
GCP Backup & Recovery

Guardrails can prevent most accidental deletions, but no defense is perfect. Backups ensure you can recover even when all other guardrails fail. This lesson covers GCP-native backup services and recovery procedures.

Google Cloud Backup and DR Service

Google Cloud Backup and DR is a managed backup service that provides centralized backup management across GCP services:

gcloud - Backup and DR Service setup
# Enable the Backup and DR API
gcloud services enable backupdr.googleapis.com --project=my-project

# Create a backup vault (storage for backups); 604800s = a 7-day enforced retention floor
gcloud backup-dr backup-vaults create prod-backup-vault \
  --location=us-central1 \
  --backup-minimum-enforced-retention-duration=604800s \
  --description="Production backup vault for AI agent-managed resources"

# Create a backup plan
gcloud backup-dr backup-plans create daily-backup-plan \
  --location=us-central1 \
  --backup-vault=prod-backup-vault \
  --resource-type=compute.googleapis.com/Instance \
  --backup-rule-id=daily-rule \
  --retention-days=30 \
  --recurrence="FREQ=DAILY;BYHOUR=2;BYMINUTE=0"
💡 Enforced retention: The backup-minimum-enforced-retention-duration setting prevents anyone (including agents) from deleting backups before the retention period expires. This is your guarantee that backups will be available when needed.

Persistent Disk Snapshots and Schedules

Persistent Disk snapshots are the foundation of Compute Engine backup. Schedule them to run automatically:

gcloud - Snapshot schedules
# Create a snapshot schedule policy
gcloud compute resource-policies create snapshot-schedule daily-snapshots \
  --region=us-central1 \
  --max-retention-days=30 \
  --on-source-disk-delete=apply-retention-policy \
  --daily-schedule \
  --start-time=02:00 \
  --storage-location=us

# Attach the schedule to a persistent disk
gcloud compute disks add-resource-policies prod-web-server \
  --zone=us-central1-a \
  --resource-policies=daily-snapshots

# Take a manual snapshot before risky operations
gcloud compute disks snapshot prod-web-server \
  --zone=us-central1-a \
  --snapshot-names=pre-agent-operation-$(date +%Y%m%d-%H%M%S) \
  --storage-location=us

# List snapshots
gcloud compute snapshots list \
  --filter="sourceDisk:prod-web-server" \
  --sort-by=~creationTimestamp
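Before a risky agent operation you may want to snapshot every disk attached to an instance, not just the boot disk. A minimal bash sketch (the instance and zone names are hypothetical; `value(disks[].source)` returns full disk URLs joined by semicolons, so the loop extracts the bare disk names):

```shell
#!/usr/bin/env bash
# Sketch: snapshot all disks attached to an instance before a risky change.
set -euo pipefail
INSTANCE=prod-web-server   # hypothetical instance name
ZONE=us-central1-a
STAMP=$(date +%Y%m%d-%H%M%S)

# disks[].source is a full URL; keep only the disk name after the last slash
for disk in $(gcloud compute instances describe "$INSTANCE" --zone="$ZONE" \
    --format="value(disks[].source)" | tr ';' '\n' | awk -F/ '{print $NF}'); do
  gcloud compute disks snapshot "$disk" \
    --zone="$ZONE" \
    --snapshot-names="pre-agent-${disk}-${STAMP}" \
    --storage-location=us
done
```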
gcloud - Restore from snapshot
# Create a new disk from a snapshot
gcloud compute disks create restored-prod-disk \
  --zone=us-central1-a \
  --source-snapshot=pre-agent-operation-20260320-140000

# Create a new instance using the restored disk
gcloud compute instances create restored-prod-server \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --disk=name=restored-prod-disk,boot=yes \
  --deletion-protection

Cloud SQL Automated Backups and PITR

Cloud SQL offers automated daily backups and point-in-time recovery (PITR) for precise restoration:

gcloud - Cloud SQL backup configuration
# Configure automated backups with PITR
# (--enable-bin-log enables PITR on MySQL; --enable-point-in-time-recovery
# applies to PostgreSQL and SQL Server; use the flag that matches your engine)
gcloud sql instances patch prod-database \
  --backup-start-time=02:00 \
  --enable-bin-log \
  --enable-point-in-time-recovery \
  --retained-backups-count=30 \
  --retained-transaction-log-days=7

# Take an on-demand backup before risky operations
gcloud sql backups create \
  --instance=prod-database \
  --description="Pre-agent-operation backup"

# List available backups
gcloud sql backups list --instance=prod-database

# Restore to a specific point in time
gcloud sql instances clone prod-database prod-database-restored \
  --point-in-time="2026-03-20T14:00:00.000Z"

# Restore from a specific backup
gcloud sql backups restore BACKUP_ID \
  --restore-instance=prod-database
Terraform - Cloud SQL with comprehensive backups
resource "google_sql_database_instance" "main" {
  name             = "prod-database"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier = "db-custom-4-16384"

    # API-level protection: blocks deletion requests from any client,
    # not just Terraform
    deletion_protection_enabled = true

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "02:00"
      transaction_log_retention_days = 7

      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }
  }

  # Terraform-level protection: blocks deletion via terraform destroy
  deletion_protection = true

  lifecycle {
    prevent_destroy = true
  }
}

GCS Versioning and Lifecycle Policies

GCS versioning keeps previous versions of objects, allowing recovery after accidental overwrites or deletions:

gcloud - GCS versioning and lifecycle
# Enable versioning
gcloud storage buckets update gs://company-critical-data --versioning

# Apply lifecycle rules: move noncurrent versions to Nearline after 30 days,
# then delete them after 90 days (keeps costs manageable while still
# providing a recovery window)
gcloud storage buckets update gs://company-critical-data \
  --lifecycle-file=lifecycle.json
JSON - lifecycle.json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {
        "isLive": false,
        "daysSinceNoncurrentTime": 90
      }
    },
    {
      "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
      "condition": {
        "isLive": false,
        "daysSinceNoncurrentTime": 30
      }
    }
  ]
}
gcloud - Recover deleted GCS objects
# List all versions of an object (including noncurrent and deleted ones)
gcloud storage ls --all-versions gs://company-critical-data/important.csv

# Restore a specific previous version
gcloud storage cp \
  gs://company-critical-data/important.csv#1234567890123456 \
  gs://company-critical-data/important.csv

# Recover from soft delete (if enabled)
gcloud storage objects restore gs://company-critical-data/deleted-file.csv \
  --generation=1234567890123456

Cross-Region Backup Strategies

For disaster recovery, replicate backups across regions:

Service         | Cross-Region Strategy                  | Configuration
Compute Engine  | Multi-region snapshot storage          | --storage-location=us (multi-region)
Cloud SQL       | Cross-region read replicas + backups   | Create replica in another region
Cloud Storage   | Dual-region or multi-region buckets    | --location=US or --location=NAM4
BigQuery        | Cross-region dataset copies            | bq cp across regions
GKE             | Backup for GKE (cross-region restore)  | Backup plan with remote target
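For the BigQuery row, cross-region copies use the bq CLI. A minimal sketch (the project and dataset names are hypothetical; the target dataset must exist in the destination region before the copy):

```shell
# Create a backup dataset in another region (hypothetical names)
bq mk --dataset --location=us-east1 my-project:backup_us_east1

# Copy a table from the source dataset into the cross-region backup dataset
bq cp my-project:analytics.events my-project:backup_us_east1.events
```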
gcloud - Cross-region Cloud SQL replica
# Create a cross-region read replica (also serves as backup)
gcloud sql instances create prod-database-replica \
  --master-instance-name=prod-database \
  --region=us-east1 \
  --tier=db-custom-4-16384 \
  --deletion-protection

# In case of primary failure, promote the replica
gcloud sql instances promote-replica prod-database-replica

Recovery Procedures After Agent-Caused Deletion

If an agent manages to delete a resource despite guardrails, follow these recovery steps:

  1. Identify What Was Deleted

    Check Cloud Audit Logs immediately. Filter for the agent's service account and the time window. The audit log entry contains the exact resource name, method, and result.

  2. Check for Soft Delete / Grace Period

    Many GCP resources have a soft-delete grace period. Projects have a 30-day recovery window. Cloud Storage with soft delete keeps objects for the configured duration. Act quickly.

  3. Restore from Backup

    If the resource is permanently deleted, restore from your most recent backup: disk snapshots for VMs, Cloud SQL backups for databases, GCS versioning for storage objects.

  4. Verify and Validate

    After restoration, verify that the recovered resource matches the expected state. Run application health checks, validate data integrity, and confirm connectivity.

  5. Close the Gap

    Investigate how the agent bypassed your guardrails. Tighten IAM permissions, add missing deny policies, enable deletion protection on the recovered resource, and update your monitoring.
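Step 1 above can be sketched as a Cloud Audit Logs query. The service-account address below is hypothetical; `protoPayload.methodName:"delete"` is a substring match that catches methods like `compute.instances.delete`:

```shell
# Find delete operations by the agent's service account in the last 2 hours
# (agent-sa@my-project.iam.gserviceaccount.com is a hypothetical principal)
gcloud logging read \
  'protoPayload.authenticationInfo.principalEmail="agent-sa@my-project.iam.gserviceaccount.com"
   AND protoPayload.methodName:"delete"' \
  --freshness=2h \
  --format="table(timestamp, protoPayload.methodName, protoPayload.resourceName)"
```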

gcloud - Recover a deleted project (within 30 days)
# List projects pending deletion
gcloud projects list --filter="lifecycleState=DELETE_REQUESTED"

# Restore a project (within 30-day grace period)
gcloud projects undelete prod-web-app

# Immediately add a lien to prevent re-deletion
gcloud alpha resource-manager liens create \
  --project=prod-web-app \
  --restrictions=resourcemanager.projects.delete \
  --reason="Recovered from accidental deletion - now protected"

Disaster Recovery Testing

Regularly test your backup and recovery procedures to ensure they work when needed:

DR testing checklist:
  • Test restoring a VM from a snapshot in a separate project
  • Test Cloud SQL point-in-time recovery to a clone
  • Test GCS object recovery from versioning
  • Test project undelete in a sandbox environment
  • Measure Recovery Time Objective (RTO) for each service
  • Verify data integrity after each recovery test
  • Document the procedure and keep it updated
  • Run DR tests at least quarterly
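A snapshot-restore drill from the checklist can be timed with a short script to estimate RTO. A sketch under stated assumptions (all resource names are hypothetical; run it in a sandbox project, and note the cleanup deletes only the drill resources):

```shell
#!/usr/bin/env bash
# Sketch: time a snapshot-restore drill to estimate RTO for Compute Engine.
set -euo pipefail
SNAPSHOT=pre-agent-operation-20260320-140000   # hypothetical snapshot name
ZONE=us-central1-a

SECONDS=0
gcloud compute disks create dr-test-disk \
  --zone="$ZONE" \
  --source-snapshot="$SNAPSHOT"
gcloud compute instances create dr-test-vm \
  --zone="$ZONE" \
  --machine-type=e2-medium \
  --disk=name=dr-test-disk,boot=yes
echo "Restore completed in ${SECONDS}s"

# Clean up the drill resources afterwards
gcloud compute instances delete dr-test-vm --zone="$ZONE" --quiet
gcloud compute disks delete dr-test-disk --zone="$ZONE" --quiet
```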