Advanced

Recovery

Recovery in AI systems goes beyond restoring service. It may require retraining models, remediating data, and rebuilding trust — processes that can take days or weeks rather than minutes.

Model Retraining and Remediation

When the root cause is in the model itself (poisoned data, faulty fine-tuning, or discovered backdoors), recovery requires retraining:

  1. Root Cause Identification

    Before retraining, determine exactly what caused the incident. Was it training data contamination, a configuration error, a supply chain compromise, or adversarial exploitation? The fix must address the root cause.

  2. Data Remediation

    If training data was compromised, identify and remove poisoned samples. Verify the integrity of remaining data. This may require re-auditing the entire training dataset.

  3. Model Retraining

    Retrain the model on clean data with verified integrity. Run extended evaluation suites including adversarial test cases derived from the incident.

  4. Validation and Testing

    Validate the new model specifically against the failure mode that caused the incident. Create regression tests that would catch this specific issue.

Retraining Time: Large model retraining can take days to weeks and cost thousands of dollars in compute. Plan for this in your recovery timeline. For LLMs served via API, coordinate with your provider for model-level fixes.

Gradual Traffic Restoration

Phase Traffic % Duration Exit Criteria
Canary 1-5% 2-4 hours No anomalies in monitoring dashboards
Limited 10-25% 4-8 hours Metrics match or exceed pre-incident baseline
Majority 50-75% 8-24 hours No user reports; regression tests passing
Full 100% Ongoing monitoring 72 hours of stable operation

Post-Incident Analysis

Conduct a blameless post-incident review (PIR) within 72 hours of resolution:

# Post-Incident Review Template
## Incident Summary
- Incident ID: AI-IR-2026-042
- Severity: SEV-2
- Duration: 6 hours (detection to resolution)
- Impact: ~12,000 users received degraded responses

## Timeline
- 14:23 UTC - Monitoring alert triggered
- 14:31 UTC - On-call engineer confirmed
- 14:45 UTC - Severity upgraded to SEV-2
- 15:10 UTC - Model rolled back to v3.1
- 20:15 UTC - Root cause identified
- 20:30 UTC - Incident resolved

## Root Cause
Fine-tuning dataset contained adversarially
crafted samples that introduced a backdoor.

## Action Items
1. [P0] Add adversarial sample detection to
   data pipeline
2. [P1] Implement model behavioral diff testing
3. [P2] Update fine-tuning data review process
💡
Looking Ahead: In the final lesson, we will bring everything together with best practices for building AI incident response playbooks, conducting tabletop exercises, and continuously improving your IR capabilities.