Recovery
Recovery in AI systems goes beyond restoring service. It may require retraining models, remediating data, and rebuilding trust — processes that can take days or weeks rather than minutes.
Model Retraining and Remediation
When the root cause is in the model itself (poisoned data, faulty fine-tuning, or discovered backdoors), recovery requires retraining:
-
Root Cause Identification
Before retraining, determine exactly what caused the incident. Was it training data contamination, a configuration error, a supply chain compromise, or adversarial exploitation? The fix must address the root cause.
-
Data Remediation
If training data was compromised, identify and remove poisoned samples. Verify the integrity of remaining data. This may require re-auditing the entire training dataset.
-
Model Retraining
Retrain the model on clean data with verified integrity. Run extended evaluation suites including adversarial test cases derived from the incident.
-
Validation and Testing
Validate the new model specifically against the failure mode that caused the incident. Create regression tests that would catch this specific issue.
Gradual Traffic Restoration
| Phase | Traffic % | Duration | Exit Criteria |
|---|---|---|---|
| Canary | 1-5% | 2-4 hours | No anomalies in monitoring dashboards |
| Limited | 10-25% | 4-8 hours | Metrics match or exceed pre-incident baseline |
| Majority | 50-75% | 8-24 hours | No user reports; regression tests passing |
| Full | 100% | Ongoing monitoring | 72 hours of stable operation |
Post-Incident Analysis
Conduct a blameless post-incident review (PIR) within 72 hours of resolution:
# Post-Incident Review Template
## Incident Summary
- Incident ID: AI-IR-2026-042
- Severity: SEV-2
- Duration: 6 hours (detection to resolution)
- Impact: ~12,000 users received degraded responses
## Timeline
- 14:23 UTC - Monitoring alert triggered
- 14:31 UTC - On-call engineer confirmed
- 14:45 UTC - Severity upgraded to SEV-2
- 15:10 UTC - Model rolled back to v3.1
- 20:15 UTC - Root cause identified
- 20:30 UTC - Incident resolved
## Root Cause
Fine-tuning dataset contained adversarially
crafted samples that introduced a backdoor.
## Action Items
1. [P0] Add adversarial sample detection to
data pipeline
2. [P1] Implement model behavioral diff testing
3. [P2] Update fine-tuning data review process