Advanced

Recovery

Recovery in AI systems goes beyond restoring service. It may require retraining models, remediating data, and rebuilding trust — processes that can take days or weeks rather than minutes.

Model Retraining and Remediation

When the root cause is in the model itself (poisoned data, faulty fine-tuning, or discovered backdoors), recovery requires retraining:

Root Cause Identification

Before retraining, determine exactly what caused the incident. Was it training data contamination, a configuration error, a supply chain compromise, or adversarial exploitation? The fix must address the root cause.
Data Remediation

If training data was compromised, identify and remove poisoned samples. Verify the integrity of remaining data. This may require re-auditing the entire training dataset.
Model Retraining

Retrain the model on clean data with verified integrity. Run extended evaluation suites including adversarial test cases derived from the incident.
Validation and Testing

Validate the new model specifically against the failure mode that caused the incident. Create regression tests that would catch this specific issue.

⚠

Retraining Time: Large model retraining can take days to weeks and cost thousands of dollars in compute. Plan for this in your recovery timeline. For LLMs served via API, coordinate with your provider for model-level fixes.

Gradual Traffic Restoration

Phase	Traffic %	Duration	Exit Criteria
Canary	1-5%	2-4 hours	No anomalies in monitoring dashboards
Limited	10-25%	4-8 hours	Metrics match or exceed pre-incident baseline
Majority	50-75%	8-24 hours	No user reports; regression tests passing
Full	100%	Ongoing monitoring	72 hours of stable operation

Post-Incident Analysis

Conduct a blameless post-incident review (PIR) within 72 hours of resolution:

# Post-Incident Review Template
## Incident Summary
- Incident ID: AI-IR-2026-042
- Severity: SEV-2
- Duration: 6 hours (detection to resolution)
- Impact: ~12,000 users received degraded responses

## Timeline
- 14:23 UTC - Monitoring alert triggered
- 14:31 UTC - On-call engineer confirmed
- 14:45 UTC - Severity upgraded to SEV-2
- 15:10 UTC - Model rolled back to v3.1
- 20:15 UTC - Root cause identified
- 20:30 UTC - Incident resolved

## Root Cause
Fine-tuning dataset contained adversarially
crafted samples that introduced a backdoor.

## Action Items
1. [P0] Add adversarial sample detection to
   data pipeline
2. [P1] Implement model behavioral diff testing
3. [P2] Update fine-tuning data review process

💡

Looking Ahead: In the final lesson, we will bring everything together with best practices for building AI incident response playbooks, conducting tabletop exercises, and continuously improving your IR capabilities.

← Previous Containment Next → Best Practices

Recovery

Model Retraining and Remediation

Root Cause Identification

Data Remediation

Model Retraining

Validation and Testing

Gradual Traffic Restoration

Post-Incident Analysis