Stress Testing
Push your models to their limits with load testing, automated edge case generation, boundary analysis, and systematic stress testing pipelines that reveal hidden failure modes.
What is ML Stress Testing?
Stress testing goes beyond standard evaluation by deliberately pushing models into extreme conditions. While perturbation testing focuses on input-level manipulations, stress testing examines system-level behavior under adverse conditions: high load, extreme inputs, resource constraints, and cascading failures.
Load Testing ML Endpoints
Production ML services must handle variable traffic without degradation. Load testing reveals performance bottlenecks and failure thresholds.
```python
from locust import HttpUser, task, between


class MLEndpointUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(3)
    def predict_normal(self):
        """Normal-sized prediction request."""
        self.client.post("/predict", json={
            "features": [0.5] * 128
        })

    @task(1)
    def predict_large_batch(self):
        """Large batch prediction request."""
        self.client.post("/predict/batch", json={
            "instances": [[0.5] * 128] * 100
        })

    @task(1)
    def predict_edge_case(self):
        """Extreme value inputs."""
        self.client.post("/predict", json={
            "features": [1e10] * 128
        })
```
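To make "failure thresholds" more concrete, the following is a minimal standard-library sketch of how latency percentiles might be collected at increasing concurrency. It is not a replacement for a load testing tool. `fake_predict`, `timed_call`, and `run_load` are hypothetical names introduced here, and `fake_predict` only simulates work with a short sleep; in practice it would be swapped for a real request to the endpoint.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def fake_predict(features):
    """Hypothetical stand-in for a model call; replace with a real request."""
    time.sleep(0.001)  # simulate a small amount of inference work
    return sum(features) / len(features)


def timed_call(features):
    """Return the wall-clock latency of one prediction, in seconds."""
    start = time.perf_counter()
    fake_predict(features)
    return time.perf_counter() - start


def run_load(concurrency, n_requests):
    """Issue n_requests calls across `concurrency` threads and summarize latency."""
    payload = [0.5] * 128
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, [payload] * n_requests))
    # quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99
    cuts = statistics.quantiles(latencies, n=100)
    return {"count": len(latencies), "p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    for concurrency in (1, 8, 32):
        print(concurrency, run_load(concurrency, n_requests=200))
```

Plotting p95 or p99 against concurrency gives a simple degradation curve; the concurrency level at which the tail latency exceeds the service's budget is one reasonable definition of the failure threshold.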
Edge Case Generation
Automated edge case generation discovers inputs that are valid but unusual, targeting model blind spots:
| Technique | Description | Tools |
|---|---|---|
| Fuzzing | Generate random mutations of valid inputs to find crashes and errors | Hypothesis, AFL |
| Metamorphic Testing | Apply transformations whose effect on the output is known in advance (often no change at all) and check that the expected relation holds | Custom scripts |
| Property-Based Testing | Define properties that must hold for all inputs and generate counterexamples | Hypothesis, QuickCheck |
| Coverage-Guided Generation | Generate inputs that activate new neurons or code paths | DeepXplore, DLFuzz |
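As an illustration of the metamorphic testing row, here is a small sketch that uses only the standard library. It checks one relation: reordering a batch should reorder the predictions in the same way and change nothing else. `toy_predict`, `toy_predict_batch`, and `check_batch_order_invariance` are hypothetical names, and the toy classifier is an assumption standing in for a real model.

```python
import random


def toy_predict(features):
    """Hypothetical stand-in classifier: 1 if the mean feature exceeds 0.5."""
    return 1 if sum(features) / len(features) > 0.5 else 0


def toy_predict_batch(batch):
    """Predict each row of the batch independently."""
    return [toy_predict(row) for row in batch]


def check_batch_order_invariance(predict_batch, n_trials=100, seed=0):
    """Metamorphic relation: shuffling a batch must shuffle predictions identically.

    Returns the batches that violated the relation (empty if none did).
    """
    rng = random.Random(seed)
    violations = []
    for _ in range(n_trials):
        # Randomly generated batch of 5 rows with 8 features each
        batch = [[rng.random() for _ in range(8)] for _ in range(5)]
        order = list(range(len(batch)))
        rng.shuffle(order)
        original = predict_batch(batch)
        shuffled = predict_batch([batch[i] for i in order])
        if shuffled != [original[i] for i in order]:
            violations.append(batch)
    return violations


if __name__ == "__main__":
    print("violations:", len(check_batch_order_invariance(toy_predict_batch)))
```

The same harness shape works for other relations, such as adding noise below the model's expected sensitivity or duplicating a row; only the transformation and the comparison change.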
Boundary Analysis
Test model behavior at decision boundaries and input extremes:
- Numerical boundaries: Test with zero, negative, very large, very small, NaN, and infinity values.
- String boundaries: Empty strings, extremely long strings, unicode edge cases, and null bytes.
- Decision boundaries: Find inputs near classification thresholds where small changes flip the prediction.
- Temporal boundaries: Test with timestamps at epoch, year boundaries, leap seconds, and timezone transitions.
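The numerical and decision boundary checks above can be sketched as follows. This is a minimal example under stated assumptions: `toy_score` is a hypothetical single-feature logistic model, the list of boundary values is an illustrative choice, and `probe_boundaries` and `find_decision_boundary` are helper names introduced here.

```python
import math
import sys

# Boundary values of the kinds listed above; the exact set is illustrative.
NUMERIC_BOUNDARIES = [
    0.0, -0.0, -1.0, 1e10, -1e10,
    sys.float_info.max, sys.float_info.min,
    math.inf, -math.inf, math.nan,
]


def toy_score(x):
    """Hypothetical stand-in model: logistic score of a single feature."""
    if not math.isfinite(x):
        raise ValueError("non-finite input")
    # Clamp so math.exp cannot raise OverflowError on very large magnitudes
    z = max(-700.0, min(700.0, x))
    return 1.0 / (1.0 + math.exp(-z))


def probe_boundaries(score, values):
    """Label each boundary value as 'ok', 'rejected', or 'bad_output'."""
    report = {}
    for v in values:
        try:
            out = score(v)
        except (ValueError, OverflowError):
            report[repr(v)] = "rejected"  # an explicit rejection is acceptable
            continue
        ok = isinstance(out, float) and 0.0 <= out <= 1.0
        report[repr(v)] = "ok" if ok else "bad_output"
    return report


def find_decision_boundary(score, lo, hi, threshold=0.5, tol=1e-9):
    """Bisect between lo (below threshold) and hi (at or above it)."""
    if not (score(lo) < threshold <= score(hi)):
        raise ValueError("lo and hi must straddle the threshold")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) < threshold:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(probe_boundaries(toy_score, NUMERIC_BOUNDARIES))
    print("flip point near:", find_decision_boundary(toy_score, -5.0, 5.0))
```

Bisection assumes the score is monotonic along the line between `lo` and `hi`. For a real model with many features, the same idea is usually applied along the segment between one input from each class.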
Automated Stress Testing Pipeline
1. Define Stress Scenarios: Create a catalog of stress conditions, including extreme inputs, concurrent requests, resource limits, and dependency failures.
2. Generate Test Data: Use fuzzing, property-based testing, and domain knowledge to create comprehensive test suites.
3. Execute Under Monitoring: Run stress tests while monitoring latency, memory, CPU, GPU utilization, error rates, and prediction quality.
4. Analyze and Report: Identify failure thresholds, create degradation curves, and document discovered failure modes.
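The four steps above can be sketched end to end in a few dozen lines. Everything here is an assumption made for illustration: `toy_model` stands in for a real model, the scenario catalog is deliberately tiny, and only latency and error rate are monitored. A real pipeline would also track memory, CPU, GPU utilization, and prediction quality.

```python
import math
import random
import time


def toy_model(features):
    """Hypothetical stand-in model: mean of the features; rejects non-finite sums."""
    total = sum(features)
    if not math.isfinite(total):
        raise ValueError("non-finite feature sum")
    return total / len(features)


# Steps 1 and 2: a small scenario catalog, each entry pairing a name with a
# payload generator (illustrative, not exhaustive)
SCENARIOS = [
    ("normal", lambda rng: [rng.random() for _ in range(128)]),
    ("large_values", lambda rng: [1e10] * 128),
    ("contains_nan", lambda rng: [math.nan] + [0.5] * 127),
    ("overflow", lambda rng: [1.7e308] * 128),
]


def run_scenario(model, name, generate, n_requests=50, seed=0):
    """Step 3: execute one scenario while recording latency and errors."""
    rng = random.Random(seed)
    latencies, errors = [], 0
    for _ in range(n_requests):
        payload = generate(rng)
        start = time.perf_counter()
        try:
            model(payload)
        except Exception:
            errors += 1
        latencies.append(time.perf_counter() - start)
    return {
        "scenario": name,
        "error_rate": errors / n_requests,
        "max_latency_s": max(latencies),
    }


def analyze(results, max_error_rate=0.01):
    """Step 4: list scenarios whose error rate exceeds the allowed budget."""
    return [r["scenario"] for r in results if r["error_rate"] > max_error_rate]


if __name__ == "__main__":
    results = [run_scenario(toy_model, name, gen) for name, gen in SCENARIOS]
    for r in results:
        print(r)
    print("failing scenarios:", analyze(results))
```

Keeping scenarios as data rather than code makes it straightforward to grow the catalog as new failure modes are discovered and to rerun the full set on every model release.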
Lilly Tech Systems