Advanced

Quality Assurance in BMAD

Build a comprehensive AI quality assurance framework covering testing, evaluation metrics, regression testing, human evaluation, bias detection, safety testing, and production monitoring.

AI Quality Assurance Framework

Traditional QA tests for correctness — the output either matches the expected result or it does not. AI QA tests for quality on a spectrum, where outputs may be acceptable, good, or excellent, and the same input can produce different outputs each time.

💡
Key insight: AI QA is statistical, not binary. You are measuring the probability that your AI system produces acceptable output, not proving it always does. A 95% accuracy rate means 1 in 20 outputs may be problematic.

Testing AI Outputs

BMAD defines three layers of AI testing:

  1. Automated Evaluation

    Run prompts against labeled test datasets and measure accuracy, completeness, and format compliance automatically. This is your first line of defense.

  2. LLM-as-Judge

    Use a separate AI model to evaluate the quality of another model's output. Cost-effective for large test sets, though less reliable than human evaluation.

  3. Human Evaluation

    Subject matter experts review a sample of AI outputs for quality, accuracy, and appropriateness. The gold standard, but expensive and slow.

Python - Automated Evaluation
class AIEvaluator:
    def evaluate(self, prompt, test_dataset):
        results = []
        for case in test_dataset:
            output = llm.call(prompt, case.input)
            score = {
                "accuracy": self.check_accuracy(
                    output, case.expected
                ),
                "format": self.check_format(
                    output, case.schema
                ),
                "latency": output.latency_ms,
                "tokens": output.total_tokens,
            }
            results.append(score)

        return {
            "accuracy": mean(r["accuracy"] for r in results),
            "format_compliance": mean(r["format"] for r in results),
            "avg_latency": mean(r["latency"] for r in results),
            "total_tests": len(results),
        }

Evaluation Metrics

Metric What It Measures Target Range
Accuracy Percentage of outputs matching expected results 85-99% depending on use case
Hallucination Rate Percentage of outputs containing fabricated information <5% for factual tasks
Latency (p50/p95/p99) Response time at different percentiles Varies by feature requirements
Format Compliance Percentage of outputs matching expected structure >98%
Cost per Request Average API cost per inference call Set per business requirements
Consistency How similar are outputs for the same input across runs >90% for deterministic tasks

Regression Testing for Prompts

When you update a prompt, ensure the new version does not degrade quality on previously passing cases:

CI Pipeline - Prompt Regression Test
name: Prompt Regression Test
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  regression-test:
    steps:
      - name: Run evaluation suite
        run: python eval/run_tests.py --prompt $CHANGED_PROMPT
      - name: Compare with baseline
        run: python eval/compare.py --threshold 0.02
        # Fail if accuracy drops more than 2%

Human Evaluation Workflows

Structure human evaluation for consistency and efficiency:

📋

Rating Rubrics

Define clear scoring criteria (1-5 scale) for each quality dimension. Train evaluators on the rubric before they begin.

👥

Inter-Rater Agreement

Have multiple evaluators rate the same outputs. Measure agreement (Cohen's kappa) to ensure consistency.

📈

Sampling Strategy

Evaluate a representative sample (100-500 outputs) rather than every output. Stratify by input type and difficulty.

🔄

Continuous Sampling

In production, randomly sample outputs for ongoing human review. Set up alerts when quality scores trend downward.

Bias Detection

Test your AI system for unfair bias across demographic groups, sensitive topics, and edge cases:

Python - Bias Testing
def test_demographic_parity(prompt, test_pairs):
    """Test if outputs differ unfairly across groups."""
    results = {}
    for group, inputs in test_pairs.items():
        outputs = [llm.call(prompt, inp) for inp in inputs]
        results[group] = {
            "positive_rate": count_positive(outputs) / len(outputs),
            "avg_sentiment": mean_sentiment(outputs),
            "avg_length": mean(len(o) for o in outputs),
        }

    # Flag significant differences between groups
    max_diff = max_parity_difference(results)
    assert max_diff < 0.1, \
        f"Parity difference {max_diff} exceeds threshold"

Safety Testing

Ensure your AI system handles adversarial inputs and edge cases safely:

  • Prompt injection testing: Verify the system resists attempts to override system prompts or instructions.
  • Harmful content filtering: Test that the system refuses to generate harmful, illegal, or inappropriate content.
  • Data leakage testing: Ensure the system does not reveal sensitive training data, API keys, or system prompts.
  • Boundary testing: Test with extremely long inputs, empty inputs, special characters, and multiple languages.

Monitoring in Production

Set up dashboards and alerts for ongoing AI quality monitoring:

Monitoring Dashboard Metrics
Real-Time Metrics:
  - Request volume (requests/min)
  - Error rate (% of failed calls)
  - Latency (p50, p95, p99)
  - Token usage (input + output)
  - Cost accumulation ($/hour)

Quality Metrics (hourly):
  - Format compliance rate
  - Average output length
  - User feedback scores (thumbs up/down)
  - Escalation rate to humans

Alerts:
  - Error rate > 5% for 5 minutes
  - p95 latency > 10 seconds
  - Daily cost exceeds budget by 20%
  - User satisfaction drops below 80%
  - Model provider reports degradation
Pro tip: Save a sample of production inputs and outputs daily. Use these to build your regression test dataset over time. Real-world data is invaluable for catching edge cases you would not think to test for.