Beginner

Accuracy, Precision, Recall, and F1

Understanding and choosing the right classification metrics. Part of the AI Model Testing Fundamentals course at AI School by Lilly Tech Systems.

Classification Metrics Demystified

Choosing the right metric is one of the most important decisions in AI testing. A model can score 99% accuracy and still be completely useless. Understanding when and why this happens is critical for every ML practitioner. This lesson breaks down the four fundamental classification metrics, when to use each one, and common pitfalls that lead to misleading results.

Accuracy: The Simplest Metric

Accuracy measures the proportion of correct predictions out of all predictions made. It is calculated as (True Positives + True Negatives) / Total Predictions. While intuitive, accuracy can be deeply misleading on imbalanced datasets.

Consider a fraud detection system where only 1% of transactions are fraudulent. A model that always predicts "not fraud" achieves 99% accuracy while catching zero actual fraud cases. This is why accuracy alone is never sufficient for evaluating AI models in real-world applications.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # 90% class 0, 10% class 1
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # Always predicts class 0

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.90
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
print(f"Recall: {recall_score(y_true, y_pred):.2f}")  # 0.00
print(f"F1 Score: {f1_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00

Precision: When False Positives Are Costly

Precision measures what proportion of positive predictions were actually correct: True Positives / (True Positives + False Positives). High precision means when the model says "yes," it is usually right. Precision matters most when the cost of a false positive is high:

  • Spam filtering — Marking a legitimate email as spam means the user misses important communication
  • Content moderation — Incorrectly flagging harmless content frustrates users and censors free expression
  • Medical diagnosis — False positives lead to unnecessary treatments with potential side effects
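To make the formula concrete, here is a minimal sketch that computes precision from raw true-positive and false-positive counts and checks it against scikit-learn. The spam-filter labels are invented for illustration, not taken from the lesson's datasets:

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam (toy labels)
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # model flags 4 emails as spam

# Count true positives and false positives by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1

print(tp / (tp + fp))                  # 0.75 — manual TP / (TP + FP)
print(precision_score(y_true, y_pred)) # 0.75 — matches sklearn
```

Of the four emails the model flagged, three were actually spam, so precision is 3/4.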

Recall: When False Negatives Are Costly

Recall (also called sensitivity or true positive rate) measures what proportion of actual positives the model caught: True Positives / (True Positives + False Negatives). High recall means the model misses very few positive cases. Recall matters most when missing a positive case is dangerous:

  • Cancer screening — Missing a cancer diagnosis can be life-threatening
  • Fraud detection — Missing fraudulent transactions costs money and erodes trust
  • Security threat detection — Missing a real threat can have severe consequences
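The same kind of sketch works for recall, counting how many actual positives the model missed. The screening labels below are invented for illustration:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 actual positives (toy labels)
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # model catches only 2 of them

# Count true positives and false negatives by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 2
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 2 missed

print(tp / (tp + fn))               # 0.5 — manual TP / (TP + FN)
print(recall_score(y_true, y_pred)) # 0.5 — matches sklearn
```

Two of the four real positives were missed, so recall is only 0.5, regardless of how many false positives the model also produced.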

The Precision-Recall Tradeoff

Precision and recall typically pull in opposite directions. Raising the classification threshold makes the model more selective (higher precision, lower recall); lowering it makes the model more inclusive (higher recall, lower precision). Your business context determines where to set this tradeoff.
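The tradeoff is easy to see by sweeping the decision threshold over a set of predicted scores. The scores below are invented for illustration:

```python
from sklearn.metrics import precision_score, recall_score

y_true  = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.65, 0.7, 0.8, 0.9]

# Convert scores to hard predictions at three different thresholds
for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")

# threshold=0.3  precision=0.75  recall=1.00
# threshold=0.5  precision=0.80  recall=0.67
# threshold=0.7  precision=1.00  recall=0.50
```

As the threshold rises, precision climbs while recall falls; in practice you would use sklearn's precision_recall_curve to trace the full tradeoff rather than a handful of thresholds.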

💡
Practical advice: Before choosing a metric, ask stakeholders: "What is worse — a false positive or a false negative?" The answer determines whether you optimize for precision or recall. Document this decision as part of your model specification.

F1 Score: Balancing Precision and Recall

The F1 score is the harmonic mean of precision and recall: 2 * (Precision * Recall) / (Precision + Recall). It provides a single number that balances both metrics. The F1 score is particularly useful when you need a balanced metric but the dataset is imbalanced.

# Comprehensive classification evaluation
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1]

report = classification_report(y_true, y_pred, target_names=["Negative", "Positive"])
print(report)

# Output:
#               precision    recall  f1-score   support
#     Negative       0.71      0.83      0.77         6
#     Positive       0.88      0.78      0.82         9
#     accuracy                           0.80        15
#    macro avg       0.79      0.81      0.80        15
# weighted avg       0.81      0.80      0.80        15

Beyond Basic Metrics

For multi-class problems, you have additional averaging strategies:

  1. Macro average — Calculate the metric for each class independently, then average. Treats all classes equally regardless of size.
  2. Weighted average — Calculate the metric for each class, then take a weighted average by class support. Accounts for class imbalance.
  3. Micro average — Aggregate all true positives, false positives, and false negatives globally, then calculate the metric.
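The three strategies can be compared directly through the average parameter of sklearn's metric functions. The three-class labels below are invented for illustration, with class 0 deliberately dominant:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]  # class 0 dominates (toy labels)
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]

# Macro: unweighted mean of per-class F1 — small classes count fully
print(f"macro:    {f1_score(y_true, y_pred, average='macro'):.2f}")     # 0.74
# Weighted: mean of per-class F1 weighted by class support
print(f"weighted: {f1_score(y_true, y_pred, average='weighted'):.2f}")  # 0.81
# Micro: global TP/FP/FN counts — equals accuracy for single-label tasks
print(f"micro:    {f1_score(y_true, y_pred, average='micro'):.2f}")     # 0.80
```

Because the minority classes score worse than the dominant class here, the macro average is noticeably lower than the weighted one, which is exactly the signal it is designed to give.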

Additionally, AUC-ROC (the area under the ROC curve) provides a threshold-independent measure of how well the model separates the classes. The precision-recall curve is more informative than ROC for imbalanced datasets, and the Matthews correlation coefficient (MCC) remains a balanced measure even with highly imbalanced classes.
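A brief sketch of these metrics with scikit-learn, using invented scores. Note that AUC-ROC takes raw scores, while MCC takes hard predictions:

```python
from sklearn.metrics import roc_auc_score, matthews_corrcoef

y_true  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.4, 0.45, 0.6, 0.7, 0.75, 0.9]  # toy probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]      # thresholded at 0.5

# Threshold-independent: probability a random positive outranks a random negative
print(roc_auc_score(y_true, y_score))   # 0.75
# Correlation between predictions and truth; +1 perfect, 0 random, -1 inverted
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```

MCC uses all four confusion-matrix cells, which is why it stays informative even when one class is rare.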

Critical: Always report metrics per class, not just averages. A model may have excellent average F1 but perform terribly on minority classes. Per-class metrics reveal these hidden failures.
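Passing average=None to sklearn's metric functions returns one score per class, which makes exactly this kind of hidden failure visible. The imbalanced toy labels below are invented for illustration:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # minority class has 2 examples
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # model misses half of them

# average=None returns an array with one F1 per class
per_class = f1_score(y_true, y_pred, average=None)
print(per_class)  # class 0 ≈ 0.94, class 1 ≈ 0.67
```

The weighted average here looks healthy, but the per-class breakdown shows the minority class lagging far behind the majority class.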

Choosing the Right Metric for Your Use Case

There is no single "best" metric. The right choice depends entirely on your business context, the cost of different error types, and the class distribution of your data. As a general guide:

  • Use accuracy only for balanced datasets
  • Use precision when false positives are expensive
  • Use recall when false negatives are dangerous
  • Use F1 when you need a balance of both
  • Always examine per-class metrics, regardless of which aggregate metric you choose