Accuracy, Precision, Recall, and F1
Understanding and choosing the right classification metrics. Part of the AI Model Testing Fundamentals course at AI School by Lilly Tech Systems.
Classification Metrics Demystified
Choosing the right metric is one of the most important decisions in AI testing. A model can score 99% accuracy and still be completely useless. Understanding when and why this happens is critical for every ML practitioner. This lesson breaks down the four fundamental classification metrics, when to use each one, and common pitfalls that lead to misleading results.
Accuracy: The Simplest Metric
Accuracy measures the proportion of correct predictions out of all predictions made. It is calculated as (True Positives + True Negatives) / Total Predictions. While intuitive, accuracy is deeply misleading for imbalanced datasets.
Consider a fraud detection system where only 1% of transactions are fraudulent. A model that always predicts "not fraud" achieves 99% accuracy while catching zero actual fraud cases. This is why accuracy alone is never sufficient for evaluating AI models in real-world applications.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] # 90% class 0, 10% class 1
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # Always predicts class 0
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}") # 0.90
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.2f}") # 0.00
print(f"Recall: {recall_score(y_true, y_pred):.2f}") # 0.00
print(f"F1 Score: {f1_score(y_true, y_pred):.2f}") # 0.00
Precision: When False Positives Are Costly
Precision measures what proportion of positive predictions were actually correct: True Positives / (True Positives + False Positives). High precision means that when the model says "yes," it is usually right. Precision matters most when the cost of a false positive is high (a worked sketch follows the list below):
- Spam filtering — Marking a legitimate email as spam means the user misses important communication
- Content moderation — Incorrectly flagging harmless content frustrates users and censors free expression
- Medical diagnosis — False positives lead to unnecessary treatments with potential side effects
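To make the definition concrete, here is a minimal sketch on made-up spam-filter labels: two legitimate emails are wrongly flagged, and precision drops to 0.50 even though most predictions are correct. The labels are purely illustrative.
from sklearn.metrics import confusion_matrix, precision_score
# Toy spam-filter labels (illustrative): 1 = spam, 0 = legitimate
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]  # two legitimate emails wrongly flagged as spam
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # binary case unpacks as tn, fp, fn, tp
print(f"Precision = TP / (TP + FP) = {tp} / {tp + fp} = {precision_score(y_true, y_pred):.2f}")
# Precision = TP / (TP + FP) = 2 / 4 = 0.50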
Recall: When False Negatives Are Costly
Recall (also called sensitivity or true positive rate) measures what proportion of actual positives the model caught: True Positives / (True Positives + False Negatives). High recall means the model misses very few positive cases. Recall matters most when missing a positive case is dangerous (a worked sketch follows the list below):
- Cancer screening — Missing a cancer diagnosis can be life-threatening
- Fraud detection — Missing fraudulent transactions costs money and erodes trust
- Security threat detection — Missing a real threat can have severe consequences
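The same idea from the recall side, again with made-up screening labels: the model misses half of the true positive cases, so recall falls to 0.50 regardless of how well the negatives are handled.
from sklearn.metrics import confusion_matrix, recall_score
# Toy screening labels (illustrative): 1 = condition present, 0 = condition absent
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]  # two positive cases missed
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Recall = TP / (TP + FN) = {tp} / {tp + fn} = {recall_score(y_true, y_pred):.2f}")
# Recall = TP / (TP + FN) = 2 / 4 = 0.50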
The Precision-Recall Tradeoff
Precision and recall are typically in tension: improving one usually comes at the expense of the other. Increasing the classification threshold makes the model more selective (higher precision, lower recall), while decreasing the threshold makes the model more inclusive (higher recall, lower precision). Your business context determines where to set this tradeoff.
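A quick way to see the tradeoff is to sweep the decision threshold over predicted probabilities. The scores below are made-up values rather than output from a real model, but the pattern they show is the general one.
import numpy as np
from sklearn.metrics import precision_score, recall_score
# Illustrative ground truth and predicted probabilities
y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.4, 0.35, 0.6, 0.2, 0.1, 0.55, 0.3])
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.3  precision=0.62  recall=1.00   <- inclusive: catches every positive, more false alarms
# threshold=0.5  precision=0.80  recall=0.80
# threshold=0.7  precision=1.00  recall=0.60   <- selective: no false alarms, but misses positives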
F1 Score: Balancing Precision and Recall
The F1 score is the harmonic mean of precision and recall: 2 * (Precision * Recall) / (Precision + Recall). It provides a single number that balances both metrics. The F1 score is particularly useful when you need a balanced metric but the dataset is imbalanced.
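A short numerical sketch shows why the harmonic mean matters: with a hypothetical precision of 1.00 and recall of 0.10, a plain average would look respectable, while F1 exposes the weak recall.
# Hypothetical precision/recall pair: very precise, but misses most positives
precision, recall = 1.00, 0.10
arithmetic_mean = (precision + recall) / 2            # 0.55 -- looks deceptively healthy
f1 = 2 * precision * recall / (precision + recall)    # ~0.18 -- dominated by the weaker value
print(f"Arithmetic mean: {arithmetic_mean:.2f}, F1: {f1:.2f}")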
# Comprehensive classification evaluation
from sklearn.metrics import classification_report
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1]
report = classification_report(y_true, y_pred, target_names=["Negative", "Positive"])
print(report)
# Output:
#               precision    recall  f1-score   support
#
#     Negative       0.71      0.83      0.77         6
#     Positive       0.88      0.78      0.82         9
#
#     accuracy                           0.80        15
#    macro avg       0.79      0.81      0.80        15
# weighted avg       0.81      0.80      0.80        15
Beyond Basic Metrics
For multi-class problems, you have additional averaging strategies, compared in the sketch after this list:
- Macro average — Calculate the metric for each class independently, then average. Treats all classes equally regardless of size.
- Weighted average — Calculate the metric for each class, then take a weighted average by class support. Accounts for class imbalance.
- Micro average — Aggregate all true positives, false positives, and false negatives globally, then calculate the metric.
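The sketch below compares the three averaging strategies on a small made-up three-class example; the exact labels are arbitrary, but the gap between the macro and weighted scores is typical when one class dominates.
from sklearn.metrics import f1_score
# Imbalanced three-class toy data (illustrative): class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.2f}")     # 0.75 -- every class counts equally
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.2f}")  # 0.81 -- pulled up by the large class
print(f"Micro F1:    {f1_score(y_true, y_pred, average='micro'):.2f}")     # 0.80 -- equals accuracy for single-label tasks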
Additionally, the area under the ROC curve (AUC-ROC) provides a threshold-independent measure of how well the model separates the classes. The precision-recall curve, and its area (average precision), is more informative than ROC for imbalanced datasets. The Matthews Correlation Coefficient (MCC) provides a single balanced measure even with very imbalanced classes.
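A brief sketch of those three measures on made-up scores; the numbers are illustrative, not from a trained model.
from sklearn.metrics import roc_auc_score, average_precision_score, matthews_corrcoef
# Illustrative labels and predicted probabilities for an imbalanced problem
y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.15, 0.3, 0.45, 0.05, 0.6, 0.7, 0.4, 0.9]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]  # hard labels at a 0.5 threshold, needed for MCC
print(f"ROC AUC:           {roc_auc_score(y_true, y_score):.2f}")            # 0.90
print(f"Average precision: {average_precision_score(y_true, y_score):.2f}")  # 0.87 -- area under the PR curve
print(f"MCC:               {matthews_corrcoef(y_true, y_pred):.2f}")         # 0.52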
Choosing the Right Metric for Your Use Case
There is no single "best" metric. The right choice depends entirely on your business context, the cost of different error types, and the class distribution of your data. As a general guide:
- Use accuracy only for balanced datasets
- Use precision when false positives are expensive
- Use recall when false negatives are dangerous
- Use F1 when you need a balance of precision and recall
- Always examine per-class metrics, regardless of which aggregate metric you choose