Model Evaluation: Accuracy, Precision, Recall, F1, and ROC-AUC
Choosing the right evaluation metric is critical. Exams most often test the precision vs recall tradeoff and the cases where accuracy fails.
Confusion matrix
The confusion matrix shows the breakdown of correct and incorrect predictions.
#                   Predicted Positive    Predicted Negative
# Actual Positive   TP (True Pos)         FN (False Neg)
# Actual Negative   FP (False Pos)        TN (True Neg)

# TP: correctly predicted positive
# TN: correctly predicted negative
# FP: predicted positive, actually negative (Type I error)
# FN: predicted negative, actually positive (Type II error)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)

# Example:
# 100 patients, 10 have disease
# Model: TP=8, FN=2, FP=5, TN=85
# Accuracy: (TP+TN)/total = 93/100 = 93%
# But 90% accuracy is achievable by predicting "no disease" for everyone!
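The worked example above can be reproduced end to end. The labels below are hypothetical, constructed so the counts come out to TP=8, FN=2, FP=5, TN=85:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical cohort: 100 patients, 10 with the disease (label 1).
# Model catches 8 of the 10 positives and raises 5 false alarms.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 85)

cm = confusion_matrix(y_true, y_pred)  # rows = actual, cols = predicted
tn, fp, fn, tp = cm.ravel()            # sklearn's flattened order: TN, FP, FN, TP
accuracy = (tp + tn) / cm.sum()

print(tp, fn, fp, tn)   # 8 2 5 85
print(accuracy)         # 0.93
```

Note that sklearn puts the negative class in the first row/column, so `cm.ravel()` yields TN first, not TP.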
Precision, Recall, and F1
Precision and Recall capture different types of errors. F1 balances them.
# Precision = TP / (TP + FP)
# "Of all positive predictions, how many were actually positive?"
# High precision = fewer false alarms
# Use when: false positives are costly (spam detection, fraud alerts)

# Recall (Sensitivity) = TP / (TP + FN)
# "Of all actual positives, how many did we catch?"
# High recall = fewer misses
# Use when: false negatives are costly (cancer detection, fraud prevention)

# F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
# Harmonic mean: penalises extreme imbalance between the two
# Use when: you need a balance between precision and recall

from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
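Plugging the disease-screening numbers (TP=8, FN=2, FP=5) into these formulas makes the gap between the metrics concrete; the labels below are the same hypothetical ones used for the confusion-matrix example:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical example from above: TP=8, FN=2, FP=5, TN=85.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 85)

precision = precision_score(y_true, y_pred)  # 8 / (8 + 5)  = 0.615
recall = recall_score(y_true, y_pred)        # 8 / (8 + 2)  = 0.800
f1 = f1_score(y_true, y_pred)                # 2*TP / (2*TP + FP + FN) = 16/23

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Despite 93% accuracy, precision is only about 0.62: nearly 4 in 10 positive predictions are false alarms.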
ROC-AUC
ROC curve plots True Positive Rate vs False Positive Rate at different thresholds. AUC summarises it as a single number.
# TPR (True Positive Rate / Recall) = TP / (TP + FN)
# FPR (False Positive Rate) = FP / (FP + TN)
# ROC curve: plot TPR vs FPR at all thresholds

# AUC (Area Under the Curve):
# - 1.0: perfect model
# - 0.5: random guessing (diagonal line)
# - < 0.5: worse than random

from sklearn.metrics import roc_auc_score, roc_curve

# AUC is threshold-independent: it measures overall discriminative ability
auc = roc_auc_score(y_true, y_pred_proba)

# When AUC is useful:
# - Imbalanced classes (accuracy can be misleading)
# - When you want to evaluate ranking ability, not a specific threshold

# Precision-Recall AUC: a better metric for very imbalanced datasets
from sklearn.metrics import average_precision_score
ap = average_precision_score(y_true, y_pred_proba)
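A tiny made-up example shows the ranking interpretation of AUC: it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy scores: two negatives, two positives.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Positive-vs-negative pairs and whether the positive outscores the negative:
#   0.35 > 0.1  yes     0.35 > 0.4  no
#   0.80 > 0.1  yes     0.80 > 0.4  yes
# 3 of 4 pairs correctly ordered -> AUC = 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```

No threshold was chosen anywhere, which is exactly why AUC is threshold-independent.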
Cross-validation
Cross-validation gives a more reliable performance estimate by training on different data subsets.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Example data so the snippet runs end to end (any X, y works here)
X, y = make_classification(n_samples=500, random_state=42)
model = RandomForestClassifier(random_state=42)
# k-Fold Cross-Validation:
# Split data into k folds
# Train on k-1 folds, evaluate on 1
# Repeat k times, average the scores
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
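The four steps above can be written out by hand, which makes clear what `cross_val_score` is doing. This is a simplified sketch with shuffled `KFold` and a made-up dataset (sklearn's default for classifiers is actually stratified):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Hypothetical data so the sketch is self-contained.
X, y = make_classification(n_samples=200, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = RandomForestClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

# Average the k scores for the final estimate
print(f"F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Each sample is used for evaluation exactly once, so the averaged score is far less sensitive to a lucky or unlucky single split.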
# Stratified k-Fold: maintains class proportion in each fold
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
Exam tip
The most common ML metrics question is "When would you use precision vs recall?" Use precision when false positives are expensive (spam filter); use recall when false negatives are expensive (disease detection, fraud). Accuracy is misleading for imbalanced datasets.