Model Evaluation: Accuracy, Precision, Recall, F1, and ROC-AUC
Choosing the right evaluation metric is critical. Exams most often test the precision vs recall tradeoff and the cases where accuracy fails.
Confusion matrix
The confusion matrix shows the breakdown of correct and incorrect predictions.
#                   Predicted Positive    Predicted Negative
# Actual Positive   TP (True Pos)         FN (False Neg)
# Actual Negative   FP (False Pos)        TN (True Neg)

# TP: correctly predicted positive
# TN: correctly predicted negative
# FP: predicted positive, actually negative (Type I error)
# FN: predicted negative, actually positive (Type II error)

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)

# Example:
# 100 patients, 10 have disease
# Model: TP=8, FN=2, FP=5, TN=85
# Accuracy: (TP+TN)/total = 93/100 = 93%
# But 90% accuracy is achievable by predicting "no disease" for everyone!
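The worked example above can be reproduced end to end. The labels below are hypothetical, constructed so the counts come out to TP=8, FN=2, FP=5, TN=85:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical cohort: 100 patients, 10 with the disease (label 1).
# Model catches 8 of the 10 positives and raises 5 false alarms.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 85)

cm = confusion_matrix(y_true, y_pred)  # rows = actual, cols = predicted
tn, fp, fn, tp = cm.ravel()            # sklearn's flattened order: TN, FP, FN, TP
accuracy = (tp + tn) / cm.sum()

print(tp, fn, fp, tn)   # 8 2 5 85
print(accuracy)         # 0.93
```

Note that sklearn puts the negative class in the first row/column, so `cm.ravel()` yields TN first, not TP.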
Precision, Recall, and F1
Precision and Recall capture different types of errors. F1 balances them.
# Precision = TP / (TP + FP)
# "Of all positive predictions, how many were actually positive?"
# High precision = fewer false alarms
# Use when: false positives are costly (spam detection, fraud alerts)

# Recall (Sensitivity) = TP / (TP + FN)
# "Of all actual positives, how many did we catch?"
# High recall = fewer misses
# Use when: false negatives are costly (cancer detection, fraud prevention)

# F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
# Harmonic mean: penalises extreme imbalance between the two
# Use when: you need a balance between precision and recall

from sklearn.metrics import precision_score, recall_score, f1_score
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
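Plugging the disease-screening numbers (TP=8, FN=2, FP=5) into these formulas makes the gap between the metrics concrete; the labels below are the same hypothetical ones used for the confusion-matrix example:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical example from above: TP=8, FN=2, FP=5, TN=85.
y_true = np.array([1] * 10 + [0] * 90)
y_pred = np.array([1] * 8 + [0] * 2 + [1] * 5 + [0] * 85)

precision = precision_score(y_true, y_pred)  # 8 / (8 + 5)  = 0.615
recall = recall_score(y_true, y_pred)        # 8 / (8 + 2)  = 0.800
f1 = f1_score(y_true, y_pred)                # 2*TP / (2*TP + FP + FN) = 16/23

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Despite 93% accuracy, precision is only about 0.62: nearly 4 in 10 positive predictions are false alarms.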
ROC-AUC
ROC curve plots True Positive Rate vs False Positive Rate at different thresholds. AUC summarises it as a single number.
# TPR (True Positive Rate / Recall) = TP / (TP + FN)
# FPR (False Positive Rate) = FP / (FP + TN)
# ROC curve: plot TPR vs FPR at all thresholds

# AUC (Area Under the Curve):
# - 1.0: perfect model
# - 0.5: random guessing (diagonal line)
# - < 0.5: worse than random

from sklearn.metrics import roc_auc_score, roc_curve

# AUC is threshold-independent: it measures overall discriminative ability
auc = roc_auc_score(y_true, y_pred_proba)

# When AUC is useful:
# - Imbalanced classes (accuracy can be misleading)
# - When you want to evaluate ranking ability, not a specific threshold

# Precision-Recall AUC: a better metric for very imbalanced datasets
from sklearn.metrics import average_precision_score
ap = average_precision_score(y_true, y_pred_proba)
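A tiny made-up example shows the ranking interpretation of AUC: it equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy scores: two negatives, two positives.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Positive-vs-negative pairs and whether the positive outscores the negative:
#   0.35 > 0.1  yes     0.35 > 0.4  no
#   0.80 > 0.1  yes     0.80 > 0.4  yes
# 3 of 4 pairs correctly ordered -> AUC = 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```

No threshold was chosen anywhere, which is exactly why AUC is threshold-independent.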
Cross-validation
Cross-validation gives a more reliable performance estimate by training on different data subsets.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Example data so the snippet runs end to end (any X, y works here)
X, y = make_classification(n_samples=500, random_state=42)
model = RandomForestClassifier(random_state=42)
# k-Fold Cross-Validation:
# Split data into k folds
# Train on k-1 folds, evaluate on 1
# Repeat k times, average the scores
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
print(f"F1: {scores.mean():.3f} +/- {scores.std():.3f}")
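The four steps above can be written out by hand, which makes clear what `cross_val_score` is doing. This is a simplified sketch with shuffled `KFold` and a made-up dataset (sklearn's default for classifiers is actually stratified):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

# Hypothetical data so the sketch is self-contained.
X, y = make_classification(n_samples=200, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = RandomForestClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

# Average the k scores for the final estimate
print(f"F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Each sample is used for evaluation exactly once, so the averaged score is far less sensitive to a lucky or unlucky single split.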
# Stratified k-Fold: maintains class proportion in each fold
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
Exam tip
The most common ML metrics question is "When would you use precision vs recall?" Use precision when false positives are expensive (spam filter); use recall when false negatives are expensive (disease detection, fraud). Accuracy is misleading for imbalanced datasets.