Machine Learning Overview: Supervised, Unsupervised, and Key Concepts
Machine learning fundamentals are tested on every data science exam. Here are the core concepts and where beginners get confused.
Supervised vs unsupervised learning
The fundamental distinction in machine learning is whether you have labelled training data.
# Supervised Learning: labelled training data (X → y)
#   Classification: predict a category
#   - Logistic Regression, Decision Trees, Random Forest, SVM, Neural Networks
#   - Examples: spam detection, disease diagnosis, image classification
#   Regression: predict a continuous value
#   - Linear Regression, Ridge/Lasso, Random Forest, XGBoost
#   - Examples: house price prediction, sales forecasting

# Unsupervised Learning: no labels, find structure in data
#   Clustering: group similar data points
#   - K-Means, DBSCAN, Hierarchical
#   - Examples: customer segmentation, anomaly detection
#   Dimensionality Reduction: reduce features
#   - PCA, t-SNE, UMAP
#   - Examples: visualisation, preprocessing

# Semi-supervised: small labelled + large unlabelled data
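The distinction is easiest to see side by side. Here is a minimal sketch using scikit-learn on synthetic data (the dataset, model choices, and parameters are illustrative assumptions, not part of any exam syllabus):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic 2-D data: 3 well-separated groups (illustrative only)
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Supervised: labels y are available, so we can train a classifier on (X, y)
clf = LogisticRegression().fit(X, y)
print("Classifier train accuracy:", clf.score(X, y))

# Unsupervised: pretend the labels don't exist and look for structure in X alone
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))
```

Same data, two problems: the classifier needs `y`; K-Means never sees it and recovers the groups from geometry alone.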
Train, validation, and test sets
Splitting data correctly is critical for honest model evaluation.
from sklearn.model_selection import train_test_split
X, y = features, labels
# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train/validation/test (60/20/20)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
# Rules:
# Train: model learns from this
# Validation: tune hyperparameters, select model
# Test: final, unbiased evaluation (touch ONCE at the end)
Overfitting, underfitting, and bias-variance tradeoff
These concepts explain why ML models fail to generalise to new data.
# Overfitting:
# - Model too complex, memorises training data
# - Low training error, high test error
# - Symptoms: perfect training accuracy, poor test accuracy
# - Fix: more data, regularisation, simpler model, dropout

# Underfitting:
# - Model too simple, can't capture patterns
# - High training error AND high test error
# - Fix: more features, more complex model, less regularisation

# Bias-Variance Tradeoff:
# Bias: error from wrong assumptions (underfitting)
# Variance: sensitivity to training data fluctuations (overfitting)
# High bias → underfitting (e.g., linear model for non-linear data)
# High variance → overfitting (e.g., deep decision tree)
# Goal: find the sweet spot
# - Cross-validation helps estimate generalisation error
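You can see the overfitting symptom directly by comparing a deep decision tree with a shallow one. This is a sketch on synthetic data (the dataset and depths are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# None = grow until leaves are pure (high variance); 3 = simpler (higher bias)
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

# Cross-validation gives a more stable estimate of generalisation error
# than a single train/test split
cv_scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42),
                            X, y, cv=5)
print("5-fold CV accuracy:", round(cv_scores.mean(), 2))
```

The unrestricted tree scores (near-)perfectly on the training set but worse on the test set — the classic overfitting gap the exam question is asking about.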
Common algorithms overview
Know when to use which algorithm — a common exam and interview topic.
# Linear Regression
# - Continuous output, linear relationship assumption
# - Fast, interpretable, good baseline

# Logistic Regression
# - Binary/multi-class classification
# - Outputs probabilities, interpretable

# Decision Tree
# - Classification and regression
# - Interpretable, but prone to overfitting

# Random Forest
# - Ensemble of decision trees (bagging)
# - Reduces variance, handles missing values
# - Less interpretable than single tree

# Gradient Boosting (XGBoost, LightGBM)
# - Builds trees sequentially, each corrects previous errors
# - Often best for tabular data
# - Can overfit if not tuned

# K-Nearest Neighbours (KNN)
# - Non-parametric, lazy learning
# - Slow at prediction, sensitive to scale
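A quick way to build intuition for these tradeoffs is to cross-validate several of the classifiers above on one dataset. A minimal sketch, assuming a synthetic dataset and default-ish settings (all illustrative, not a benchmark):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # KNN is distance-based and sensitive to scale, so standardise first
    "KNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.2f}")
```

Note the KNN pipeline: scaling is bundled with the model so it is refit inside each CV fold, which avoids leaking test-fold statistics into training.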
Exam tip
Overfitting vs underfitting is the most common ML exam question. Remember: overfitting = low train error, high test error (too complex); underfitting = high both (too simple). The test set should NEVER be used for model selection.
Think you're ready? Prove it.
Take the free Data Science readiness test. Get a score, topic breakdown, and your exact weak areas.
Take the free Data Science test →
Free · No sign-up · Instant results