Feature Engineering: Encoding, Scaling, and Handling Missing Data
Feature engineering often matters more than model choice. Here's what data science exams test.
Categorical encoding
Most ML models require numerical inputs, so categorical variables must be encoded first. (A few gradient-boosting libraries, such as LightGBM and CatBoost, can handle categoricals natively.)
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})
# Label Encoding: assigns integer to each category
# Good for: ordinal categories (low=0, medium=1, high=2)
# WRONG for: nominal categories (implies order that doesn't exist)
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
# blue=0, green=1, red=2 (integers assigned in alphabetical order)
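The fitted encoder exposes the learned mapping, which is worth checking on an exam-style question; a quick sketch using the same toy frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue']})
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

# classes_ lists the categories in the sorted order used for the codes
print(list(le.classes_))                      # ['blue', 'green', 'red']
# inverse_transform maps integer codes back to the original labels
print(list(le.inverse_transform([0, 1, 2])))  # ['blue', 'green', 'red']
```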
# One-Hot Encoding: binary column per category
# Good for: nominal categories (no inherent order)
pd.get_dummies(df['color'], drop_first=True)
# Columns: color_green, color_red (blue, first alphabetically, is the dropped reference)
# Ordinal encoding (for ordered categories)
order = {'low': 0, 'medium': 1, 'high': 2}
df['size_encoded'] = df['size'].map(order)  # assumes df has a 'size' column
Feature scaling
Distance-based models (KNN, SVM, neural networks) are sensitive to feature scale. Tree-based models are not.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler (Z-score normalisation):
# Result: mean=0, std=1
# Good for: normally distributed data, SVM, linear regression, neural networks
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# MinMaxScaler (normalisation):
# Result: values between 0 and 1
# Good for: when you need values in a specific range, neural networks
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# CRITICAL: fit only on training data, then transform train AND test
scaler.fit(X_train)  # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use training stats on test
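One way to make the fit-on-train-only rule automatic is to wrap the scaler and model in a Pipeline; cross-validation then refits the scaler inside each training fold, so test folds never leak into the statistics. A sketch with made-up data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [1, 100, 0.01]  # wildly different scales
y = (X[:, 0] > 0).astype(int)

# The pipeline fits StandardScaler on each training fold only,
# then applies those fold statistics to the validation fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```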
Handling missing values
Missing data must be handled before training. The strategy depends on the amount and pattern of missingness.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Types of missingness:
# MCAR: Missing Completely At Random — safe to drop
# MAR: Missing At Random — impute
# MNAR: Missing Not At Random — requires domain knowledge

# Simple imputation
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# KNN imputation (uses similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed = knn_imputer.fit_transform(X)

# Add indicator column for whether value was missing
# (missingness itself can be a signal)
df['age_was_missing'] = df['age'].isna().astype(int)

# For tree-based models: some (XGBoost) handle NaN natively
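SimpleImputer can also produce the missingness indicator itself via add_indicator=True, which keeps imputation and flagging in one transformer. A sketch with a toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, np.nan]})

# add_indicator=True appends one binary column per feature that had NaNs
imputer = SimpleImputer(strategy='median', add_indicator=True)
out = imputer.fit_transform(df)

# column 0: imputed age (median of 25 and 40 is 32.5)
# column 1: 1.0 where the value was originally missing
print(out)
```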
Feature selection
Removing irrelevant features can improve model performance and reduce training time.
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Filter methods: statistical tests
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = X.columns[selector.get_support()]

# Wrapper methods: try subsets with a model
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
rfe.fit(X_train, y_train)

# Embedded methods: model-based importance
rf = RandomForestClassifier().fit(X_train, y_train)
importance = pd.Series(rf.feature_importances_, index=X.columns)
importance.nlargest(10).plot.bar()
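The embedded approach can also act as a transformer via SelectFromModel, which drops features below an importance threshold in one step. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Keep only features whose importance is at least the median importance
selector = SelectFromModel(RandomForestClassifier(random_state=0),
                           threshold='median')
X_reduced = selector.fit_transform(X, y)
print(X.shape, '->', X_reduced.shape)  # roughly half the features survive
```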
Exam tip
The most common feature engineering exam question: "When should you scale features?" — always for distance-based and gradient-based models (KNN, SVM, neural networks; logistic regression also benefits, especially with regularisation); not needed for tree-based models (Random Forest, XGBoost), whose splits depend only on feature order. Also remember: fit the scaler on the training set only.
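The tree-versus-KNN claim can be checked directly: multiplying one feature by a large constant leaves a decision tree's predictions unchanged but distorts KNN's distances. A sketch on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_stretched = X * [1, 1000]  # blow up the second feature's scale

# Tree splits depend only on the ordering of values within each feature,
# so rescaling a feature rescales the thresholds but not the predictions
pred_a = DecisionTreeClassifier(random_state=0).fit(X, y).predict(X)
pred_b = DecisionTreeClassifier(random_state=0).fit(X_stretched, y).predict(X_stretched)
print((pred_a == pred_b).all())  # True: the tree ignores the rescaling

# KNN distances become dominated by the stretched feature,
# so its predictions typically change
knn_a = KNeighborsClassifier().fit(X, y).predict(X)
knn_b = KNeighborsClassifier().fit(X_stretched, y).predict(X_stretched)
print((knn_a == knn_b).mean())  # agreement fraction, typically below 1.0
```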
Think you're ready? Prove it.
Take the free Data Science readiness test. Get a score, topic breakdown, and your exact weak areas.
Take the free Data Science test →
Free · No sign-up · Instant results