📊 Data Science

Feature Engineering: Encoding, Scaling, and Handling Missing Data

Feature engineering often matters more than model choice. Here's what data science exams test.

Examifyr·2026·5 min read

Categorical encoding

ML models require numerical inputs. Categorical variables must be encoded.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['low', 'high', 'medium', 'low', 'high'],
})

# Label Encoding: assigns an integer to each category (alphabetical order)
# Good for: ordinal categories (low=0, medium=1, high=2)
# WRONG for: nominal categories (implies an order that doesn't exist)
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
# blue=0, green=1, red=2

# One-Hot Encoding: one binary column per category
# Good for: nominal categories (no inherent order)
pd.get_dummies(df['color'], drop_first=True)
# Columns: color_green, color_red (blue, first alphabetically, is the reference)

# Ordinal encoding (for ordered categories): map the values explicitly
order = {'low': 0, 'medium': 1, 'high': 2}
df['size_encoded'] = df['size'].map(order)
Note: Using label encoding for nominal categories (colours, cities) introduces a false ordering — linear and distance-based models will treat red=2 as "twice" green=1. Tree-based models tolerate integer codes better, but one-hot encoding is the safe default for unordered categories.

Feature scaling

Models that rely on distances or gradient magnitudes (KNN, SVM, neural networks) are sensitive to feature scale. Tree-based models are not, because splits depend only on the ordering of values.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# StandardScaler (Z-score normalisation):
# Result: mean=0, std=1
# Good for: normally distributed data, SVM, linear regression, neural networks
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# MinMaxScaler (normalisation):
# Result: values between 0 and 1
# Good for: when you need values in a specific range, neural networks
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# CRITICAL: fit only on training data, then transform train AND test
scaler.fit(X_train)            # learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # use training stats on test
Note: Never fit the scaler on the test set — it would leak information about the test distribution into the training process.
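The cleanest way to enforce the fit-on-train-only rule is a scikit-learn `Pipeline`: during cross-validation it refits the scaler inside each fold, so test folds are always transformed with statistics learned from the training folds. A minimal sketch on the built-in iris dataset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# The pipeline bundles scaling and the model into one estimator;
# cross_val_score fits the scaler on each training fold only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

Scaling the whole dataset before cross-validation would leak test-fold statistics into training — the pipeline makes that mistake impossible.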

Handling missing values

Missing data must be handled before training. The strategy depends on the amount and pattern of missingness.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Types of missingness:
# MCAR: Missing Completely At Random — safe to drop
# MAR: Missing At Random — impute
# MNAR: Missing Not At Random — requires domain knowledge

# Simple imputation
imputer = SimpleImputer(strategy='mean')    # or 'median', 'most_frequent'
X_imputed = imputer.fit_transform(X)

# KNN Imputation (uses similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
X_imputed = knn_imputer.fit_transform(X)

# Add indicator column for whether value was missing
# (missingness itself can be a signal)
df['age_was_missing'] = df['age'].isna().astype(int)

# For tree-based models: some (XGBoost) handle NaN natively
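The manual indicator-column trick above is also built into `SimpleImputer` via `add_indicator=True`, which appends one binary column per feature that contained missing values. A small sketch with a toy array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# add_indicator=True appends a missingness flag column per affected feature,
# so the model can use the fact that a value was missing as a signal
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
# [[1.  2.  0.  0. ]
#  [4.  3.  1.  0. ]
#  [7.  2.5 0.  1. ]]
```

Column 0's mean is (1+7)/2 = 4 and column 1's is (2+3)/2 = 2.5, so those values fill the gaps while the last two columns record where the NaNs were.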

Feature selection

Removing irrelevant features can improve model performance and reduce training time.

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# Filter methods: statistical tests
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
selected_features = X_train.columns[selector.get_support()]

# Wrapper methods: try subsets with a model
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
rfe.fit(X_train, y_train)

# Embedded methods: model-based importance
rf = RandomForestClassifier().fit(X_train, y_train)
importance = pd.Series(rf.feature_importances_, index=X_train.columns)
importance.nlargest(10).plot.bar()
Note: Feature importance from tree models can be misleading for highly correlated features — consider permutation importance for more reliable results.
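Permutation importance, mentioned in the note above, is available in `sklearn.inspection`: it shuffles one feature at a time on held-out data and measures how much the score drops. A sketch on the built-in breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times on the held-out set; features whose
# shuffling barely hurts the score are not genuinely important
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(X.columns[i], round(result.importances_mean[i], 4))
```

Because it is computed on held-out data, permutation importance is less biased toward high-cardinality features and spreads credit more honestly across correlated ones.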

Exam tip

The most common feature engineering exam question: "When should you scale features?" — always for scale-sensitive models (KNN, SVM, neural networks, and regularised linear or logistic regression); not needed for tree-based models (Random Forest, XGBoost). Also remember: fit the scaler on the train set only.

🎯

Think you're ready? Prove it.

Take the free Data Science readiness test. Get a score, topic breakdown, and your exact weak areas.

Take the free Data Science test →

Free · No sign-up · Instant results
