Pandas Basics: DataFrames, Filtering, and GroupBy
Pandas is the data manipulation library in Python. Here's what data science exams test — DataFrames, filtering, and aggregation.
DataFrames and Series
A DataFrame is a 2D table. A Series is a single column with an index.
import pandas as pd
# Create DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Carol', 'Dave'],
'age': [25, 30, 35, 28],
'salary': [70000, 85000, 90000, 75000],
'dept': ['Eng', 'Eng', 'Product', 'Design'],
})
df.shape # (4, 4)
df.dtypes # column data types
df.describe() # summary statistics
df.info() # column types + non-null counts
df.head(2) # first 2 rows
df.tail(2) # last 2 rowsSelecting and filtering data
Pandas has two main selectors: .loc (label-based) and .iloc (integer position-based).
# Column selection df['name'] # single column (Series) df[['name', 'salary']] # multiple columns (DataFrame) # Row selection df.loc[0] # row at label 0 df.iloc[0] # row at position 0 df.loc[1:3, 'name':'age'] # rows 1-3, cols name to age df.iloc[0:2, 0:2] # first 2 rows and cols # Boolean filtering df[df['age'] > 28] df[(df['dept'] == 'Eng') & (df['salary'] > 80000)] df[df['name'].isin(['Alice', 'Carol'])]
GroupBy and aggregation
GroupBy splits data into groups, applies a function, and combines results.
# Single column groupby
df.groupby('dept')['salary'].mean()
# dept
# Design 75000.0
# Eng 77500.0
# Product 90000.0
# Multiple aggregations
df.groupby('dept').agg({
'salary': ['mean', 'min', 'max'],
'age': 'mean',
})
# Transform: return same-indexed result
df['dept_avg_salary'] = df.groupby('dept')['salary'].transform('mean')
# Pivot table
pd.pivot_table(df, values='salary', index='dept', aggfunc='mean')Merging and joining DataFrames
Pandas merge is equivalent to SQL JOINs.
employees = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Carol'],
'dept_id': [1, 1, 2],
})
departments = pd.DataFrame({
'id': [1, 2],
'dept': ['Engineering', 'Product'],
})
# INNER JOIN (default)
pd.merge(employees, departments, left_on='dept_id', right_on='id')
# LEFT JOIN
pd.merge(employees, departments, left_on='dept_id', right_on='id', how='left')
# Concatenate DataFrames vertically
pd.concat([df1, df2], ignore_index=True)Handling missing values
Pandas uses NaN for missing values. Know how to detect, drop, and fill them.
df.isnull().sum() # count NaN per column df.isnull().any(axis=1) # rows with any NaN # Drop rows with any NaN df.dropna() # Drop columns with any NaN df.dropna(axis=1) # Fill with specific value df.fillna(0) df['salary'].fillna(df['salary'].mean()) # Fill forward/backward df.fillna(method='ffill') # forward fill df.fillna(method='bfill') # backward fill # Check that value is NaN import numpy as np np.isnan(value) # True if NaN
Exam tip
The most common Pandas interview question: "What's the difference between loc and iloc?" — loc uses labels (index names), iloc uses integer positions. Also know: & and | for boolean filtering (not `and`/`or`).
Think you're ready? Prove it.
Take the free Data Science readiness test. Get a score, topic breakdown, and your exact weak areas.
Take the free Data Science test →Free · No sign-up · Instant results