HomeGuidesData SciencePandas Explained — DataFrames, Filtering, GroupBy & Merge
📊 Data Science

Pandas Basics: DataFrames, Filtering, and GroupBy

Pandas is the data manipulation library in Python. Here's what data science exams test — DataFrames, filtering, and aggregation.

Examifyr·2026·6 min read

DataFrames and Series

A DataFrame is a 2D table. A Series is a single column with an index.

import pandas as pd

# Create DataFrame
df = pd.DataFrame({
    'name':   ['Alice', 'Bob', 'Carol', 'Dave'],
    'age':    [25, 30, 35, 28],
    'salary': [70000, 85000, 90000, 75000],
    'dept':   ['Eng', 'Eng', 'Product', 'Design'],
})

df.shape        # (4, 4)
df.dtypes       # column data types
df.describe()   # summary statistics
df.info()       # column types + non-null counts
df.head(2)      # first 2 rows
df.tail(2)      # last 2 rows

Selecting and filtering data

Pandas has two main selectors: .loc (label-based) and .iloc (integer position-based).

# Column selection
df['name']               # single column (Series)
df[['name', 'salary']]   # multiple columns (DataFrame)

# Row selection
df.loc[0]               # row at label 0
df.iloc[0]              # row at position 0
df.loc[1:3, 'name':'age']  # rows 1-3, cols name to age
df.iloc[0:2, 0:2]          # first 2 rows and cols

# Boolean filtering
df[df['age'] > 28]
df[(df['dept'] == 'Eng') & (df['salary'] > 80000)]
df[df['name'].isin(['Alice', 'Carol'])]
Note: Use & and | for boolean operations in Pandas (not and/or). Also wrap each condition in parentheses.

GroupBy and aggregation

GroupBy splits data into groups, applies a function, and combines results.

# Single column groupby
df.groupby('dept')['salary'].mean()
# dept
# Design    75000.0
# Eng       77500.0
# Product   90000.0

# Multiple aggregations
df.groupby('dept').agg({
    'salary': ['mean', 'min', 'max'],
    'age':    'mean',
})

# Transform: return same-indexed result
df['dept_avg_salary'] = df.groupby('dept')['salary'].transform('mean')

# Pivot table
pd.pivot_table(df, values='salary', index='dept', aggfunc='mean')

Merging and joining DataFrames

Pandas merge is equivalent to SQL JOINs.

employees = pd.DataFrame({
    'id':   [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'dept_id': [1, 1, 2],
})
departments = pd.DataFrame({
    'id':   [1, 2],
    'dept': ['Engineering', 'Product'],
})

# INNER JOIN (default)
pd.merge(employees, departments, left_on='dept_id', right_on='id')

# LEFT JOIN
pd.merge(employees, departments, left_on='dept_id', right_on='id', how='left')

# Concatenate DataFrames vertically
pd.concat([df1, df2], ignore_index=True)

Handling missing values

Pandas uses NaN for missing values. Know how to detect, drop, and fill them.

df.isnull().sum()          # count NaN per column
df.isnull().any(axis=1)    # rows with any NaN

# Drop rows with any NaN
df.dropna()

# Drop columns with any NaN
df.dropna(axis=1)

# Fill with specific value
df.fillna(0)
df['salary'].fillna(df['salary'].mean())

# Fill forward/backward
df.fillna(method='ffill')  # forward fill
df.fillna(method='bfill')  # backward fill

# Check that value is NaN
import numpy as np
np.isnan(value)  # True if NaN
Note: NaN != NaN in Python (IEEE 754). Never use == to check for NaN. Use pd.isna() or np.isnan() instead.

Exam tip

The most common Pandas interview question: "What's the difference between loc and iloc?" — loc uses labels (index names), iloc uses integer positions. Also know: & and | for boolean filtering (not `and`/`or`).

🎯

Think you're ready? Prove it.

Take the free Data Science readiness test. Get a score, topic breakdown, and your exact weak areas.

Take the free Data Science test →

Free · No sign-up · Instant results

← Previous
NumPy Explained — Arrays, Broadcasting, Indexing & Common Operations
Next →
Statistics for Data Science — Mean, Distributions, Hypothesis Testing
← All Data Science guides