Statistics Fundamentals for Data Science

Statistics is the language of data science. Here's what exams test — from descriptive stats to hypothesis testing.

Examifyr·2026·6 min read

Descriptive statistics

Descriptive statistics summarise and describe data.

import numpy as np
from scipy import stats

data = [23, 25, 26, 28, 28, 28, 30, 32, 35, 100]

np.mean(data)          # 35.5, sensitive to outliers
np.median(data)        # 28.0, robust to outliers
stats.mode(data).mode  # 28, most frequent value (mode() returns a result object)

np.std(data)     # population standard deviation (ddof=0); pass ddof=1 for a sample
np.var(data)     # variance = std^2 (also ddof=0 by default)
np.percentile(data, 75)  # 75th percentile (Q3)

# IQR (Interquartile Range) — robust spread measure
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1   # range of middle 50% of data
Note: Mean is sensitive to outliers; median is robust. When the mean and median diverge significantly, you likely have outliers or a skewed distribution.
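
To see the note in action, here is a minimal sketch that recomputes the mean and median of the same data with and without the value 100. It uses only Python's built-in `statistics` module; the name `trimmed` is an illustrative choice, not part of the original example.

```python
import statistics

data = [23, 25, 26, 28, 28, 28, 30, 32, 35, 100]
trimmed = [x for x in data if x != 100]  # same data with the outlier dropped

mean_with = statistics.mean(data)            # 35.5
median_with = statistics.median(data)        # 28.0
mean_without = statistics.mean(trimmed)      # about 28.33
median_without = statistics.median(trimmed)  # 28

print(f"with outlier:    mean={mean_with:.2f}, median={median_with}")
print(f"without outlier: mean={mean_without:.2f}, median={median_without}")
```

Dropping one value moves the mean by about 7 units, while the median stays at 28.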

Probability distributions

Distributions describe how data is spread. Know the most common ones.

# Normal (Gaussian) distribution:
# Bell curve, symmetric
# 68% within 1 std, 95% within 2, 99.7% within 3 (68-95-99.7 rule)
# Real examples: heights, measurement errors

# Binomial distribution:
# n independent trials, each with probability p of success
# Example: 10 coin flips, probability of k heads
from scipy.stats import binom
P_exactly_6_heads = binom.pmf(k=6, n=10, p=0.5)  # 0.205

# Poisson distribution:
# Count of events in fixed time/space, mean rate = lambda
# Example: customer arrivals per hour
from scipy.stats import poisson
P_3_arrivals = poisson.pmf(k=3, mu=5)  # mean=5 arrivals/hr; result is about 0.140

# Uniform distribution: all values equally likely
# Bernoulli: a single trial with outcome 1 (success, probability p) or 0 (failure, probability 1 - p)
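
The numbers quoted above can be checked from the formulas themselves. The sketch below uses only the standard library; `binomial_pmf` and `poisson_pmf` are hypothetical helper names written for this illustration, not SciPy functions.

```python
import math
from statistics import NormalDist

# Normal: probability mass within k standard deviations of the mean
std_normal = NormalDist(mu=0.0, sigma=1.0)
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
# within[1] is about 0.6827, within[2] about 0.9545, within[3] about 0.9973

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson count with mean rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

p_six_heads = binomial_pmf(6, 10, 0.5)  # 210 / 1024, about 0.205
p_three_arrivals = poisson_pmf(3, 5)    # about 0.140
```

Writing the formulas out once makes it easier to recognise which distribution an exam question is describing.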

Hypothesis testing

Hypothesis testing assesses whether observed data provide enough evidence to reject a default claim (the null hypothesis) about a population.

# Steps:
# 1. State H0 (null hypothesis) and H1 (alternative)
# 2. Choose significance level α (typically 0.05)
# 3. Compute test statistic
# 4. Compute p-value
# 5. Reject H0 if p-value < α; otherwise fail to reject H0

from scipy import stats

# One-sample t-test: is mean different from a value?
data = [2.1, 2.3, 2.0, 2.2, 2.4, 1.9]
t_stat, p_value = stats.ttest_1samp(data, popmean=2.0)
# If p < 0.05: reject H0 (evidence that the mean differs from 2.0)
# Here t is about 1.96 and p about 0.11, so H0 is NOT rejected

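# Two-sample t-test: do two independent groups have different means?
# (assumes equal variances by default; pass equal_var=False for Welch's t-test)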
group_a = [85, 87, 90, 83, 88]
group_b = [78, 82, 80, 79, 85]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
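
The five steps can also be walked through by hand for the one-sample data above. This is a sketch using only the standard library: instead of computing a p-value, it compares the t statistic with a critical value, which leads to the same decision. The constant `T_CRIT_DF5` is hard-coded from a standard t-table and is an assumption of this example.

```python
import math
import statistics

data = [2.1, 2.3, 2.0, 2.2, 2.4, 1.9]
popmean = 2.0  # step 1: H0 says the population mean is 2.0
alpha = 0.05   # step 2: significance level

# Step 3: t = (sample mean - hypothesised mean) / standard error
n = len(data)
sample_mean = statistics.mean(data)
std_err = statistics.stdev(data) / math.sqrt(n)  # stdev divides by n - 1
t_stat = (sample_mean - popmean) / std_err

# Steps 4-5: compare |t| with the two-sided critical value for df = n - 1 = 5.
# 2.571 is the t-table value for alpha = 0.05; it is hard-coded here.
T_CRIT_DF5 = 2.571
reject_h0 = abs(t_stat) > T_CRIT_DF5

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```

The sample mean (2.15) is above 2.0, but |t| stays below the critical value, so H0 is not rejected at α = 0.05. This matches a p-value a little above 0.10.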

P-values and confidence intervals

P-values quantify evidence against the null hypothesis. Confidence intervals give a range for the true parameter.

# P-value interpretation:
# p-value = probability of observing this result (or more extreme)
#           ASSUMING H0 is true
# p < α (e.g. 0.05): "statistically significant" (reject H0)
# p >= α: "fail to reject H0" (not enough evidence; this does not prove H0)

# COMMON MISCONCEPTIONS:
# p-value is NOT the probability that H0 is true
# p-value is NOT the probability of a false positive
# p < 0.05 does NOT mean the effect is practically significant

# Confidence interval:
import numpy as np
from scipy import stats

data = [85, 87, 90, 83, 88, 86, 84, 89, 91, 82]
mean = np.mean(data)
ci = stats.t.interval(0.95, df=len(data)-1,
                       loc=mean, scale=stats.sem(data))
# 95% CI: about (84.3, 88.7). Interpretation: if we repeated the sampling
# many times, about 95% of the intervals built this way would contain the true mean
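
The "repeated sampling" reading of a confidence interval can be checked with a small simulation. The sketch below rebuilds the interval by hand (mean plus or minus t times the standard error) and then counts how often such intervals capture a known mean. The population values (`TRUE_MEAN`, `TRUE_SD`), the seed, and the hard-coded t-table constant `T_CRIT_DF9` are assumptions chosen for illustration.

```python
import math
import random
import statistics

T_CRIT_DF9 = 2.262  # two-sided 95% t critical value for df = 9 (from a t-table)

def t_interval_95(sample):
    """95% CI for the mean of a 10-value sample: mean +/- t * standard error."""
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = T_CRIT_DF9 * std_err
    return mean - margin, mean + margin

data = [85, 87, 90, 83, 88, 86, 84, 89, 91, 82]
low, high = t_interval_95(data)  # roughly (84.3, 88.7)

# Coverage check: draw many samples from a population whose mean we know
rng = random.Random(0)
TRUE_MEAN, TRUE_SD, TRIALS = 86.0, 3.0, 5000
hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(10)]
    lo, hi = t_interval_95(sample)
    hits += lo <= TRUE_MEAN <= hi
coverage = hits / TRIALS  # should land close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the procedure across many samples, not one interval.
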
Note: A p-value < 0.05 means statistical significance, NOT practical importance. A tiny effect can be statistically significant with a large sample.
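
A quick calculation shows how sample size alone can create statistical significance. This sketch assumes a z-test with a known standard deviation, which is a simplification of the t-tests above; the function name `two_sided_z_p_value` and the effect size of 0.01 are hypothetical choices for illustration.

```python
import math

def two_sided_z_p_value(observed_diff, sigma, n):
    """Two-sided p-value for a mean difference when sigma is known (z-test)."""
    z = observed_diff / (sigma / math.sqrt(n))
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large z
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect (0.01 standard deviations), two very different sample sizes
p_small_n = two_sided_z_p_value(observed_diff=0.01, sigma=1.0, n=100)
p_large_n = two_sided_z_p_value(observed_diff=0.01, sigma=1.0, n=1_000_000)

print(f"n=100:       p = {p_small_n:.3f}")
print(f"n=1,000,000: p = {p_large_n:.3g}")
```

The effect is identical in both cases. Only the sample size changed, yet the p-value drops from about 0.92 to far below 0.05.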

Exam tip

The p-value misconception is the most common statistics exam trap. A p-value is NOT "the probability that the null hypothesis is true". It is the probability of observing data at least as extreme as yours, assuming the null is true.
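
One way to make the definition concrete is to simulate it. The sketch below generates many samples from a world where H0 really is true and counts how often the t statistic is at least as extreme as the one observed in the one-sample example. The seed, the number of trials, and the spread of 0.2 are arbitrary assumptions; `t_statistic` is a hypothetical helper.

```python
import math
import random
import statistics

def t_statistic(sample, popmean):
    """One-sample t statistic: (mean - popmean) / (stdev / sqrt(n))."""
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - popmean) / std_err

observed = [2.1, 2.3, 2.0, 2.2, 2.4, 1.9]
t_obs = t_statistic(observed, popmean=2.0)

# Simulate a world where H0 is true: normal data whose mean really is 2.0.
# The spread (0.2) is arbitrary; the t statistic's distribution does not depend on it.
rng = random.Random(42)
TRIALS = 10_000
as_extreme = 0
for _ in range(TRIALS):
    fake = [rng.gauss(2.0, 0.2) for _ in range(len(observed))]
    if abs(t_statistic(fake, popmean=2.0)) >= abs(t_obs):
        as_extreme += 1

p_simulated = as_extreme / TRIALS  # share of H0 worlds at least this extreme
```

The simulated share should come out close to the p-value a t-test reports for this data (a little above 0.10). Nothing in the simulation asks how likely H0 is; H0 is assumed true throughout.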
