Statistics Fundamentals for Data Science
Statistics is the language of data science. Here's what exams test most, from descriptive statistics to hypothesis testing.
Descriptive statistics
Descriptive statistics summarise and describe data.
import numpy as np
from scipy import stats

data = [23, 25, 26, 28, 28, 28, 30, 32, 35, 100]

np.mean(data)            # 35.5 — sensitive to outliers
np.median(data)          # 28.0 — robust to outliers
stats.mode(data).mode    # 28 — most frequent value
np.std(data)             # standard deviation (population, ddof=0)
np.var(data)             # variance = std^2
np.percentile(data, 75)  # 75th percentile (Q3)

# IQR (Interquartile Range) — robust spread measure
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1  # spread of the middle 50% of the data
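A common exam application of the IQR is flagging outliers with the conventional 1.5 × IQR fences. This is a sketch of that rule applied to the same sample data; the 1.5 multiplier is the usual convention, not the only choice.

```python
import numpy as np

data = [23, 25, 26, 28, 28, 28, 30, 32, 35, 100]

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional fences: values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [100]
```

Note how 100 is flagged, which is exactly why the mean (35.5) is pulled so far above the median (28.0) in this data set.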
Probability distributions
Distributions describe how data is spread. Know the most common ones.
# Normal (Gaussian) distribution:
#   Bell curve, symmetric
#   68% within 1 std, 95% within 2, 99.7% within 3 (68-95-99.7 rule)
#   Real examples: heights, measurement errors

# Binomial distribution:
#   n independent trials, each with probability p of success
#   Example: 10 coin flips, probability of k heads
from scipy.stats import binom
P_exactly_6_heads = binom.pmf(k=6, n=10, p=0.5)  # 0.205

# Poisson distribution:
#   Count of events in a fixed time/space window, mean rate = lambda
#   Example: customer arrivals per hour
from scipy.stats import poisson
P_3_arrivals = poisson.pmf(k=3, mu=5)  # mean = 5 arrivals/hr

# Uniform distribution: all values equally likely
# Bernoulli: single trial that succeeds with probability p
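The 68-95-99.7 rule quoted above isn't something you have to memorise on faith: it can be verified numerically from the standard normal CDF. A quick check:

```python
from scipy.stats import norm

# P(-k < Z < k) for a standard normal, k = 1, 2, 3 standard deviations
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} std: {p:.4f}")
# within 1 std: 0.6827
# within 2 std: 0.9545
# within 3 std: 0.9973
```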
Hypothesis testing
Hypothesis testing determines if observed data supports a claim about a population.
# Steps:
# 1. State H0 (null hypothesis) and H1 (alternative)
# 2. Choose significance level α (typically 0.05)
# 3. Compute the test statistic
# 4. Compute the p-value
# 5. Reject H0 if p-value < α
from scipy import stats

# One-sample t-test: is the mean different from a given value?
data = [2.1, 2.3, 2.0, 2.2, 2.4, 1.9]
t_stat, p_value = stats.ttest_1samp(data, popmean=2.0)
# If p < 0.05: reject H0 (evidence that the mean differs from 2.0)

# Two-sample t-test: do two groups have different means?
group_a = [85, 87, 90, 83, 88]
group_b = [78, 82, 80, 79, 85]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
P-values and confidence intervals
P-values quantify evidence against the null hypothesis. Confidence intervals give a range for the true parameter.
# P-value interpretation:
# p-value = probability of observing this result (or more extreme)
# ASSUMING H0 is true
# p < 0.05: "statistically significant" (reject H0)
# p > 0.05: "fail to reject H0" (not enough evidence)
# COMMON MISCONCEPTIONS:
# p-value is NOT the probability that H0 is true
# p-value is NOT the probability of a false positive
# p < 0.05 does NOT mean the effect is practically significant
# Confidence interval:
import numpy as np
from scipy import stats
data = [85, 87, 90, 83, 88, 86, 84, 89, 91, 82]
mean = np.mean(data)
ci = stats.t.interval(0.95, df=len(data)-1,
loc=mean, scale=stats.sem(data))
# 95% CI ≈ (84.3, 88.7): if we repeated this sampling procedure many
# times, the resulting intervals would contain the true mean about 95% of the time
Exam tip
The p-value misconception is the most common statistics exam trap. A p-value is NOT "the probability that the null hypothesis is true" — it's the probability of observing this data assuming the null is true.
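The repeated-sampling interpretation of a confidence interval can be demonstrated directly with a simulation. The true mean, sample size, and trial count below are arbitrary illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, n, trials = 50.0, 20, 2000

covered = 0
for _ in range(trials):
    # Draw a fresh sample and build a 95% t-interval from it
    sample = rng.normal(loc=true_mean, scale=5.0, size=n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=np.mean(sample), scale=stats.sem(sample))
    if lo <= true_mean <= hi:
        covered += 1

print(covered / trials)  # close to 0.95
```

Roughly 95% of the intervals cover the true mean, which is exactly what the "95%" in a 95% CI promises; it is a statement about the procedure, not about any single interval.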
Think you're ready? Prove it.
Take the free Data Science readiness test. Get a score, topic breakdown, and your exact weak areas.
Take the free Data Science test →
Free · No sign-up · Instant results