Probability & Distributions

Probability is the mathematical language of uncertainty. Understanding probability distributions is essential for statistical modeling, hypothesis testing, and making predictions with confidence.

Probability Basics

Probability measures how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain).

Probability Formula
P(A) = Number of favorable outcomes / Total number of outcomes

Example: Rolling a 6 on a fair die
P(6) = 1/6 ≈ 0.167 = 16.7%
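This can also be checked by simulation; a quick sketch (not part of the original example, the roll count is illustrative):

```python
import random

random.seed(0)
rolls = [random.randint(1, 6) for _ in range(100_000)]  # 100,000 fair die rolls

p_six = rolls.count(6) / len(rolls)
print(f"Estimated P(6) = {p_six:.3f}")  # close to 1/6 ≈ 0.167
```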

Key rules:
P(A or B) = P(A) + P(B) - P(A and B)   # Addition rule
P(A and B) = P(A) * P(B)               # Multiplication rule (only if A and B are independent)
P(not A) = 1 - P(A)                    # Complement rule
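These rules can be verified numerically. A small sketch using one roll of a fair die, with illustrative events A (even roll) and B (roll above 3):

```python
# Sample space for one roll of a fair six-sided die
outcomes = set(range(1, 7))
A = {2, 4, 6}  # roll is even
B = {4, 5, 6}  # roll is greater than 3

def p(event):
    """Probability under equally likely outcomes."""
    return len(event) / len(outcomes)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
lhs = p(A | B)
rhs = p(A) + p(B) - p(A & B)
print(f"P(A or B) = {lhs:.3f}, via addition rule = {rhs:.3f}")

# Complement rule: P(not A) = 1 - P(A)
print(f"P(not A)  = {p(outcomes - A):.3f} = 1 - {p(A):.3f}")
```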

Conditional Probability

Conditional probability is the probability of an event given that another event has already occurred. It is written as P(A|B) — "the probability of A given B."

Formula
P(A|B) = P(A and B) / P(B)

Example: What is the probability a customer buys, given that they visited the product page?

P(Buy | Visit) = P(Buy and Visit) / P(Visit)
P(Buy | Visit) = 0.05 / 0.30 ≈ 0.167 = 16.7%
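The same calculation in code, using the figures from the example above:

```python
# Conditional probability: P(A|B) = P(A and B) / P(B)
p_buy_and_visit = 0.05  # P(Buy and Visit)
p_visit = 0.30          # P(Visit)

p_buy_given_visit = p_buy_and_visit / p_visit
print(f"P(Buy | Visit) = {p_buy_given_visit:.1%}")  # 16.7%
```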

Bayes' Theorem

Bayes' theorem lets you update probabilities as you receive new evidence. It is the foundation of Bayesian statistics and many machine learning algorithms.

Python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)

# Example: Medical test accuracy
# Disease prevalence: 1%
# Test sensitivity (true positive rate): 95%
# Test specificity (true negative rate): 90%

p_disease = 0.01           # Prior probability
p_positive_given_disease = 0.95  # Sensitivity
p_positive_given_healthy = 0.10  # False positive rate (1 - specificity)

# Total probability of positive test
p_positive = (p_positive_given_disease * p_disease +
              p_positive_given_healthy * (1 - p_disease))

# Probability of disease given positive test
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print(f"P(Disease | Positive test) = {p_disease_given_positive:.1%}")
# Result: ~8.8%, surprisingly low!
💡 The base rate fallacy: even with a test that is 95% sensitive, a positive result means only an ~8.8% chance of disease when the disease is rare (1% prevalence). This is why understanding Bayes' theorem matters in real-world decision making.

Probability Distributions

A probability distribution describes how the values of a random variable are spread out. Different distributions model different types of data.

Normal Distribution (Gaussian)

The most important distribution in statistics. Many natural phenomena follow a bell-shaped curve. It is defined by two parameters: mean (μ) and standard deviation (σ).

Python
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# Create a normal distribution
mu, sigma = 100, 15  # IQ scores: mean=100, std=15
dist = stats.norm(loc=mu, scale=sigma)

# Probability of scoring below 130
print(f"P(IQ < 130) = {dist.cdf(130):.4f}")  # ~0.9772

# Probability of scoring between 85 and 115
p = dist.cdf(115) - dist.cdf(85)
print(f"P(85 < IQ < 115) = {p:.4f}")  # ~0.6827 (68-95-99.7 rule)

# Generate random samples
samples = np.random.normal(mu, sigma, 10000)
plt.hist(samples, bins=50, density=True, alpha=0.7)
plt.title('Normal Distribution (IQ Scores)')
plt.show()

Binomial Distribution

Models the number of successes in a fixed number of independent trials, each with the same probability of success. Think: coin flips, pass/fail rates, click-through rates.

Python
from scipy import stats

# Binomial: n trials, each with success probability p
n, p = 100, 0.3  # 100 emails, 30% open rate
dist = stats.binom(n=n, p=p)

# Probability of exactly 35 opens
print(f"P(X = 35) = {dist.pmf(35):.4f}")

# Probability of 25 or fewer opens
print(f"P(X <= 25) = {dist.cdf(25):.4f}")

# Expected value and std deviation
print(f"Mean = {dist.mean():.1f}, Std = {dist.std():.1f}")

Poisson Distribution

Models the number of events in a fixed time period when events occur independently at a constant average rate. Think: website visits per hour, defects per batch, calls per day.

Python
from scipy import stats

# Poisson: average rate (lambda) of events per interval
lam = 5  # Average of 5 support tickets per hour
dist = stats.poisson(mu=lam)

# Probability of exactly 8 tickets in an hour
print(f"P(X = 8) = {dist.pmf(8):.4f}")

# Probability of 10 or more tickets
print(f"P(X >= 10) = {1 - dist.cdf(9):.4f}")

Other Important Distributions

Distribution   | Use Case                                               | Parameters
Uniform        | All outcomes equally likely (random number generation) | a (min), b (max)
Exponential    | Time between events (wait times, lifespans)            | λ (rate)
t-distribution | Small sample sizes, heavier tails than normal          | df (degrees of freedom)
Chi-square     | Goodness-of-fit tests, categorical data analysis       | df (degrees of freedom)
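A brief scipy sketch of these four distributions (the parameter values below are illustrative, not from the table):

```python
from scipy import stats

# Uniform on [loc, loc + scale]: here [0, 10]
uniform = stats.uniform(loc=0, scale=10)
print(f"Uniform P(X < 3)   = {uniform.cdf(3):.3f}")  # 0.300

# Exponential with rate lambda = 0.5 (scipy uses scale = 1/lambda)
expo = stats.expon(scale=1 / 0.5)
print(f"Exponential mean   = {expo.mean():.1f}")  # 2.0

# t-distribution with 10 degrees of freedom has heavier tails than normal
t10 = stats.t(df=10)
print(f"t(10)  P(T > 2)    = {1 - t10.cdf(2):.4f}")
print(f"Normal P(Z > 2)    = {1 - stats.norm.cdf(2):.4f}")  # smaller tail

# Chi-square with 3 degrees of freedom: mean equals df
chi2 = stats.chi2(df=3)
print(f"Chi-square(3) mean = {chi2.mean():.1f}")  # 3.0
```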

The Central Limit Theorem

The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It states:

Central Limit Theorem: Regardless of the original population distribution, the distribution of sample means approaches a normal distribution as the sample size increases (typically n ≥ 30). The mean of sample means equals the population mean, and the standard error equals σ/√n.
Python
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate CLT with a skewed distribution
np.random.seed(42)

# Original population: exponential (very skewed)
population = np.random.exponential(scale=2, size=100000)

# Take 1000 samples of size 30 and compute means
sample_means = [np.random.choice(population, size=30).mean()
                for _ in range(1000)]

# Plot: sample means are approximately normal!
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(population, bins=50, density=True)
axes[0].set_title('Original Population (Skewed)')
axes[1].hist(sample_means, bins=50, density=True)
axes[1].set_title('Distribution of Sample Means (Normal!)')
plt.tight_layout()
plt.show()

The CLT is why so many statistical methods work: hypothesis tests, confidence intervals, and regression all rely on the assumption that sample means are normally distributed, and the CLT ensures this holds approximately for large enough samples.
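The σ/√n standard error claim can itself be checked numerically; a small sketch reusing the exponential population above (scale 2, for which σ = 2):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n = 2.0, 30  # exponential(scale=2) has standard deviation sigma = 2

# Draw 10,000 samples of size n and take each sample's mean
sample_means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)

print(f"Observed standard error = {sample_means.std():.4f}")
print(f"Predicted sigma/sqrt(n) = {sigma / np.sqrt(n):.4f}")  # 2/sqrt(30) ≈ 0.365
```

The spread of the simulated sample means should closely match the σ/√n prediction from the theorem.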