Probability & Distributions
Probability is the mathematical language of uncertainty. Understanding probability distributions is essential for statistical modeling, hypothesis testing, and making predictions with confidence.
Probability Basics
Probability measures how likely an event is to occur, expressed as a number between 0 (impossible) and 1 (certain).
P(A) = Number of favorable outcomes / Total number of outcomes

Example: rolling a 6 on a fair die: P(6) = 1/6 ≈ 0.167 = 16.7%

Key rules:

```
P(A or B)  = P(A) + P(B) - P(A and B)   # Addition rule
P(A and B) = P(A) * P(B)                # Only if A and B are independent
P(not A)   = 1 - P(A)                   # Complement rule
```
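The rules above can be checked by enumerating a small sample space directly. A minimal sketch using one roll of a fair die (the events "even" and "greater than 3" are chosen for illustration):

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die
outcomes = set(range(1, 7))
total = len(outcomes)

even = {2, 4, 6}       # Event A: roll is even
gt3 = {4, 5, 6}        # Event B: roll is greater than 3

p_even = Fraction(len(even), total)        # 1/2
p_gt3 = Fraction(len(gt3), total)          # 1/2
p_both = Fraction(len(even & gt3), total)  # {4, 6} -> 1/3

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_even + p_gt3 - p_both
print(p_either)                            # 2/3, matching |{2, 4, 5, 6}| / 6

# Complement rule: P(not A) = 1 - P(A)
print(1 - p_even)                          # 1/2
```

Counting the union directly (`len(even | gt3) / total`) gives the same answer, which is exactly what the addition rule guarantees.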
Conditional Probability
Conditional probability is the probability of an event given that another event has already occurred. It is written as P(A|B) — "the probability of A given B."
P(A|B) = P(A and B) / P(B)
Example: what is the probability a customer buys, given that they visited the product page?

```
P(Buy | Visit) = P(Buy and Visit) / P(Visit)
               = 0.05 / 0.30 ≈ 0.167 = 16.7%
```
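The same calculation can be done from raw event counts, which is how conditional probabilities usually arise in practice. A short sketch using illustrative numbers consistent with the example (1,000 hypothetical site visitors):

```python
# Illustrative counts for 1,000 visitors (hypothetical data)
n_total = 1000
n_visit = 300            # Visited the product page: P(Visit) = 0.30
n_buy_and_visit = 50     # Visited AND bought: P(Buy and Visit) = 0.05

# P(Buy | Visit) = P(Buy and Visit) / P(Visit)
p_visit = n_visit / n_total
p_buy_and_visit = n_buy_and_visit / n_total
p_buy_given_visit = p_buy_and_visit / p_visit

print(f"P(Buy | Visit) = {p_buy_given_visit:.1%}")  # 16.7%
```

Note that conditioning on Visit is the same as restricting the sample space to visitors: 50 buyers out of 300 visitors gives the same 16.7%.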
Bayes' Theorem
Bayes' theorem lets you update probabilities as you receive new evidence. It is the foundation of Bayesian statistics and many machine learning algorithms.
```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)

# Example: medical test accuracy
# Disease prevalence: 1%
# Test sensitivity (true positive rate): 95%
# Test specificity (true negative rate): 90%
p_disease = 0.01                 # Prior probability
p_positive_given_disease = 0.95  # Sensitivity
p_positive_given_healthy = 0.10  # False positive rate (1 - specificity)

# Total probability of a positive test (law of total probability)
p_positive = (p_positive_given_disease * p_disease
              + p_positive_given_healthy * (1 - p_disease))

# Probability of disease given a positive test
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive
print(f"P(Disease | Positive test) = {p_disease_given_positive:.1%}")
# Result: ~8.8% - surprisingly low!
```
Probability Distributions
A probability distribution describes how the values of a random variable are spread out. Different distributions model different types of data.
Normal Distribution (Gaussian)
The most important distribution in statistics. Many natural phenomena follow a bell-shaped curve. It is defined by two parameters: mean (μ) and standard deviation (σ).
```python
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# Create a normal distribution
mu, sigma = 100, 15  # IQ scores: mean=100, std=15
dist = stats.norm(loc=mu, scale=sigma)

# Probability of scoring below 130
print(f"P(IQ < 130) = {dist.cdf(130):.4f}")  # ~0.9772

# Probability of scoring between 85 and 115
p = dist.cdf(115) - dist.cdf(85)
print(f"P(85 < IQ < 115) = {p:.4f}")  # ~0.6827 (68-95-99.7 rule)

# Generate random samples
samples = np.random.normal(mu, sigma, 10000)
plt.hist(samples, bins=50, density=True, alpha=0.7)
plt.title('Normal Distribution (IQ Scores)')
plt.show()
```
Binomial Distribution
Models the number of successes in a fixed number of independent trials, each with the same probability of success. Think: coin flips, pass/fail rates, click-through rates.
```python
from scipy import stats

# Binomial: n trials, each with probability p of success
n, p = 100, 0.3  # 100 emails, 30% open rate
dist = stats.binom(n=n, p=p)

# Probability of exactly 35 opens
print(f"P(X = 35) = {dist.pmf(35):.4f}")

# Probability of 25 or fewer opens
print(f"P(X <= 25) = {dist.cdf(25):.4f}")

# Expected value and standard deviation
print(f"Mean = {dist.mean():.1f}, Std = {dist.std():.1f}")
```
Poisson Distribution
Models the number of events in a fixed time period when events occur independently at a constant average rate. Think: website visits per hour, defects per batch, calls per day.
```python
from scipy import stats

# Poisson: average rate (lambda) of events per interval
lam = 5  # Average 5 support tickets per hour
dist = stats.poisson(mu=lam)

# Probability of exactly 8 tickets in an hour
print(f"P(X = 8) = {dist.pmf(8):.4f}")

# Probability of 10 or more tickets
print(f"P(X >= 10) = {1 - dist.cdf(9):.4f}")
```
Other Important Distributions
| Distribution | Use Case | Parameters |
|---|---|---|
| Uniform | All outcomes equally likely (random number generation) | a (min), b (max) |
| Exponential | Time between events (wait times, lifespans) | λ (rate) |
| t-distribution | Small sample sizes, heavier tails than normal | df (degrees of freedom) |
| Chi-square | Goodness-of-fit tests, categorical data analysis | df (degrees of freedom) |
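The distributions in the table above are all available in `scipy.stats` under the same interface used earlier (`cdf`, `pmf`/`pdf`, `mean`, etc.). A brief sketch with illustrative parameters:

```python
from scipy import stats

# Uniform on [0, 10]: every value equally likely
# (scipy parameterizes as loc=a, scale=b-a)
uniform = stats.uniform(loc=0, scale=10)
print(uniform.cdf(2.5))        # 0.25 -- a quarter of the range

# Exponential with rate lambda = 0.5 (scipy uses scale = 1/lambda)
expon = stats.expon(scale=1 / 0.5)
print(expon.mean())            # 2.0 -- mean wait time is 1/lambda

# t-distribution with 10 degrees of freedom: heavier tails than normal
t = stats.t(df=10)
print(t.ppf(0.975))            # ~2.23, vs ~1.96 for the standard normal

# Chi-square with 3 degrees of freedom
chi2 = stats.chi2(df=3)
print(chi2.mean())             # 3.0 -- the mean equals df
```

The heavier tails of the t-distribution show up directly in the 97.5th percentile: you need a wider interval than the normal's 1.96 standard deviations to capture the same probability.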
The Central Limit Theorem
The Central Limit Theorem (CLT) is arguably the most important theorem in statistics. It states that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the original population's distribution.
```python
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate the CLT with a skewed distribution
np.random.seed(42)

# Original population: exponential (very skewed)
population = np.random.exponential(scale=2, size=100000)

# Take 1000 samples of size 30 and compute their means
sample_means = [np.random.choice(population, size=30).mean()
                for _ in range(1000)]

# Plot: the sample means are approximately normal!
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(population, bins=50, density=True)
axes[0].set_title('Original Population (Skewed)')
axes[1].hist(sample_means, bins=50, density=True)
axes[1].set_title('Distribution of Sample Means (Normal!)')
plt.tight_layout()
plt.show()
```
The CLT is why so many statistical methods work: hypothesis tests, confidence intervals, and regression all rely on the assumption that sample means are normally distributed, and the CLT guarantees this for large enough samples.
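The CLT also quantifies how precise a sample mean is: the standard deviation of the sample means (the standard error) shrinks like σ/√n. A quick sketch comparing the observed spread against that prediction, reusing the skewed exponential population from above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population; for an exponential, the standard deviation equals the scale
population = rng.exponential(scale=2, size=100_000)

# The CLT predicts the spread of sample means shrinks like sigma / sqrt(n)
for n in (10, 100, 1000):
    means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    observed = np.std(means)
    predicted = population.std() / np.sqrt(n)
    print(f"n={n:4d}  observed SE={observed:.3f}  predicted={predicted:.3f}")
```

Quadrupling the precision of an estimate therefore requires sixteen times the data, which is why large samples matter so much in practice.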