Beginner

Python for Data Science

Python is one of the most widely used languages for data science. This guide covers the essential libraries, data structures, and techniques you will use every day as a data scientist.

Why Python?

Python dominates data science for several reasons:

  • Readable syntax — Easy to learn and write, even for non-programmers
  • Rich ecosystem — Thousands of libraries for data analysis, visualization, and machine learning
  • Community support — Massive community with tutorials, forums, and open-source projects
  • Integration — Works seamlessly with databases, APIs, cloud platforms, and other languages

Essential Libraries

NumPy — Numerical Computing

NumPy provides fast, efficient arrays and mathematical operations. It is the foundation for nearly every data science library in Python.

Python
import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])

# Basic operations
print(arr.mean())    # 3.0
print(arr.std())     # ≈1.414 (population standard deviation, ddof=0)
print(arr.sum())     # 15

# Vectorized operations (much faster than loops)
doubled = arr * 2        # [2, 4, 6, 8, 10]
squared = arr ** 2       # [1, 4, 9, 16, 25]

# Generate data
zeros = np.zeros((3, 3))
random_data = np.random.randn(100)  # 100 samples from a standard normal distribution
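
To make the "faster than loops" claim concrete, the sketch below doubles the same array twice, once with an explicit Python loop and once with a vectorized expression, and times both. The array size and the variable names are arbitrary choices for illustration, and the timings will vary by machine.

```python
import time

import numpy as np

# Illustrative input: 200,000 floats (the size is an arbitrary choice)
values = np.arange(200_000, dtype=np.float64)

# Loop version: one Python-level multiplication per element
start = time.perf_counter()
looped = np.empty_like(values)
for i in range(values.size):
    looped[i] = values[i] * 2
loop_seconds = time.perf_counter() - start

# Vectorized version: a single expression evaluated in compiled code
start = time.perf_counter()
vectorized = values * 2
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
print(np.array_equal(looped, vectorized))  # True: both produce the same result
```

Both versions compute the same values; the vectorized one simply moves the per-element work out of the Python interpreter.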

Pandas — Data Manipulation

Pandas is the workhorse of data science. It provides the DataFrame — a powerful table-like data structure for loading, cleaning, transforming, and analyzing data.

Python
import pandas as pd

# Load data from a CSV file
df = pd.read_csv('sales_data.csv')

# Quick overview
print(df.head())        # First 5 rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # Column data types
print(df.describe())    # Summary statistics

# Select columns
names = df['name']
subset = df[['name', 'price', 'quantity']]

# Filter rows
expensive = df[df['price'] > 100]
recent = df[df['date'] >= '2025-01-01']  # assumes ISO-formatted (YYYY-MM-DD) strings or datetime values
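
The snippet above depends on a sales_data.csv file. As a minimal sketch that runs without it, the example below builds a small DataFrame in memory (the rows are invented for illustration) and shows that a filter is just a boolean mask, and that masks can be combined.

```python
import pandas as pd

# Hypothetical stand-in for sales_data.csv; the rows are made up
df = pd.DataFrame({
    "name": ["Widget", "Gadget", "Gizmo", "Doohickey"],
    "price": [25.0, 150.0, 99.5, 310.0],
    "quantity": [10, 2, 5, 1],
    "date": pd.to_datetime(["2024-11-03", "2025-01-15", "2025-02-01", "2024-12-20"]),
})

# A comparison produces a boolean Series with one True/False per row
is_expensive = df["price"] > 100
expensive = df[is_expensive]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
expensive_and_recent = df[(df["price"] > 100) & (df["date"] >= "2025-01-01")]

print(expensive["name"].tolist())             # ['Gadget', 'Doohickey']
print(expensive_and_recent["name"].tolist())  # ['Gadget']
```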

Grouping and Aggregation

One of Pandas' most powerful features is groupby(), which lets you split data into groups, apply functions, and combine results.

Python
# Group by category and calculate stats
category_stats = df.groupby('category')['price'].agg(['mean', 'sum', 'count'])

# Multiple aggregations
summary = df.groupby('region').agg({
    'revenue': 'sum',
    'quantity': 'mean',
    'order_id': 'count'
}).rename(columns={'order_id': 'num_orders'})

# Sort results
top_regions = summary.sort_values('revenue', ascending=False)
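
The split-apply-combine idea can be tried on a small in-memory table. The sketch below uses invented order data and pandas' named aggregation syntax, which names each output column directly and so avoids the separate rename() step; treat it as one possible way to write the same summary, not the only one.

```python
import pandas as pd

# Hypothetical order data, invented for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "region": ["East", "West", "East", "West", "East"],
    "revenue": [100.0, 250.0, 50.0, 150.0, 30.0],
    "quantity": [2, 5, 1, 3, 3],
})

# Named aggregation: output_column=(input_column, function)
summary = orders.groupby("region").agg(
    revenue=("revenue", "sum"),
    avg_quantity=("quantity", "mean"),
    num_orders=("order_id", "count"),
)

# Highest-revenue region first
top_regions = summary.sort_values("revenue", ascending=False)
print(top_regions)
```

Here West (revenue 400.0 from 2 orders) sorts ahead of East (revenue 180.0 from 3 orders).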

Matplotlib & Seaborn — Visualization

Matplotlib is the foundational plotting library. Seaborn builds on top of it with statistical visualizations and better defaults.

Python
import matplotlib.pyplot as plt
import seaborn as sns

# Simple line plot
plt.plot(df['date'], df['revenue'])
plt.title('Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Revenue ($)')
plt.show()

# Seaborn scatter plot with regression line
sns.regplot(x='advertising', y='sales', data=df)
plt.title('Advertising vs Sales')
plt.show()
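
sns.regplot draws a scatter plot and overlays a fitted linear regression line. As a rough sketch of what that involves, the example below generates synthetic advertising and sales numbers (invented for illustration), fits a straight line with np.polyfit, and draws both with Matplotlib alone. It renders to an in-memory buffer so it runs without a display; pass a filename to savefig instead to write an image to disk.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: sales grow roughly 3 units per unit of advertising, plus noise
rng = np.random.default_rng(0)
advertising = rng.uniform(0, 100, size=50)
sales = 3.0 * advertising + rng.normal(0, 10, size=50)

# Least-squares straight-line fit: sales ≈ slope * advertising + intercept
slope, intercept = np.polyfit(advertising, sales, deg=1)

fig, ax = plt.subplots()
ax.scatter(advertising, sales, label="observations")
xs = np.linspace(0, 100, 2)
ax.plot(xs, slope * xs + intercept, color="red", label="least-squares fit")
ax.set_title("Advertising vs Sales")
ax.set_xlabel("Advertising")
ax.set_ylabel("Sales")
ax.legend()

# Render to memory; replace `buffer` with a filename to save a PNG file
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Seaborn's version additionally shades a confidence interval around the line by default, which this sketch leaves out.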

Jupyter Notebooks

Jupyter notebooks are interactive documents that combine code, output, and text. They are the standard tool for data science exploration.

Getting started: Install Jupyter with pip install jupyterlab and launch it with jupyter lab. Alternatively, Google Colab offers a free, browser-based notebook environment.

Python Data Types for Data Science

Type      Python     Pandas dtype   Example
Integer   int        int64          42, -7, 0
Float     float      float64        3.14, -0.5
String    str        object         "hello", "NYC"
Boolean   bool       bool           True, False
DateTime  datetime   datetime64     2025-01-15
Category  (none)     category       "small", "medium", "large"

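
The mapping in the table can be checked on a small DataFrame. The sketch below uses invented column names and values; text columns are left out because the dtype pandas reports for them can depend on the pandas version.

```python
import pandas as pd

# Hypothetical columns, one per row of the table (except String)
sample = pd.DataFrame({
    "units": [42, -7, 0],
    "ratio": [3.14, -0.5, 2.0],
    "in_stock": [True, False, True],
    "ordered_on": pd.to_datetime(["2025-01-15", "2025-02-01", "2025-03-10"]),
    "shirt_size": pd.Categorical(["small", "medium", "large"]),
})

# One dtype per column: integer, float, bool, datetime64, category
print(sample.dtypes)
```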
Python
# Check and convert data types
print(df.dtypes)

# Convert types
df['price'] = df['price'].astype(float)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')

Watch out: Pandas reads numeric values that contain currency symbols or thousands separators (e.g., "$1,234") as text, not numbers, so the column ends up with the object dtype. Always check df.dtypes and clean your data before performing calculations.
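
One way to clean such a column is sketched below, assuming the only stray characters are dollar signs and commas. The column names and sample values are invented for illustration. pd.to_numeric with errors="coerce" turns anything that still fails to parse into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw data: prices stored as text, plus one unparseable entry
raw = pd.DataFrame({"price": ["$1,234", "$56.70", "n/a"]})

# Strip dollar signs and thousands separators, then convert to numbers
stripped = raw["price"].str.replace(r"[$,]", "", regex=True)
raw["price_clean"] = pd.to_numeric(stripped, errors="coerce")

print(raw)
print(raw["price_clean"].dtype)  # float64
```

After the conversion, count the NaN values (for example with raw["price_clean"].isna().sum()) to see how many entries could not be parsed.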