# Python for Data Science
Python is the most popular language for data science. Learn the essential libraries, data structures, and techniques you will use every day as a data scientist.
## Why Python?
Python dominates data science for several reasons:
- Readable syntax — Easy to learn and write, even for non-programmers
- Rich ecosystem — Thousands of libraries for data analysis, visualization, and machine learning
- Community support — Massive community with tutorials, forums, and open-source projects
- Integration — Works seamlessly with databases, APIs, cloud platforms, and other languages
## Essential Libraries

### NumPy — Numerical Computing
NumPy provides fast, efficient arrays and mathematical operations. It is the foundation for nearly every data science library in Python.
```python
import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])

# Basic operations
print(arr.mean())  # 3.0
print(arr.std())   # ~1.414
print(arr.sum())   # 15

# Vectorized operations (much faster than loops)
doubled = arr * 2   # [2, 4, 6, 8, 10]
squared = arr ** 2  # [1, 4, 9, 16, 25]

# Generate data
zeros = np.zeros((3, 3))
random_data = np.random.randn(100)  # 100 random numbers
```
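Vectorized operations also work between arrays of different shapes through broadcasting, where NumPy stretches the smaller array across the larger one. A minimal sketch (the variable names here are illustrative, not from the example above):

```python
import numpy as np

row = np.array([1, 2, 3])           # shape (3,)
col = np.array([[10], [20], [30]])  # shape (3, 1)

# Broadcasting combines the (3,) row and (3, 1) column into a (3, 3) grid,
# with no explicit loop
grid = col + row
print(grid)
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]
```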
### Pandas — Data Manipulation
Pandas is the workhorse of data science. It provides the DataFrame — a powerful table-like data structure for loading, cleaning, transforming, and analyzing data.
```python
import pandas as pd

# Load data from a CSV file
df = pd.read_csv('sales_data.csv')

# Quick overview
print(df.head())      # First 5 rows
print(df.shape)       # (rows, columns)
print(df.dtypes)      # Column data types
print(df.describe())  # Summary statistics

# Select columns
names = df['name']
subset = df[['name', 'price', 'quantity']]

# Filter rows
expensive = df[df['price'] > 100]
recent = df[df['date'] >= '2025-01-01']
```
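Real CSVs are rarely clean, and Pandas' cleaning tools are part of the same workflow. A small self-contained sketch with made-up data (the `name` and `price` columns are hypothetical):

```python
import pandas as pd
import numpy as np

# A tiny frame with gaps, standing in for a messy CSV
df = pd.DataFrame({
    'name': ['Widget', 'Gadget', None, 'Doohickey'],
    'price': [19.99, np.nan, 4.50, 12.00],
})

print(df.isna().sum())               # Count missing values per column
filled = df.fillna({'price': 0.0})   # Fill numeric gaps with a default
dropped = df.dropna(subset=['name']) # Drop rows missing a name
print(dropped.shape)                 # (3, 2)
```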
### Grouping and Aggregation

One of Pandas' most powerful features is `groupby()`, which lets you split data into groups, apply functions, and combine results.
```python
# Group by category and calculate stats
category_stats = df.groupby('category')['price'].agg(['mean', 'sum', 'count'])

# Multiple aggregations
summary = df.groupby('region').agg({
    'revenue': 'sum',
    'quantity': 'mean',
    'order_id': 'count',
}).rename(columns={'order_id': 'num_orders'})

# Sort results
top_regions = summary.sort_values('revenue', ascending=False)
```
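The snippets above assume a sales DataFrame loaded from CSV; here is the same split-apply-combine pattern as a tiny self-contained run with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['toys', 'toys', 'books', 'books', 'books'],
    'price': [10.0, 20.0, 5.0, 15.0, 10.0],
})

# Split rows into groups by category, aggregate each group, combine results
stats = df.groupby('category')['price'].agg(['mean', 'sum', 'count'])
print(stats)
#           mean   sum  count
# category
# books     10.0  30.0      3
# toys      15.0  30.0      2
```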
### Matplotlib & Seaborn — Visualization
Matplotlib is the foundational plotting library. Seaborn builds on top of it with statistical visualizations and better defaults.
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Simple line plot
plt.plot(df['date'], df['revenue'])
plt.title('Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Revenue ($)')
plt.show()

# Seaborn scatter plot with regression line
sns.regplot(x='advertising', y='sales', data=df)
plt.title('Advertising vs Sales')
plt.show()
```
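`plt.show()` pops up an inline figure in a notebook; in a script you usually save the figure to a file instead. A minimal matplotlib-only sketch with made-up revenue numbers (the `Agg` backend renders without a display):

```python
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend: render to files, no window
import matplotlib.pyplot as plt

revenue = [120, 135, 150, 160, 180]

plt.plot(range(1, 6), revenue, marker='o')
plt.title('Revenue Over Time')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.savefig('revenue.png', dpi=150, bbox_inches='tight')
plt.close()
```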
## Jupyter Notebooks
Jupyter notebooks are interactive documents that combine code, output, and text. They are the standard tool for data science exploration.
Install with `pip install jupyterlab` and launch with `jupyter lab`. Or use Google Colab for a free, browser-based notebook environment.

## Python Data Types for Data Science
| Type | Python | Pandas dtype | Example |
|---|---|---|---|
| Integer | `int` | `int64` | `42`, `-7`, `0` |
| Float | `float` | `float64` | `3.14`, `-0.5` |
| String | `str` | `object` | `"hello"`, `"NYC"` |
| Boolean | `bool` | `bool` | `True`, `False` |
| DateTime | `datetime` | `datetime64` | `2025-01-15` |
| Category | — | `category` | `"small"`, `"medium"`, `"large"` |
```python
# Check and convert data types
print(df.dtypes)

# Convert types
df['price'] = df['price'].astype(float)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
```
Note that Pandas stores text columns as the generic `object` type by default. Always check `df.dtypes` and clean your data before performing calculations.
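When a numeric column contains stray strings, `astype(float)` raises an error; `pd.to_numeric` with `errors='coerce'` turns unparseable values into `NaN` instead so you can inspect or fill them. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'price': ['19.99', 'N/A', '4.50']})
print(df.dtypes)  # price is object (strings)

# Coerce: unparseable entries become NaN rather than raising an error
df['price'] = pd.to_numeric(df['price'], errors='coerce')
print(df['price'].isna().sum())  # 1
print(df['price'].dtype)         # float64
```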
Lilly Tech Systems