Beginner

Data Collection & Cleaning

Data is rarely ready for analysis out of the box. Learn how to gather data from multiple sources and clean it into a usable format using Python and Pandas.

Data Sources

Data can come from many places. As a data scientist, you will work with a variety of sources depending on the project.

🖧

Databases

SQL databases (PostgreSQL, MySQL) and NoSQL databases (MongoDB) store structured business data.

🔗

APIs

REST APIs provide data from web services — weather, social media, financial markets, and more.

📄

Files

CSV, Excel, JSON, and Parquet files are common formats for sharing and storing datasets.

🌐

Web Scraping

Extract data from websites using libraries like BeautifulSoup and Scrapy when APIs are not available.
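Scraping is not covered in the code below, so here is a minimal sketch of the idea with BeautifulSoup. The HTML snippet and its column names are invented for illustration; a real scrape would fetch the page with `requests` first.

```python
# Minimal BeautifulSoup sketch: parse an inline HTML table into a DataFrame.
# The HTML and column names here are invented for illustration.
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
  <tr><td>Bergen</td><td>290000</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
rows = []
for tr in soup.find_all('tr')[1:]:          # skip the header row
    cells = [td.get_text() for td in tr.find_all('td')]
    rows.append(cells)

df = pd.DataFrame(rows, columns=['city', 'population'])
print(df)
```

For pages that are mostly tables, `pd.read_html` can often do this in one call, but parsing explicitly as above gives you control when the markup is messy.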

Loading Data with Pandas

Python
import pandas as pd

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# JSON files
df = pd.read_json('data.json')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)

# API data
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())

Data Formats Comparison

Format  | Best For                 | Pros                             | Cons
--------|--------------------------|----------------------------------|----------------------------------
CSV     | Tabular data exchange    | Universal, human-readable        | No data types, large files
JSON    | Nested/API data          | Flexible structure, web-standard | Verbose, harder to query
Excel   | Business reports         | Multiple sheets, formatting      | Proprietary, slow with large data
Parquet | Big data storage         | Compressed, fast, typed          | Not human-readable
SQL     | Structured business data | Queryable, relational, ACID      | Requires database setup
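One consequence of the table above is worth seeing directly: CSV stores everything as text, so type information is lost on a round-trip. This sketch writes a DataFrame to an in-memory CSV buffer and reads it back (no files involved).

```python
# Sketch: CSV has no data types, so dtype information is lost on round-trip.
import io
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'price': [9.99, 12.50],
})
print(df.dtypes)       # 'date' is datetime64[ns]

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

df_back = pd.read_csv(buf)
print(df_back.dtypes)  # 'date' comes back as object (plain strings)
```

This is why `pd.read_csv` has a `parse_dates` parameter, and why typed formats like Parquet are preferred for storage between pipeline stages.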

Data Cleaning

Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete data. Practitioners often estimate that it consumes 60-80% of a data scientist's time.

Handling Missing Values

Python
# Detect missing values
print(df.isnull().sum())          # Count per column
print(df.isnull().mean() * 100)  # Percentage per column

# Drop rows with missing values
df_clean = df.dropna()

# Drop rows where specific columns are missing
df_clean = df.dropna(subset=['name', 'email'])

# Fill missing values
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')

# Forward fill (use previous value)
# Note: fillna(method='ffill') and inplace=True on a column are deprecated;
# assign the result back instead.
df['temperature'] = df['temperature'].ffill()

Removing Duplicates

Python
# Check for duplicates
print(df.duplicated().sum())  # Count duplicates

# Remove exact duplicates
df = df.drop_duplicates()

# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=['email'], keep='first')

Handling Outliers

Python
# IQR method for detecting outliers
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]

# Or cap outliers instead of removing
df['price'] = df['price'].clip(lower=lower_bound, upper=upper_bound)

Data Type Conversion

Python
# Convert string dates to datetime
df['date'] = pd.to_datetime(df['date'])

# Clean currency strings and convert to float
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

# Convert to categorical (saves memory for repeated values)
df['status'] = df['status'].astype('category')

# Standardize text
df['name'] = df['name'].str.strip().str.lower()
Pro tip: Always create a data cleaning checklist for your projects: check dtypes, look for nulls, find duplicates, validate ranges, and standardize text. Document every cleaning step for reproducibility.
Important: Never modify your original data file. Always work on a copy and keep the raw data intact. This allows you to re-run your cleaning pipeline if your approach changes.
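The two tips above can be combined into a small reusable pipeline: a function that runs the checklist on a copy and leaves the raw data untouched. The column names and thresholds here are invented for illustration; adapt them to your own dataset.

```python
# Sketch of a cleaning pipeline following the checklist above.
# Column names and the age range are invented for illustration.
import pandas as pd

def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Run the cleaning checklist on a copy, leaving the raw data intact."""
    df = raw.copy()                                    # never touch the original
    df = df.drop_duplicates()                          # find duplicates
    df['name'] = df['name'].str.strip().str.lower()    # standardize text
    df['age'] = df['age'].fillna(df['age'].median())   # handle nulls
    df = df[df['age'].between(0, 120)]                 # validate ranges
    return df

raw = pd.DataFrame({
    'name': ['  Alice ', 'Bob', 'Bob', 'Carol'],
    'age': [30, None, None, 45],
})
cleaned = clean(raw)
print(cleaned)
print(raw)   # raw is unchanged, so the pipeline can be re-run any time
```

Because every step lives in one function, the pipeline itself documents your cleaning decisions and can be re-run from the raw file whenever your approach changes.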