Data Collection & Cleaning
Data is rarely ready for analysis out of the box. Learn how to gather data from multiple sources and clean it into a usable format using Python and Pandas.
Data Sources
Data can come from many places. As a data scientist, you will work with a variety of sources depending on the project.
Databases
SQL databases (PostgreSQL, MySQL) and NoSQL databases (MongoDB) store structured business data.
APIs
REST APIs provide data from web services — weather, social media, financial markets, and more.
Files
CSV, Excel, JSON, and Parquet files are common formats for sharing and storing datasets.
Web Scraping
Extract data from websites using libraries like BeautifulSoup and Scrapy when APIs are not available.
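To illustrate the idea behind scraping without any third-party dependencies, here is a minimal sketch using only the standard library's `html.parser`: it collects the text of every `<td>` cell from a hardcoded HTML table. The HTML snippet is made up for illustration; real projects would reach for BeautifulSoup or Scrapy as mentioned above.

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collects the text content of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

html = "<table><tr><td>Alice</td><td>30</td></tr><tr><td>Bob</td><td>25</td></tr></table>"
parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # ['Alice', '30', 'Bob', '25']
```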
Loading Data with Pandas
```python
import pandas as pd

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# JSON files
df = pd.read_json('data.json')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)

# API data
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
```
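If pandas is not available, the same SQL load can be done with the standard library alone. The sketch below builds an in-memory SQLite database (the `customers` table and its columns are hypothetical, echoing the query above) and fetches the rows as plain tuples:

```python
import sqlite3

# In-memory database stands in for a real file like 'database.db'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Alice", "Berlin"), ("Bob", "Lagos")])

# Fetch all rows as a list of tuples
rows = conn.execute("SELECT name, city FROM customers").fetchall()
print(rows)  # [('Alice', 'Berlin'), ('Bob', 'Lagos')]
conn.close()
```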
Data Formats Comparison
| Format | Best For | Pros | Cons |
|---|---|---|---|
| CSV | Tabular data exchange | Universal, human-readable | No data types, large files |
| JSON | Nested/API data | Flexible structure, web-standard | Verbose, harder to query |
| Excel | Business reports | Multiple sheets, formatting | Proprietary, slow with large data |
| Parquet | Big data storage | Compressed, fast, typed | Not human-readable |
| SQL | Structured business data | Queryable, relational, ACID | Requires database setup |
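The "no data types" drawback of CSV in the table above is easy to demonstrate with the standard library: a CSV round-trip turns every value into a string, while JSON keeps numbers as numbers. The sample record below is made up for illustration.

```python
import csv
import io
import json

records = [{"name": "widget", "price": 9.99}]

# Round-trip through CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
back_from_csv = list(csv.DictReader(buf))
print(type(back_from_csv[0]["price"]))  # <class 'str'> — the float is now text

# Round-trip through JSON
back_from_json = json.loads(json.dumps(records))
print(type(back_from_json[0]["price"]))  # <class 'float'> — type preserved
```

This is why numeric columns loaded from CSV often need explicit type conversion, as covered in the cleaning section below.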
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete data. It is often estimated to take 60-80% of a data scientist's time.
Handling Missing Values
```python
# Detect missing values
print(df.isnull().sum())          # count per column
print(df.isnull().mean() * 100)   # percentage per column

# Drop rows with any missing values
df_clean = df.dropna()

# Drop rows where specific columns are missing
df_clean = df.dropna(subset=['name', 'email'])

# Fill missing values (assign back rather than using inplace=True,
# which is deprecated for column operations in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')

# Forward fill (use the previous value);
# ffill() replaces the deprecated fillna(method='ffill')
df['temperature'] = df['temperature'].ffill()
```
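To see what median imputation does under the hood, here is a pure-Python sketch (the sample ages are made up): compute the median of the observed values, then substitute it for each missing entry.

```python
import statistics

ages = [25, None, 40, 31, None, 28]

# Median of the non-missing values only
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)  # (28 + 31) / 2 = 29.5

# Replace each None with the median
filled = [median_age if a is None else a for a in ages]
print(filled)  # [25, 29.5, 40, 31, 29.5, 28]
```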
Removing Duplicates
```python
# Check for duplicates
print(df.duplicated().sum())  # number of duplicate rows

# Remove exact duplicates
df = df.drop_duplicates()

# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=['email'], keep='first')
```
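The `keep='first'` behaviour can be sketched in plain Python: remember which keys have been seen and keep only the first record for each. The sample rows and the `email` key mirror the pandas call above but are otherwise made up.

```python
rows = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "b@x.com", "name": "Ben"},
    {"email": "a@x.com", "name": "Ann B."},  # duplicate email
]

seen = set()
deduped = []
for row in rows:
    if row["email"] not in seen:  # keep only the first occurrence
        seen.add(row["email"])
        deduped.append(row)

print(len(deduped))  # 2 — the later a@x.com record is dropped
```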
Handling Outliers
```python
# IQR method for detecting outliers
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]

# Or cap outliers instead of removing them
df['price'] = df['price'].clip(lower=lower_bound, upper=upper_bound)
```
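The same IQR rule works without pandas. In this sketch the sample prices are made up, and `statistics.quantiles(..., n=4)` returns the three quartile cut points (note its default interpolation differs slightly from pandas' `quantile`):

```python
import statistics

prices = [10, 12, 11, 13, 12, 95, 11, 14, 10, 12]

# Quartiles and the 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

inliers = [p for p in prices if lower <= p <= upper]
print(inliers)  # 95 falls outside the fences and is dropped
```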
Data Type Conversion
```python
# Convert string dates to datetime
df['date'] = pd.to_datetime(df['date'])

# Clean currency strings and convert to float
# (regex=False treats '$' as a literal character, not a pattern)
df['price'] = (df['price']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))

# Convert to categorical (saves memory for repeated values)
df['status'] = df['status'].astype('category')

# Standardize text
df['name'] = df['name'].str.strip().str.lower()
```
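The currency cleanup is the same idea in plain Python: strip the `$` sign and thousands separators, then convert. The sample strings are made up.

```python
raw_prices = ["$1,234.50", "$99.00", "$12,000"]

# Remove '$' and ',' then parse as float
clean = [float(p.replace("$", "").replace(",", "")) for p in raw_prices]
print(clean)  # [1234.5, 99.0, 12000.0]
```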