Data Collection & Cleaning
Data is rarely ready for analysis out of the box. Learn how to gather data from multiple sources and clean it into a usable format using Python and Pandas.
Data Sources
Data can come from many places. As a data scientist, you will work with a variety of sources depending on the project.
Databases
SQL databases (PostgreSQL, MySQL) and NoSQL databases (MongoDB) store structured business data.
APIs
REST APIs provide data from web services — weather, social media, financial markets, and more.
Files
CSV, Excel, JSON, and Parquet files are common formats for sharing and storing datasets.
Web Scraping
Extract data from websites using libraries like BeautifulSoup and Scrapy when APIs are not available.
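To illustrate the idea behind scraping without any third-party dependencies, here is a minimal sketch using only the standard library's `html.parser`: it collects the text of every `<td>` cell from a hardcoded HTML table. The HTML snippet is made up for illustration; real projects would reach for BeautifulSoup or Scrapy as mentioned above.

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collects the text content of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data.strip())

html = "<table><tr><td>Alice</td><td>30</td></tr><tr><td>Bob</td><td>25</td></tr></table>"
parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # ['Alice', '30', 'Bob', '25']
```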
Loading Data with Pandas
```python
import pandas as pd

# CSV files
df = pd.read_csv('data.csv')

# Excel files
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')

# JSON files
df = pd.read_json('data.json')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)

# API data
import requests
response = requests.get('https://api.example.com/data')
df = pd.DataFrame(response.json())
```
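If pandas is not available, the same SQL load can be done with the standard library alone. The sketch below builds an in-memory SQLite database (the `customers` table and its columns are hypothetical, echoing the query above) and fetches the rows as plain tuples:

```python
import sqlite3

# In-memory database stands in for a real file like 'database.db'
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("Alice", "Berlin"), ("Bob", "Lagos")])

# Fetch all rows as a list of tuples
rows = conn.execute("SELECT name, city FROM customers").fetchall()
print(rows)  # [('Alice', 'Berlin'), ('Bob', 'Lagos')]
conn.close()
```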
Data Formats Comparison
| Format | Best For | Pros | Cons |
|---|---|---|---|
| CSV | Tabular data exchange | Universal, human-readable | No data types, large files |
| JSON | Nested/API data | Flexible structure, web-standard | Verbose, harder to query |
| Excel | Business reports | Multiple sheets, formatting | Proprietary, slow with large data |
| Parquet | Big data storage | Compressed, fast, typed | Not human-readable |
| SQL | Structured business data | Queryable, relational, ACID | Requires database setup |
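The "no data types" drawback of CSV in the table above is easy to demonstrate with the standard library: a CSV round-trip turns every value into a string, while JSON keeps numbers as numbers. The sample record below is made up for illustration.

```python
import csv
import io
import json

records = [{"name": "widget", "price": 9.99}]

# Round-trip through CSV
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(records)
buf.seek(0)
back_from_csv = list(csv.DictReader(buf))
print(type(back_from_csv[0]["price"]))  # <class 'str'> — the float is now text

# Round-trip through JSON
back_from_json = json.loads(json.dumps(records))
print(type(back_from_json[0]["price"]))  # <class 'float'> — type preserved
```

This is why numeric columns loaded from CSV often need explicit type conversion, as covered in the cleaning section below.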
Data Cleaning
Data cleaning is the process of fixing or removing incorrect, corrupted, duplicate, or incomplete data. It is often estimated to take 60-80% of a data scientist's time.
Handling Missing Values
```python
# Detect missing values
print(df.isnull().sum())          # count per column
print(df.isnull().mean() * 100)   # percentage per column

# Drop rows with any missing values
df_clean = df.dropna()

# Drop rows where specific columns are missing
df_clean = df.dropna(subset=['name', 'email'])

# Fill missing values (assign back rather than using inplace=True,
# which is deprecated for column operations in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna('Unknown')

# Forward fill (use the previous value);
# ffill() replaces the deprecated fillna(method='ffill')
df['temperature'] = df['temperature'].ffill()
```
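To see what median imputation does under the hood, here is a pure-Python sketch (the sample ages are made up): compute the median of the observed values, then substitute it for each missing entry.

```python
import statistics

ages = [25, None, 40, 31, None, 28]

# Median of the non-missing values only
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)  # (28 + 31) / 2 = 29.5

# Replace each None with the median
filled = [median_age if a is None else a for a in ages]
print(filled)  # [25, 29.5, 40, 31, 29.5, 28]
```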
Removing Duplicates
```python
# Check for duplicates
print(df.duplicated().sum())  # number of duplicate rows

# Remove exact duplicates
df = df.drop_duplicates()

# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=['email'], keep='first')
```
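The `keep='first'` behaviour can be sketched in plain Python: remember which keys have been seen and keep only the first record for each. The sample rows and the `email` key mirror the pandas call above but are otherwise made up.

```python
rows = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "b@x.com", "name": "Ben"},
    {"email": "a@x.com", "name": "Ann B."},  # duplicate email
]

seen = set()
deduped = []
for row in rows:
    if row["email"] not in seen:  # keep only the first occurrence
        seen.add(row["email"])
        deduped.append(row)

print(len(deduped))  # 2 — the later a@x.com record is dropped
```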
Handling Outliers
```python
# IQR method for detecting outliers
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]

# Or cap outliers instead of removing them
df['price'] = df['price'].clip(lower=lower_bound, upper=upper_bound)
```
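The same IQR rule works without pandas. In this sketch the sample prices are made up, and `statistics.quantiles(..., n=4)` returns the three quartile cut points (note its default interpolation differs slightly from pandas' `quantile`):

```python
import statistics

prices = [10, 12, 11, 13, 12, 95, 11, 14, 10, 12]

# Quartiles and the 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(prices, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

inliers = [p for p in prices if lower <= p <= upper]
print(inliers)  # 95 falls outside the fences and is dropped
```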
Data Type Conversion
```python
# Convert string dates to datetime
df['date'] = pd.to_datetime(df['date'])

# Clean currency strings and convert to float
# (regex=False treats '$' as a literal character, not a pattern)
df['price'] = (df['price']
               .str.replace('$', '', regex=False)
               .str.replace(',', '', regex=False)
               .astype(float))

# Convert to categorical (saves memory for repeated values)
df['status'] = df['status'].astype('category')

# Standardize text
df['name'] = df['name'].str.strip().str.lower()
```
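The currency cleanup is the same idea in plain Python: strip the `$` sign and thousands separators, then convert. The sample strings are made up.

```python
raw_prices = ["$1,234.50", "$99.00", "$12,000"]

# Remove '$' and ',' then parse as float
clean = [float(p.replace("$", "").replace(",", "")) for p in raw_prices]
print(clean)  # [1234.5, 99.0, 12000.0]
```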