Advanced

Best Practices

Write efficient, readable SQL and connect databases to Python for seamless data science workflows.

SQL Style Guide

  • Use UPPERCASE for SQL keywords: SELECT, FROM, WHERE.
  • Use snake_case for table and column names: order_items, user_id.
  • Put each major clause on its own line: SELECT, FROM, WHERE, GROUP BY.
  • Indent sub-clauses and columns for readability.
  • Use meaningful table aliases: orders o, users u.
  • Always use explicit JOIN syntax instead of comma-separated tables.
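Taken together, the conventions above produce queries like this one (the users/orders/order_items schema is illustrative):

```sql
SELECT
    u.user_id,
    u.name,
    SUM(oi.quantity * oi.unit_price) AS total_spent
FROM users u
JOIN orders o ON o.user_id = u.user_id
JOIN order_items oi ON oi.order_id = o.order_id
WHERE o.order_date >= '2024-01-01'
GROUP BY u.user_id, u.name
ORDER BY total_spent DESC;
```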

Query Optimization

  1. Select Only Needed Columns

    Avoid SELECT *. Specify only the columns you need to reduce data transfer and memory usage.

  2. Filter Early

    Apply WHERE conditions to reduce the number of rows processed before JOINs and aggregations.

  3. Use EXPLAIN

    Prefix your query with EXPLAIN to see the planned execution strategy, or EXPLAIN ANALYZE to actually run it and compare the planner's estimates against real row counts and timings, so you can identify bottlenecks.

  4. Index Wisely

    Create indexes on columns used in WHERE, JOIN, and ORDER BY clauses. But don't over-index — each index slows writes.
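As a sketch, assuming a hypothetical orders table that is commonly filtered by user_id and by date:

```sql
-- Hypothetical orders table, filtered by user_id and order_date
CREATE INDEX idx_orders_user_id ON orders (user_id);
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- A composite index can replace both when queries filter on user_id
-- (alone or together with order_date)
CREATE INDEX idx_orders_user_date ON orders (user_id, order_date);
```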

Understanding EXPLAIN

SQL
EXPLAIN ANALYZE
SELECT u.name, COUNT(*) AS order_count
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.order_date >= '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 10;
Look for:

  • Sequential scans on large tables: add an index.
  • Estimated row counts far from the actual rows: update statistics (run ANALYZE).
  • Nested loop joins between large tables: consider hash joins.
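PostgreSQL's EXPLAIN output is the most detailed, but the scan-versus-index distinction is easy to see even in SQLite, which uses EXPLAIN QUERY PLAN. A minimal sketch, using an in-memory database and a made-up orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_date TEXT)"
)

query = "SELECT * FROM orders WHERE user_id = 42"

# No index yet: the plan is a full table scan
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before[0][-1])  # detail column mentions SCAN

# Index the filtered column, then re-check the plan
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after[0][-1])   # detail column mentions SEARCH ... USING INDEX

conn.close()
```

The habit transfers directly to PostgreSQL: check the plan before and after adding an index rather than assuming the index gets used.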

Connecting SQL to Python

Python - SQLAlchemy + pandas
import pandas as pd
from sqlalchemy import create_engine

# Create database connection
engine = create_engine("postgresql://user:pass@localhost:5432/mydb")

# Read SQL query into a DataFrame
query = """
SELECT category, AVG(price) AS avg_price, COUNT(*) AS count
FROM products
GROUP BY category
HAVING COUNT(*) > 5
ORDER BY avg_price DESC
"""
df = pd.read_sql(query, engine)
print(df.head())

# Write DataFrame back to database
df.to_sql("summary_table", engine, if_exists="replace", index=False)
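read_sql also accepts bound parameters, so values are never pasted into the SQL string. A sketch with SQLite standing in for the server (the products table here is invented); with a SQLAlchemy engine you would use the driver's own placeholder style instead of ?:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("books", 10.0), ("books", 20.0), ("toys", 5.0)],
)

# The ? placeholder is filled in by the driver, not by string formatting
df = pd.read_sql(
    "SELECT * FROM products WHERE category = ?", conn, params=("books",)
)
print(len(df))  # 2
conn.close()
```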
Python - SQLite (no server needed)
import sqlite3
import pandas as pd

# Connect to SQLite file
conn = sqlite3.connect("data.db")

# Read into DataFrame
df = pd.read_sql("SELECT * FROM sales WHERE year = 2024", conn)

# Always close the connection
conn.close()
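One caveat on closing: using a sqlite3 connection as a `with` context manager only manages transactions (commit/rollback), not closing. To guarantee the close even when a query raises, wrap the connection in contextlib.closing — a sketch using an in-memory database:

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() on exit, whether or not an error occurred
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE sales (year INTEGER, amount REAL)")
    conn.execute("INSERT INTO sales VALUES (2024, 99.5)")
    rows = conn.execute("SELECT * FROM sales WHERE year = 2024").fetchall()

print(rows)  # [(2024, 99.5)]
```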

Common Pitfalls

  • NULL comparisons: Use IS NULL / IS NOT NULL, never = NULL.
  • Cartesian products: Always include a JOIN condition to avoid accidental cross joins.
  • GROUP BY errors: Every non-aggregated column in SELECT must be in GROUP BY.
  • Integer division: In some databases, 5/2 = 2. Cast to float: 5.0/2 or CAST(5 AS FLOAT)/2.
  • SQL injection: Never concatenate user input into SQL strings. Use parameterized queries.
  • N+1 queries: Avoid running queries in loops. Use JOINs or batch operations instead.
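To make the injection pitfall concrete, here is a minimal sqlite3 sketch contrasting string concatenation with a placeholder (the ? style is SQLite's; PostgreSQL drivers typically use %s):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", (1, "alice"))

user_input = "alice' OR '1'='1"  # hostile "name"

# BAD: concatenation lets the input rewrite the WHERE clause
bad = conn.execute(
    "SELECT * FROM users WHERE name = '" + user_input + "'"
).fetchall()

# GOOD: the placeholder binds the input as a plain value
good = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()

print(bad)   # [(1, 'alice')] -- the injected OR matched every row
print(good)  # [] -- no user is literally named that
conn.close()
```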

Frequently Asked Questions

Which database should I use for data science?

PostgreSQL is the best choice for data science. It has excellent support for window functions, CTEs, JSON, and advanced analytics. SQLite is great for quick prototyping without server setup.

Should I filter and aggregate in SQL or in pandas?

SQL queries run on the database server, which handles billions of rows efficiently. Pandas loads data into memory, which is limited. Use SQL to filter and aggregate large datasets, then load the result into pandas for visualization and modeling.

Do I need to learn database design?

For data science, the focus is on querying (DQL/DML). But understanding table design, constraints, and data types helps you work more effectively with database engineers and understand data quality issues.

Where can I practice SQL?

Try SQLBolt, Mode Analytics SQL Tutorial, LeetCode SQL problems, or Kaggle SQL courses. For real practice, load public datasets into a PostgreSQL database and write queries to answer business questions.