Best Practices
Write efficient, readable SQL and connect databases to Python for seamless data science workflows.
SQL Style Guide
- Use UPPERCASE for SQL keywords: `SELECT`, `FROM`, `WHERE`.
- Use snake_case for table and column names: `order_items`, `user_id`.
- Put each major clause on its own line: `SELECT`, `FROM`, `WHERE`, `GROUP BY`.
- Indent sub-clauses and columns for readability.
- Use meaningful table aliases: `orders o`, `users u`.
- Always use explicit `JOIN` syntax instead of comma-separated tables.
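A minimal sketch applying these conventions, run against an in-memory SQLite database; the `users` and `orders` tables and their contents are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 15.0);
""")

# Uppercase keywords, snake_case names, one clause per line,
# meaningful aliases, and an explicit JOIN with its ON condition.
query = """
    SELECT
        u.name,
        SUM(o.total) AS total_spent
    FROM users u
    JOIN orders o
        ON u.id = o.user_id
    GROUP BY u.name
    ORDER BY total_spent DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The same query written on one line with comma-separated tables would run identically; the layout exists purely for the next person who has to read it.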
Query Optimization
Select Only Needed Columns

Avoid `SELECT *`. Specify only the columns you need to reduce data transfer and memory usage.

Filter Early

Apply WHERE conditions to reduce the number of rows processed before JOINs and aggregations.

Use EXPLAIN

Run `EXPLAIN ANALYZE` before your query to see the execution plan and identify bottlenecks.

Index Wisely

Create indexes on columns used in WHERE, JOIN, and ORDER BY clauses. But don't over-index: each index slows writes.
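As a sketch of the payoff, SQLite's `EXPLAIN QUERY PLAN` (a lighter analogue of PostgreSQL's `EXPLAIN ANALYZE`) shows the planner switching from a full table scan to an index search once the filtered column is indexed. The table, data, and index name here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders (user_id, order_date) VALUES (?, ?)",
    [(i % 100, f"2024-{(i % 12) + 1:02d}-01") for i in range(1000)],
)

# Without an index, the planner must scan every row of the table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM orders WHERE user_id = 7"
).fetchall()

# Index the column used in the WHERE clause, then ask again.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM orders WHERE user_id = 7"
).fetchall()

print(plan_before[0][-1])  # a SCAN step: full table scan
print(plan_after[0][-1])   # a SEARCH step using the new index
```

The exact wording of the plan varies by SQLite version, but the scan-versus-search distinction is the thing to look for.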
Understanding EXPLAIN
```sql
EXPLAIN ANALYZE
SELECT
    u.name,
    COUNT(*) AS order_count
FROM users u
JOIN orders o
    ON u.id = o.user_id
WHERE o.order_date >= '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC
LIMIT 10;
```
Connecting SQL to Python
```python
import pandas as pd
from sqlalchemy import create_engine

# Create database connection
engine = create_engine("postgresql://user:pass@localhost:5432/mydb")

# Read SQL query into a DataFrame
query = """
    SELECT
        category,
        AVG(price) AS avg_price,
        COUNT(*) AS count
    FROM products
    GROUP BY category
    HAVING COUNT(*) > 5
    ORDER BY avg_price DESC
"""
df = pd.read_sql(query, engine)
print(df.head())

# Write DataFrame back to database
df.to_sql("summary_table", engine, if_exists="replace", index=False)
```
```python
import sqlite3
import pandas as pd

# Connect to SQLite file
conn = sqlite3.connect("data.db")

# Read into DataFrame
df = pd.read_sql("SELECT * FROM sales WHERE year = 2024", conn)

# Always close the connection
conn.close()
```
Common Pitfalls
- NULL comparisons: Use `IS NULL` / `IS NOT NULL`, never `= NULL`.
- Cartesian products: Always include a JOIN condition to avoid accidental cross joins.
- GROUP BY errors: Every non-aggregated column in SELECT must be in GROUP BY.
- Integer division: In some databases, `5/2 = 2`. Cast to float: `5.0/2` or `CAST(5 AS FLOAT)/2`.
- SQL injection: Never concatenate user input into SQL strings. Use parameterized queries.
- N+1 queries: Avoid running queries in loops. Use JOINs or batch operations instead.
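The injection pitfall can be sketched in a few lines with sqlite3; the `users` table and the payload string are hypothetical, but the pattern (placeholders instead of string concatenation) is the same in every driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")
conn.execute("INSERT INTO users (name) VALUES ('Grace')")

user_input = "Ada' OR '1'='1"  # a classic injection payload

# Unsafe: concatenation lets the payload rewrite the query's logic,
# so the WHERE clause becomes always-true and every row comes back.
unsafe = f"SELECT name FROM users WHERE name = '{user_input}'"
leaked = conn.execute(unsafe).fetchall()

# Safe: the driver sends the value separately from the SQL text,
# so the payload is matched as a literal string and finds nothing.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(leaked), len(safe_rows))
```

The `?` placeholder syntax is sqlite3's; other drivers use `%s` or named parameters, but none of them require quoting or escaping by hand.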
Frequently Asked Questions
**Which database should I learn for data science?**

PostgreSQL is the best choice for data science. It has excellent support for window functions, CTEs, JSON, and advanced analytics. SQLite is great for quick prototyping without server setup.
**When should I use SQL instead of pandas?**

SQL queries run on the database server, which handles billions of rows efficiently. Pandas loads data into memory, which is limited. Use SQL to filter and aggregate large datasets, then load the result into pandas for visualization and modeling.
**Do I need to learn database design?**

For data science, the focus is on querying (DQL/DML). But understanding table design, constraints, and data types helps you work more effectively with database engineers and understand data quality issues.
**Where can I practice SQL?**

Try SQLBolt, Mode Analytics SQL Tutorial, LeetCode SQL problems, or Kaggle SQL courses. For real practice, load public datasets into a PostgreSQL database and write queries to answer business questions.