
Tools & Libraries

The data science ecosystem includes powerful development environments, libraries, cloud platforms, and collaboration tools. This lesson covers the essential tools every data scientist should know.

Development Environments

📓 Jupyter Notebook / Lab

The standard interactive environment for data science. Combine code, output, and Markdown text in a single document. JupyterLab adds a full IDE-like interface with a file browser and terminals.

Google Colab

Free, cloud-based Jupyter notebooks with GPU/TPU access. No installation required — runs entirely in the browser. Great for learning and prototyping.

💻 VS Code

Full-featured code editor with Python and Jupyter extensions. Offers debugging, Git integration, and IntelliSense. Ideal for production-ready data science code.

📄 RStudio

Specialized IDE for R programming. Includes data viewer, plot pane, and package management. Popular in academic and biostatistics contexts.

Terminal
# Install JupyterLab
pip install jupyterlab

# Launch JupyterLab
jupyter lab

# Install VS Code Jupyter extension
code --install-extension ms-toolsai.jupyter

Core Python Libraries

| Library      | Purpose                                            | Install                    |
|--------------|----------------------------------------------------|----------------------------|
| Pandas       | Data manipulation and analysis with DataFrames     | pip install pandas         |
| NumPy        | Numerical computing with fast arrays               | pip install numpy          |
| Matplotlib   | Static, animated, and interactive visualizations   | pip install matplotlib     |
| Seaborn      | Statistical data visualization built on Matplotlib | pip install seaborn        |
| Scikit-learn | Machine learning algorithms and model evaluation   | pip install scikit-learn   |
| SciPy        | Scientific computing and statistics                | pip install scipy          |
| Plotly       | Interactive, web-based visualizations              | pip install plotly         |
| Statsmodels  | Statistical modeling and hypothesis testing        | pip install statsmodels    |
Quick setup: Install everything at once with pip install pandas numpy matplotlib seaborn scikit-learn scipy plotly statsmodels jupyterlab, or use Anaconda, which includes most of these by default.
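As a quick sanity check that the core stack works, here is a minimal sketch combining Pandas and NumPy (the employee data is invented for illustration):

```python
import numpy as np
import pandas as pd

# A small DataFrame of illustrative salary data
df = pd.DataFrame({
    "department": ["eng", "eng", "sales", "sales"],
    "salary": np.array([90000, 85000, 60000, 65000]),
})

# Pandas groupby mirrors SQL's GROUP BY: aggregate per department
summary = df.groupby("department")["salary"].agg(["mean", "max"])
print(summary)
```

If this runs without an import error and prints a per-department summary table, the core libraries are installed correctly.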

Deep Learning Libraries

| Library    | Best For                                 | Key Feature                                        |
|------------|------------------------------------------|----------------------------------------------------|
| TensorFlow | Production ML, mobile/web deployment     | TensorFlow Serving, TF Lite, comprehensive ecosystem |
| PyTorch    | Research, rapid prototyping              | Dynamic computation graphs, intuitive API          |
| Keras      | Quick model building (part of TensorFlow) | High-level API, easy to learn                     |

SQL Basics for Data Science

SQL (Structured Query Language) is essential for querying databases. Most real-world data lives in SQL databases.

SQL
-- Select specific columns with filtering
SELECT name, department, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;

-- Aggregation with GROUP BY
SELECT department,
       COUNT(*) AS num_employees,
       AVG(salary) AS avg_salary,
       MAX(salary) AS max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;

-- JOIN tables
SELECT e.name, d.department_name, e.salary
FROM employees e
JOIN departments d ON e.dept_id = d.id;
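In practice you often run these queries from Python. A minimal sketch using the standard-library sqlite3 module, with a toy in-memory employees table invented for illustration:

```python
import sqlite3

# In-memory database with an illustrative employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Engineering", 95000),
     ("Grace", "Engineering", 88000),
     ("Edgar", "Sales", 52000)],
)

# Filtering and ordering, as in the SELECT example above
rows = conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE salary > 50000 ORDER BY salary DESC"
).fetchall()
print(rows)  # [('Ada', 95000.0), ('Grace', 88000.0), ('Edgar', 52000.0)]
```

The same pattern scales to real databases: swap sqlite3 for a driver such as psycopg2 (PostgreSQL), or use pandas.read_sql to pull query results directly into a DataFrame.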

Cloud Platforms

Cloud platforms provide scalable computing, managed services, and collaborative tools for data science at scale.

| Platform     | Data Science Services          | Best For                         |
|--------------|--------------------------------|----------------------------------|
| AWS          | SageMaker, Redshift, S3, EMR   | Enterprise, widest service range |
| Google Cloud | BigQuery, Vertex AI, Dataflow  | Big data analytics, AI/ML        |
| Azure        | Azure ML, Synapse, Databricks  | Microsoft ecosystem integration  |

Version Control for Data Projects

Version control is not just for software engineers. Data scientists need it too — for code, configurations, and experiment tracking.

Terminal
# Initialize a Git repository
git init

# Create a .gitignore for data science
# Ignore large data files, credentials, and environment
echo "data/raw/
*.csv
*.parquet
.env
__pycache__/
.ipynb_checkpoints/" > .gitignore

# Track your notebooks and scripts
git add notebooks/ src/ requirements.txt
git commit -m "Add initial analysis notebooks"
💡 Data versioning: For tracking large datasets and model files, consider tools like DVC (Data Version Control), MLflow, or Weights & Biases, which are designed specifically for ML experiment tracking.