
Tools & Libraries

The data science ecosystem includes powerful development environments, libraries, cloud platforms, and collaboration tools. This lesson covers the essential tools every data scientist should know.

Development Environments

📓 Jupyter Notebook / Lab

The standard interactive environment for data science. Combine code, output, and Markdown text in a single document. JupyterLab adds a full IDE-like interface with a file browser and terminals.

Google Colab

Free, cloud-based Jupyter notebooks with GPU/TPU access. No installation required — runs entirely in the browser. Great for learning and prototyping.

💻 VS Code

Full-featured code editor with Python and Jupyter extensions. Offers debugging, Git integration, and IntelliSense. Ideal for production-ready data science code.

📄 RStudio

Specialized IDE for R programming. Includes data viewer, plot pane, and package management. Popular in academic and biostatistics contexts.

Terminal
# Install JupyterLab
pip install jupyterlab

# Launch JupyterLab
jupyter lab

# Install VS Code Jupyter extension
code --install-extension ms-toolsai.jupyter

Core Python Libraries

| Library      | Purpose                                            | Install                    |
|--------------|----------------------------------------------------|----------------------------|
| Pandas       | Data manipulation and analysis with DataFrames     | pip install pandas         |
| NumPy        | Numerical computing with fast arrays               | pip install numpy          |
| Matplotlib   | Static, animated, and interactive visualizations   | pip install matplotlib     |
| Seaborn      | Statistical data visualization built on Matplotlib | pip install seaborn        |
| Scikit-learn | Machine learning algorithms and model evaluation   | pip install scikit-learn   |
| SciPy        | Scientific computing and statistics                | pip install scipy          |
| Plotly       | Interactive, web-based visualizations              | pip install plotly         |
| Statsmodels  | Statistical modeling and hypothesis testing        | pip install statsmodels    |
Quick setup: Install everything at once with pip install pandas numpy matplotlib seaborn scikit-learn scipy plotly statsmodels jupyterlab, or use Anaconda, which includes most of these by default.
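As a quick sanity check that the core stack works, here is a minimal sketch combining Pandas and NumPy (the employee data is invented for illustration):

```python
import numpy as np
import pandas as pd

# A small DataFrame of illustrative salary data
df = pd.DataFrame({
    "department": ["eng", "eng", "sales", "sales"],
    "salary": np.array([90000, 85000, 60000, 65000]),
})

# Pandas groupby mirrors SQL's GROUP BY: aggregate per department
summary = df.groupby("department")["salary"].agg(["mean", "max"])
print(summary)
```

If this runs without an import error and prints a per-department summary table, the core libraries are installed correctly.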

Deep Learning Libraries

| Library    | Best For                                 | Key Feature                                        |
|------------|------------------------------------------|----------------------------------------------------|
| TensorFlow | Production ML, mobile/web deployment     | TensorFlow Serving, TF Lite, comprehensive ecosystem |
| PyTorch    | Research, rapid prototyping              | Dynamic computation graphs, intuitive API          |
| Keras      | Quick model building (part of TensorFlow) | High-level API, easy to learn                     |

SQL Basics for Data Science

SQL (Structured Query Language) is essential for querying databases. Most real-world data lives in SQL databases.

SQL
-- Select specific columns with filtering
SELECT name, department, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;

-- Aggregation with GROUP BY
SELECT department,
       COUNT(*) AS num_employees,
       AVG(salary) AS avg_salary,
       MAX(salary) AS max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;

-- JOIN tables
SELECT e.name, d.department_name, e.salary
FROM employees e
JOIN departments d ON e.dept_id = d.id;
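In practice you often run these queries from Python. A minimal sketch using the standard-library sqlite3 module, with a toy in-memory employees table invented for illustration:

```python
import sqlite3

# In-memory database with an illustrative employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "Engineering", 95000),
     ("Grace", "Engineering", 88000),
     ("Edgar", "Sales", 52000)],
)

# Filtering and ordering, as in the SELECT example above
rows = conn.execute(
    "SELECT name, salary FROM employees "
    "WHERE salary > 50000 ORDER BY salary DESC"
).fetchall()
print(rows)  # [('Ada', 95000.0), ('Grace', 88000.0), ('Edgar', 52000.0)]
```

The same pattern scales to real databases: swap sqlite3 for a driver such as psycopg2 (PostgreSQL), or use pandas.read_sql to pull query results directly into a DataFrame.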

Cloud Platforms

Cloud platforms provide scalable computing, managed services, and collaborative tools for data science at scale.

| Platform     | Data Science Services          | Best For                         |
|--------------|--------------------------------|----------------------------------|
| AWS          | SageMaker, Redshift, S3, EMR   | Enterprise, widest service range |
| Google Cloud | BigQuery, Vertex AI, Dataflow  | Big data analytics, AI/ML        |
| Azure        | Azure ML, Synapse, Databricks  | Microsoft ecosystem integration  |

Version Control for Data Projects

Version control is not just for software engineers. Data scientists need it too — for code, configurations, and experiment tracking.

Terminal
# Initialize a Git repository
git init

# Create a .gitignore for data science
# Ignore large data files, credentials, and environment
echo "data/raw/
*.csv
*.parquet
.env
__pycache__/
.ipynb_checkpoints/" > .gitignore

# Track your notebooks and scripts
git add notebooks/ src/ requirements.txt
git commit -m "Add initial analysis notebooks"
💡 Data versioning: For tracking large datasets and model files, consider tools like DVC (Data Version Control), MLflow, or Weights & Biases, which are designed specifically for ML experiment tracking.