Tools & Libraries
The data science ecosystem includes powerful development environments, libraries, cloud platforms, and collaboration tools. This lesson covers the essential tools every data scientist should know.
Development Environments
Jupyter Notebook / Lab
The standard interactive environment for data science. Notebooks combine code, output, and Markdown text in a single document. JupyterLab adds a full IDE-like interface with a file browser and terminals.
Google Colab
Free, cloud-based Jupyter notebooks with GPU/TPU access. No installation required — runs entirely in the browser. Great for learning and prototyping.
VS Code
Full-featured code editor with Python and Jupyter extensions. Offers debugging, Git integration, and IntelliSense. Ideal for production-ready data science code.
RStudio
Specialized IDE for R programming. Includes data viewer, plot pane, and package management. Popular in academic and biostatistics contexts.
```shell
# Install JupyterLab
pip install jupyterlab

# Launch JupyterLab
jupyter lab

# Install the VS Code Jupyter extension
code --install-extension ms-toolsai.jupyter
```
Core Python Libraries
| Library | Purpose | Install |
|---|---|---|
| Pandas | Data manipulation and analysis with DataFrames | pip install pandas |
| NumPy | Numerical computing with fast arrays | pip install numpy |
| Matplotlib | Static, animated, and interactive visualizations | pip install matplotlib |
| Seaborn | Statistical data visualization built on Matplotlib | pip install seaborn |
| Scikit-learn | Machine learning algorithms and model evaluation | pip install scikit-learn |
| SciPy | Scientific computing and statistics | pip install scipy |
| Plotly | Interactive, web-based visualizations | pip install plotly |
| Statsmodels | Statistical modeling and hypothesis testing | pip install statsmodels |
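As a quick illustration of how the first two libraries in the table work together, the sketch below builds a small Pandas DataFrame on top of NumPy arrays and computes grouped summary statistics; the data and column names are purely illustrative:

```python
import numpy as np
import pandas as pd

# A small, made-up employees table (illustrative data only)
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "department": ["eng", "eng", "sales", "sales", "hr"],
    "salary": [85000, 92000, 61000, 58000, 52000],
    "bonus": rng.integers(1000, 5000, size=5),
})

# Pandas: group and aggregate, much like a SQL GROUP BY
summary = df.groupby("department")["salary"].agg(["count", "mean", "max"])
print(summary)

# NumPy: fast vectorized math on the underlying arrays
total_comp = df["salary"].to_numpy() + df["bonus"].to_numpy()
print(total_comp.mean())
```

The same pattern scales from toy tables like this one to millions of rows, because both libraries push the heavy lifting into compiled array code.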
Install everything at once:

```shell
pip install pandas numpy matplotlib seaborn scikit-learn scipy plotly statsmodels jupyterlab
```

Or use Anaconda, which includes most of these by default.

Deep Learning Libraries
| Library | Best For | Key Feature |
|---|---|---|
| TensorFlow | Production ML, mobile/web deployment | TensorFlow Serving, TF Lite, comprehensive ecosystem |
| PyTorch | Research, rapid prototyping | Dynamic computation graphs, intuitive API |
| Keras | Quick model building (part of TensorFlow) | High-level API, easy to learn |
SQL Basics for Data Science
SQL (Structured Query Language) is essential for querying databases. Most real-world data lives in SQL databases.
```sql
-- Select specific columns with filtering
SELECT name, department, salary
FROM employees
WHERE salary > 50000
ORDER BY salary DESC;

-- Aggregation with GROUP BY
SELECT department,
       COUNT(*) AS num_employees,
       AVG(salary) AS avg_salary,
       MAX(salary) AS max_salary
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;

-- JOIN tables
SELECT e.name, d.department_name, e.salary
FROM employees e
JOIN departments d ON e.dept_id = d.id;
```
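From Python, the same kinds of queries can be run against an embedded SQLite database using the standard-library `sqlite3` module. The sketch below mirrors the hypothetical `employees` examples above with a tiny in-memory table:

```python
import sqlite3

# In-memory database with a small, illustrative employees table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ada", "eng", 95000), ("Grace", "eng", 88000), ("Alan", "sales", 45000)],
)

# Filtering and ordering, as in the SELECT example above
rows = conn.execute(
    "SELECT name, department, salary FROM employees "
    "WHERE salary > 50000 ORDER BY salary DESC"
).fetchall()
print(rows)

# Aggregation with GROUP BY
stats = conn.execute(
    "SELECT department, COUNT(*), AVG(salary) FROM employees GROUP BY department"
).fetchall()
print(stats)
conn.close()
```

For production databases (PostgreSQL, MySQL, etc.) the pattern is the same, only the driver changes; Pandas can also read query results directly into a DataFrame with `pandas.read_sql`.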
Cloud Platforms
Cloud platforms provide scalable computing, managed services, and collaborative tools for data science at scale.
| Platform | Data Science Services | Best For |
|---|---|---|
| AWS | SageMaker, Redshift, S3, EMR | Enterprise, widest service range |
| Google Cloud | BigQuery, Vertex AI, Dataflow | Big data analytics, AI/ML |
| Azure | Azure ML, Synapse, Databricks | Microsoft ecosystem integration |
Version Control for Data Projects
Version control is not just for software engineers. Data scientists need it too — for code, configurations, and experiment tracking.
```shell
# Initialize a Git repository
git init

# Create a .gitignore for data science:
# ignore large data files, credentials, and environment artifacts
echo "data/raw/
*.csv
*.parquet
.env
__pycache__/
.ipynb_checkpoints/" > .gitignore

# Track your notebooks and scripts
git add notebooks/ src/ requirements.txt
git commit -m "Add initial analysis notebooks"
```