Introduction to R for Data Science
Understand why R is a top choice for data science, discover the tidyverse ecosystem, and set up your data science environment.
R for Data Science Overview
R is one of the two dominant languages in data science (alongside Python). What sets R apart is its deep integration with statistical methods and its world-class data visualization capabilities through ggplot2.
This course focuses on the tidyverse — a coherent collection of packages that share a common philosophy for data science in R. You will learn to import, tidy, transform, visualize, and communicate data effectively.
The Tidyverse Ecosystem
The tidyverse is a collection of R packages designed by Hadley Wickham and the team at Posit (formerly RStudio). The core packages include:
- ggplot2 — Data visualization using the grammar of graphics
- dplyr — Data manipulation (filter, select, mutate, summarise)
- tidyr — Data tidying (reshaping and cleaning)
- readr — Fast data import (CSV, TSV)
- purrr — Functional programming with lists and vectors
- tibble — Modern data frames
- stringr — String manipulation
- forcats — Factor (categorical data) handling
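To give a feel for how these packages fit together, here is a minimal sketch that builds a tibble and chains a few dplyr verbs. The `sales` data and its column names are invented for illustration, and the native pipe `|>` assumes R 4.1 or later.

```r
library(tibble)
library(dplyr)

# Hypothetical example data, invented for this illustration
sales <- tibble(
  region = c("North", "South", "North", "South"),
  units  = c(10, 4, 6, 12),
  price  = c(2.5, 3.0, 2.5, 3.0)
)

region_summary <- sales |>
  filter(units >= 5) |>                 # keep rows with at least 5 units
  mutate(revenue = units * price) |>    # add a derived column
  group_by(region) |>                   # one group per region
  summarise(total_revenue = sum(revenue), .groups = "drop")

print(region_summary)
```

With this data the result has two rows: North with a total revenue of 40 and South with 36, because the South row with 4 units is filtered out.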
Why R for Data Science?
- ggplot2: A layered grammar-of-graphics system, widely regarded as one of the most flexible tools for static visualization.
- dplyr: Verb-based syntax (filter, select, mutate, summarise) that reads close to plain English.
- Shiny: Build interactive web dashboards directly from R without knowing HTML/CSS/JS.
- R Markdown: Combine code, results, and narrative in reproducible documents.
- Statistical depth: New statistical methods are often released as R packages before they appear in other languages.
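As a rough illustration of the ggplot2 point above, the following sketch builds a plot layer by layer from the `mtcars` dataset that ships with R. The choice of variables and labels is only an example, not a recommended analysis.

```r
library(ggplot2)

# mtcars is a built-in dataset; wt is weight in 1000 lbs, mpg is miles per gallon
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                                      # layer 1: raw points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # layer 2: linear fit per group
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  )

print(p)
```

Each `+` adds one component (data mapping, geometry, labels), which is what "grammar of graphics" refers to: a plot is described as a combination of independent parts.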
R vs Python for Data Science
| Aspect | R | Python |
|---|---|---|
| Visualization | ggplot2 (especially strong for static plots) | matplotlib, seaborn, plotly |
| Data wrangling | dplyr + tidyr (very readable) | pandas (powerful, often more verbose) |
| Statistics | Very broad and deep (CRAN packages) | scipy, statsmodels |
| ML engineering | tidymodels, caret | scikit-learn, TensorFlow (larger ecosystem) |
| Dashboards | Shiny (easy, R-native) | Streamlit, Dash |
| Reporting | R Markdown, Quarto | Jupyter notebooks |
Setting Up Your Data Science Environment
    # Install the entire tidyverse
    install.packages("tidyverse")

    # Additional useful DS packages
    install.packages(c(
      "readxl",     # Excel files
      "janitor",    # Data cleaning helpers
      "skimr",      # Quick data summaries
      "lubridate",  # Dates and times
      "scales",     # Formatting for ggplot2
      "plotly",     # Interactive plots
      "DT"          # Interactive tables
    ))

    # Load the tidyverse
    library(tidyverse)
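After installing, it can help to confirm that everything is available. The following is one possible check, not part of the official setup. It uses base R's `requireNamespace()`, which returns `FALSE` rather than raising an error when a package is missing. The variable names are illustrative.

```r
# Package names taken from the install step above
pkgs <- c(
  "tidyverse", "readxl", "janitor", "skimr",
  "lubridate", "scales", "plotly", "DT"
)

# TRUE/FALSE per package, named by package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- names(installed)[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```

If anything is reported as missing, rerunning `install.packages()` for just those names is usually enough.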
Hadley Wickham and the Tidy Data Philosophy
Hadley Wickham is the Chief Scientist at Posit and the architect of the tidyverse. His 2014 paper "Tidy Data" established the principles that guide modern R data science:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
When data is in "tidy" format, it becomes dramatically easier to visualize, model, and transform. The entire tidyverse is built around this principle.
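The principles above can be made concrete with a small reshaping example. This is a minimal sketch using `tidyr::pivot_longer()`. The `wide` table and its values are invented for illustration, and the native pipe `|>` assumes R 4.1 or later.

```r
library(tibble)
library(tidyr)

# Hypothetical "wide" data: the variable "year" is hidden in the column names,
# so each row holds two observations instead of one
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(100, 200),
  `2020`  = c(110, 190)
)

# Reshape so each variable is a column and each observation is a row
tidy <- wide |>
  pivot_longer(
    cols      = c(`2019`, `2020`),
    names_to  = "year",
    values_to = "cases"
  )

print(tidy)
```

The result has four rows and three columns (`country`, `year`, `cases`), one row per country-year observation, which is the shape that dplyr and ggplot2 functions generally expect.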