Beginner

Introduction to R for Data Science

Understand why R is a top choice for data science, discover the tidyverse ecosystem, and set up your data science environment.

R for Data Science Overview

R is one of the two dominant languages in data science (alongside Python). What sets R apart is its deep integration with statistical methods and its world-class data visualization capabilities through ggplot2.

This course focuses on the tidyverse — a coherent collection of packages that share a common philosophy for data science in R. You will learn to import, tidy, transform, visualize, and communicate data effectively.

The Tidyverse Ecosystem

The tidyverse is a collection of R packages designed by Hadley Wickham and the team at Posit (formerly RStudio). The core packages include:

ggplot2 — Data visualization using the grammar of graphics
dplyr — Data manipulation (filter, select, mutate, summarise)
tidyr — Data tidying (reshaping and cleaning)
readr — Fast data import (CSV, TSV)
purrr — Functional programming with lists and vectors
tibble — Modern data frames
stringr — String manipulation
forcats — Factor (categorical data) handling
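
As a rough illustration of how these packages fit together, the sketch below builds a small tibble and runs the four dplyr verbs named in the list above. The sales table, its column names, and its values are made up for this example. `%>%` is the pipe operator that dplyr re-exports.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data, invented for illustration only
sales <- tibble(
  region = c("north", "south", "north", "east", "south"),
  units  = c(10, 4, 7, 12, 9),
  price  = c(2.5, 3.0, 2.5, 4.0, 3.0)
)

summary_tbl <- sales %>%
  filter(units > 5) %>%                  # keep rows with more than 5 units
  mutate(revenue = units * price) %>%    # add a derived column
  select(region, revenue) %>%            # keep only the columns we need
  group_by(region) %>%
  summarise(total_revenue = sum(revenue), .groups = "drop") %>%
  arrange(desc(total_revenue))           # largest total first

print(summary_tbl)
# Rows, in order: east 48, north 42.5, south 27
```

Each step takes a data frame and returns a data frame, which is why the verbs chain together so easily.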

Why R for Data Science?

ggplot2: A powerful, flexible visualization system built on the grammar of graphics, and one of the most widely used plotting libraries in any language.
dplyr: Readable syntax for data manipulation, built from verbs such as filter, select, and mutate that read close to plain English.
Shiny: Build interactive web dashboards directly from R, without having to write HTML, CSS, or JavaScript.
R Markdown: Combine code, results, and narrative in reproducible documents.
Statistical depth: New statistical methods are often released as R packages first, before they are ported to other languages.
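
To give a feel for the grammar-of-graphics style, here is a minimal sketch of a ggplot2 scatter plot. It uses `mpg`, an example dataset that ships with ggplot2. The choice of columns and the axis labels are assumptions made for this illustration.

```r
library(ggplot2)

# Map columns of the data to visual properties, then add a layer of points
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    colour = "Vehicle class"
  )

print(p)  # draws the plot on the active graphics device
```

The plot is built by adding pieces with `+`, so swapping `geom_point()` for another geom changes the chart type without touching the data mapping.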

R vs Python for Data Science

Aspect	R	Python
Visualization	ggplot2 (superior for static plots)	matplotlib, seaborn, plotly
Data wrangling	dplyr + tidyr (very readable)	pandas (powerful but verbose)
Statistics	Unmatched depth and breadth	scipy, statsmodels
ML engineering	tidymodels, caret	scikit-learn, TensorFlow (larger ecosystem)
Dashboards	Shiny (easy, R-native)	Streamlit, Dash
Reporting	R Markdown, Quarto	Jupyter notebooks

Setting Up Your Data Science Environment

# Install the entire tidyverse
install.packages("tidyverse")

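# Additional useful data science packages.
# readxl, lubridate, and scales are already installed by the tidyverse
# install above; listing them again is harmless.
install.packages(c(
  "readxl",       # Excel files (not attached by library(tidyverse))
  "janitor",      # Data cleaning helpers
  "skimr",        # Quick data summaries
  "lubridate",    # Dates and times (core tidyverse since version 2.0.0)
  "scales",       # Formatting for ggplot2
  "plotly",       # Interactive plots
  "DT"            # Interactive tables
))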

# Load the tidyverse
library(tidyverse)
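
One way to confirm that the installs succeeded is to ask R whether each package can be found. This is an optional sketch rather than part of the setup itself; the variable names `wanted`, `installed`, and `missing_pkgs` are illustrative.

```r
# Packages named in the setup code above
wanted <- c("tidyverse", "readxl", "janitor", "skimr",
            "lubridate", "scales", "plotly", "DT")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- names(installed)[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```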

Hadley Wickham and the Tidy Data Philosophy

Hadley Wickham is Chief Scientist at Posit and the architect of the tidyverse. His 2014 paper "Tidy Data", published in the Journal of Statistical Software, set out the principles that guide modern R data science:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

When data is in "tidy" format, it becomes much easier to visualize, model, and transform, because every tool can rely on the same layout: columns are variables and rows are observations. The tidyverse packages are designed around this principle.
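
The sketch below shows what tidying can look like in practice, using `pivot_longer()` from tidyr. The `wide` table and its values are hypothetical. It stores one column per year, so the variable "year" is hidden in the column names instead of having a column of its own.

```r
library(tibble)
library(tidyr)

# Hypothetical "wide" table: the years are column names, not values
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(100, 200),
  `2020`  = c(110, 190)
)

# Gather every column except country into year/value pairs,
# so that each row is one observation (one country in one year)
tidy <- pivot_longer(
  wide,
  cols      = -country,
  names_to  = "year",
  values_to = "value"
)

print(tidy)
```

After reshaping, the table has one row per country and year, with `year` and `value` as ordinary columns that dplyr and ggplot2 can use directly. Note that `year` comes out as a character column here, since it was taken from the column names.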


Prerequisites: This course assumes you have completed the Basics of R course or have equivalent experience with R fundamentals (variables, functions, data structures).
