Introduction to ML Datasets
In machine learning, data is everything. The quality, quantity, and relevance of your dataset determine the upper bound of your model's performance. A great algorithm trained on poor data will usually lose to a simple algorithm trained on great data.
Why Datasets Matter
The famous saying in ML is: "Garbage in, garbage out." Your model can only learn patterns that exist in the data. Choosing or creating the right dataset is often the most impactful decision in any ML project.
Types of Datasets
| Type | Description | Examples |
|---|---|---|
| Labeled | Each sample has a known target/answer | MNIST (digit label), IMDB (sentiment) |
| Unlabeled | Raw data without annotations | CommonCrawl, raw images |
| Structured | Organized in rows and columns (tabular) | CSV files, SQL databases |
| Unstructured | Images, text, audio, video | ImageNet, Wikipedia text |
| Semi-supervised | Mix of labeled and unlabeled | Small labeled set + large unlabeled pool |
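As a rough sketch, the labeled, unlabeled, and semi-supervised types above map naturally onto simple Python structures (the feature values and class names here are made up for illustration):

```python
# Labeled: each sample pairs features with a known target
labeled = [([5.1, 3.5], "setosa"), ([6.2, 2.9], "versicolor")]

# Unlabeled: features only, no annotations
unlabeled = [[5.9, 3.0], [4.8, 3.1]]

# Semi-supervised: a small labeled set plus a much larger unlabeled pool
semi_supervised = {"labeled": labeled, "unlabeled": unlabeled * 50}

print(len(semi_supervised["labeled"]), len(semi_supervised["unlabeled"]))
```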
Dataset Splits
Every ML dataset should be split into three subsets:
| Split | Purpose | Typical Size |
|---|---|---|
| Training set | Model learns from this data | 70-80% |
| Validation set | Tune hyperparameters and select the best model | 10-15% |
| Test set | Final evaluation — only used once | 10-15% |
For example, using scikit-learn's `train_test_split` twice:

```python
from sklearn.model_selection import train_test_split

# Split data: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
```
Data Leakage
Data leakage occurs when information from outside the training set is used during model training. This leads to overly optimistic performance metrics that don't reflect real-world performance. Common causes include:
- Fitting preprocessing (scaling, encoding) on the full dataset instead of just the training set
- Using future data to predict past events (temporal leakage)
- Having duplicate or near-duplicate samples across train/test splits
- Including the target variable (or a proxy) as a feature
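The first cause above (fitting preprocessing on the full dataset) is the easiest to prevent: wrap the preprocessing step in a scikit-learn `Pipeline`, so the scaler is fitted only on the training data. A minimal sketch on synthetic data (the dataset and model choice here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# WRONG (leaks test-set statistics into training):
#   scaler = StandardScaler().fit(X)          # fitted on ALL data
#   X_train_scaled = scaler.transform(X_train)

# RIGHT: the pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, cross-validation with this `model` also refits the scaler per fold, avoiding leakage automatically.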
Bias in Datasets
Datasets can contain biases that lead to unfair or harmful model behavior:
- Selection bias: Dataset doesn't represent the real-world population
- Label bias: Annotations reflect the annotator's biases
- Historical bias: Data reflects past societal inequalities
- Measurement bias: Data collection methods favor certain groups
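Selection bias, at least, can often be caught with a simple proportion check before training: compare group frequencies in your collected sample against known real-world shares. A small sketch (both the groups and the numbers here are hypothetical):

```python
from collections import Counter

# Assumed real-world shares (hypothetical census figures)
population = {"urban": 0.55, "rural": 0.45}

# Collected dataset: heavily skewed toward urban samples
sample = ["urban"] * 90 + ["rural"] * 10

counts = Counter(sample)
total = len(sample)
for group, expected in population.items():
    observed = counts[group] / total
    print(f"{group}: observed {observed:.2f} vs expected {expected:.2f}")
```

A large gap between observed and expected shares is a signal to re-sample, re-weight, or collect more data before trusting the model on the under-represented group.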
Dataset Size Guidelines
| Task | Minimum Samples | Good Size |
|---|---|---|
| Tabular classification | 100-500 | 1,000-10,000 |
| Image classification | 100/class | 1,000+/class |
| Object detection | 500 | 5,000+ |
| Text classification | 100/class | 1,000+/class |
| Fine-tuning LLMs | 100 | 1,000-10,000 |
| Pre-training LLMs | 1B+ tokens | 1T+ tokens |
Licensing and Ethics
Always check the license of a dataset before using it. Common dataset licenses include:
- CC0 (Public Domain): Free for any use
- CC-BY: Free with attribution
- CC-BY-NC: Non-commercial use only
- Custom/Research: Academic use only, check terms
Ready to Explore Datasets?
Let's start with the classic datasets that every ML practitioner should know.