Introduction to ML Datasets

In machine learning, data is everything. The quality, quantity, and relevance of your dataset determine the upper bound of your model's performance. A great algorithm trained on poor data will always lose to a simple algorithm trained on great data.

Why Datasets Matter

The famous saying in ML is: "Garbage in, garbage out." Your model can only learn patterns that exist in the data. Choosing or creating the right dataset is often the most impactful decision in any ML project.

Key Insight: Most ML research breakthroughs come from better datasets and data practices, not just better algorithms. ImageNet enabled modern deep learning. The Pile enabled better LLMs. COCO enabled modern object detection.

Types of Datasets

| Type | Description | Examples |
|---|---|---|
| Labeled | Each sample has a known target/answer | MNIST (digit labels), IMDB (sentiment) |
| Unlabeled | Raw data without annotations | CommonCrawl, raw images |
| Structured | Organized in rows and columns (tabular) | CSV files, SQL databases |
| Unstructured | Images, text, audio, video | ImageNet, Wikipedia text |
| Semi-supervised | Mix of labeled and unlabeled | Small labeled set + large unlabeled pool |

Dataset Splits

Every ML dataset should be split into three subsets:

| Split | Purpose | Typical Size |
|---|---|---|
| Training set | Model learns from this data | 70-80% |
| Validation set | Tune hyperparameters and select the best model | 10-15% |
| Test set | Final evaluation; used only once | 10-15% |
Python
from sklearn.model_selection import train_test_split

# Split data: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
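For classification tasks, it is often worth stratifying the split so that class proportions are preserved in every subset. A minimal sketch using the same sklearn helper (the toy data and 80/20 class ratio are illustrative):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20  # imbalanced toy labels

# stratify=y keeps the 80/20 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Without `stratify`, a random split of a small or imbalanced dataset can leave a minority class under-represented, or entirely absent, in one of the subsets.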

Data Leakage

Data leakage occurs when information that would not be available at prediction time, such as test-set statistics or future values, influences model training. This leads to overly optimistic performance metrics that don't reflect real-world performance.

Common Leakage Mistakes:
  • Fitting preprocessing (scaling, encoding) on the full dataset instead of just the training set
  • Using future data to predict past events (temporal leakage)
  • Having duplicate or near-duplicate samples across train/test splits
  • Including the target variable (or a proxy) as a feature
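To avoid the first mistake above, scaling statistics should be computed from the training split alone and then applied unchanged to the test split. A minimal NumPy sketch (toy data; variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(5.0, 2.0, size=(80, 3))  # toy training features
X_test = rng.normal(5.0, 2.0, size=(20, 3))   # toy test features

# Correct: fit the scaling statistics on the training set only,
# then apply those same statistics to the test set.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma  # no peeking at test statistics
```

Fitting `mu` and `sigma` on the full dataset instead would let test-set information leak into preprocessing, inflating evaluation metrics.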

Bias in Datasets

Datasets can contain biases that lead to unfair or harmful model behavior:

  • Selection bias: Dataset doesn't represent the real-world population
  • Label bias: Annotations reflect the annotator's biases
  • Historical bias: Data reflects past societal inequalities
  • Measurement bias: Data collection methods favor certain groups
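Selection bias can sometimes be caught with a simple audit: compare each group's share of the dataset against its known share of the real-world population. A stdlib-only sketch (toy data; the function name is ours):

```python
from collections import Counter

def proportion_gap(sample_labels, population_props):
    """Difference between each group's share of the dataset and its
    share of the population; large gaps hint at selection bias."""
    counts = Counter(sample_labels)
    total = sum(counts.values())
    return {g: counts.get(g, 0) / total - p for g, p in population_props.items()}

# Toy example: the population is 50/50, but the dataset is 80/20
gaps = proportion_gap(["a"] * 80 + ["b"] * 20, {"a": 0.5, "b": 0.5})
```

A gap near zero for every group is not proof of an unbiased dataset, but a large gap is a clear signal that the sample does not represent the population.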

Dataset Size Guidelines

| Task | Minimum Samples | Good Size |
|---|---|---|
| Tabular classification | 100-500 | 1,000-10,000 |
| Image classification | 100/class | 1,000+/class |
| Object detection | 500 | 5,000+ |
| Text classification | 100/class | 1,000+/class |
| Fine-tuning LLMs | 100 | 1,000-10,000 |
| Pre-training LLMs | 1B+ tokens | 1T+ tokens |
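A quick way to apply per-class guidelines like those above is to count samples per class and compare against a chosen threshold. A small stdlib sketch (the 100-sample floor mirrors the table; the function name is ours):

```python
from collections import Counter

def enough_samples(labels, min_per_class=100):
    """Return, per class, whether it meets a minimum sample count."""
    counts = Counter(labels)
    return {cls: n >= min_per_class for cls, n in counts.items()}

labels = ["cat"] * 150 + ["dog"] * 40  # toy imbalanced label list
print(enough_samples(labels))  # {'cat': True, 'dog': False}
```

Running a check like this before training surfaces under-represented classes early, when collecting more data is still an option.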

Licensing and Ethics

Always check the license of a dataset before using it. Common dataset licenses include:

  • CC0 (Public Domain): Free for any use
  • CC-BY: Free with attribution
  • CC-BY-NC: Non-commercial use only
  • Custom/Research: Academic use only, check terms

Ready to Explore Datasets?

Let's start with the classic datasets that every ML practitioner should know.

Next: Classic Datasets →