Introduction to ML Datasets
In machine learning, data is everything. The quality, quantity, and relevance of your dataset determine the upper bound of your model's performance. A great algorithm trained on poor data will usually lose to a simple algorithm trained on great data.
Why Datasets Matter
The famous saying in ML is: "Garbage in, garbage out." Your model can only learn patterns that exist in the data. Choosing or creating the right dataset is often the most impactful decision in any ML project.
Types of Datasets
| Type | Description | Examples |
|---|---|---|
| Labeled | Each sample has a known target/answer | MNIST (digit label), IMDB (sentiment) |
| Unlabeled | Raw data without annotations | CommonCrawl, raw images |
| Structured | Organized in rows and columns (tabular) | CSV files, SQL databases |
| Unstructured | Images, text, audio, video | ImageNet, Wikipedia text |
| Semi-supervised | Mix of labeled and unlabeled | Small labeled set + large unlabeled pool |
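As a rough sketch, the labeled, unlabeled, and semi-supervised types above map naturally onto simple Python structures (the feature values and class names here are made up for illustration):

```python
# Labeled: each sample pairs features with a known target
labeled = [([5.1, 3.5], "setosa"), ([6.2, 2.9], "versicolor")]

# Unlabeled: features only, no annotations
unlabeled = [[5.9, 3.0], [4.8, 3.1]]

# Semi-supervised: a small labeled set plus a much larger unlabeled pool
semi_supervised = {"labeled": labeled, "unlabeled": unlabeled * 50}

print(len(semi_supervised["labeled"]), len(semi_supervised["unlabeled"]))
```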
Dataset Splits
Every ML dataset should be split into three subsets:
| Split | Purpose | Typical Size |
|---|---|---|
| Training set | Model learns from this data | 70-80% |
| Validation set | Tune hyperparameters and select the best model | 10-15% |
| Test set | Final evaluation — only used once | 10-15% |
For example, using scikit-learn's `train_test_split` twice:

```python
from sklearn.model_selection import train_test_split

# Split data: 80% train, 10% val, 10% test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
```
Data Leakage
Data leakage occurs when information from outside the training set is used during model training. This leads to overly optimistic performance metrics that don't reflect real-world performance. Common causes include:
- Fitting preprocessing (scaling, encoding) on the full dataset instead of just the training set
- Using future data to predict past events (temporal leakage)
- Having duplicate or near-duplicate samples across train/test splits
- Including the target variable (or a proxy) as a feature
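The first cause above (fitting preprocessing on the full dataset) is the easiest to prevent: wrap the preprocessing step in a scikit-learn `Pipeline`, so the scaler is fitted only on the training data. A minimal sketch on synthetic data (the dataset and model choice here are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# WRONG (leaks test-set statistics into training):
#   scaler = StandardScaler().fit(X)          # fitted on ALL data
#   X_train_scaled = scaler.transform(X_train)

# RIGHT: the pipeline fits the scaler on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, cross-validation with this `model` also refits the scaler per fold, avoiding leakage automatically.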
Bias in Datasets
Datasets can contain biases that lead to unfair or harmful model behavior:
- Selection bias: Dataset doesn't represent the real-world population
- Label bias: Annotations reflect the annotator's biases
- Historical bias: Data reflects past societal inequalities
- Measurement bias: Data collection methods favor certain groups
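Selection bias, at least, can often be caught with a simple proportion check before training: compare group frequencies in your collected sample against known real-world shares. A small sketch (both the groups and the numbers here are hypothetical):

```python
from collections import Counter

# Assumed real-world shares (hypothetical census figures)
population = {"urban": 0.55, "rural": 0.45}

# Collected dataset: heavily skewed toward urban samples
sample = ["urban"] * 90 + ["rural"] * 10

counts = Counter(sample)
total = len(sample)
for group, expected in population.items():
    observed = counts[group] / total
    print(f"{group}: observed {observed:.2f} vs expected {expected:.2f}")
```

A large gap between observed and expected shares is a signal to re-sample, re-weight, or collect more data before trusting the model on the under-represented group.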
Dataset Size Guidelines
| Task | Minimum Samples | Good Size |
|---|---|---|
| Tabular classification | 100-500 | 1,000-10,000 |
| Image classification | 100/class | 1,000+/class |
| Object detection | 500 | 5,000+ |
| Text classification | 100/class | 1,000+/class |
| Fine-tuning LLMs | 100 | 1,000-10,000 |
| Pre-training LLMs | 1B+ tokens | 1T+ tokens |
Licensing and Ethics
Always check the license of a dataset before using it. Common dataset licenses include:
- CC0 (Public Domain): Free for any use
- CC-BY: Free with attribution
- CC-BY-NC: Non-commercial use only
- Custom/Research: Academic use only, check terms
Ready to Explore Datasets?
Let's start with the classic datasets that every ML practitioner should know.