Hugging Face Datasets
Master the datasets library that powers the ML ecosystem. Learn to load from the Hub, process and transform data efficiently, stream massive datasets, and create your own datasets for sharing.
Your Learning Path
Follow these lessons in order, or jump to any topic that interests you.
1. Introduction
Overview of the datasets library, Arrow-backed storage, and the Hugging Face Hub ecosystem.
2. Loading Datasets
Load from the Hub, local files, pandas DataFrames, and custom data sources.
3. Processing
Map, filter, sort, shuffle, rename, concatenate, and batch-process datasets efficiently.
4. Streaming
Process datasets larger than memory with streaming mode, iterable datasets, and lazy loading.
5. Creating Datasets
Build custom datasets, define feature schemas, upload to the Hub, and share with the community.
6. Best Practices
Performance optimization, caching strategies, integration with training frameworks, and large-scale tips.
What You'll Learn
By the end of this course, you'll be able to:
Load Any Dataset
Access 100,000+ datasets from the Hugging Face Hub or load from any local file format.
Process Efficiently
Transform datasets with zero-copy operations, parallel processing, and memory-mapped storage.
Handle Large Data
Stream terabyte-scale datasets without downloading everything using iterable datasets.
Share Your Work
Create, document, and publish your own datasets to the Hugging Face Hub for the community.
Lilly Tech Systems