NLP Datasets (Intermediate)
Essential datasets for natural language processing, from benchmark suites and question answering to sentiment analysis, named entity recognition, and LLM evaluation.
GLUE / SuperGLUE (Benchmarks)
GLUE (General Language Understanding Evaluation) is a collection of 9 NLP tasks used to benchmark language models. SuperGLUE is a harder follow-up. Tasks include sentiment analysis, textual entailment, semantic similarity, and question answering.
```python
from datasets import load_dataset

# Load a GLUE task (e.g., SST-2 for sentiment)
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])
# {'sentence': 'hide new secretions from...', 'label': 0, 'idx': 0}
```
SQuAD (Question Answering)
The Stanford Question Answering Dataset: 100,000+ question-answer pairs. Given a passage from Wikipedia, answer questions by extracting spans of text. SQuAD 2.0 adds unanswerable questions, so models must also learn to abstain.
```python
from datasets import load_dataset

squad = load_dataset("squad_v2")
```
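In SQuAD 2.0, unanswerable questions are marked by empty answer lists. A minimal sketch of detecting them; the example dict mirrors the dataset schema, but its content is illustrative, not an actual dataset record:

```python
# An unanswerable SQuAD 2.0-style example: the "answers" field
# has empty "text" and "answer_start" lists.
example = {
    "question": "What color is the sky on Mars?",
    "context": "Rayleigh scattering makes Earth's sky appear blue...",
    "answers": {"text": [], "answer_start": []},
}

# A question is unanswerable when no answer spans are annotated.
is_unanswerable = len(example["answers"]["text"]) == 0
```

Answerable examples instead carry one or more answer strings together with their character offsets into the context.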
IMDB Reviews (Sentiment)
50,000 movie reviews, binary sentiment (positive/negative). The go-to dataset for sentiment analysis. 25K train + 25K test, evenly balanced.
```python
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb["train"][0]["text"][:100])  # Review text
print(imdb["train"][0]["label"])       # 0=negative, 1=positive
```
AG News (Classification)
120,000 news articles, 4 classes. Categories: World, Sports, Business, Sci/Tech. A standard benchmark for text classification.
```python
from datasets import load_dataset

ag_news = load_dataset("ag_news")
```
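AG News labels are integers 0-3. A quick sketch of decoding them to class names, assuming the label order exposed by the Hugging Face dataset (World, Sports, Business, Sci/Tech):

```python
# Integer labels map to these four categories, in this order.
AG_CLASSES = ["World", "Sports", "Business", "Sci/Tech"]

label = 2  # illustrative label value, as stored in an example
class_name = AG_CLASSES[label]
```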
CoNLL (Named Entity Recognition)
CoNLL-2003 is the standard NER benchmark. Annotates entities as Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC) in news text.
```python
from datasets import load_dataset

conll = load_dataset("conll2003")
```
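Each CoNLL-2003 example stores `ner_tags` as integer IDs in IOB2 format. A small sketch of decoding them, assuming the label ordering used by the Hugging Face `conll2003` feature (the token/tag pair below is illustrative):

```python
# IOB2 label set in the order used by the HF "conll2003" ner_tags feature.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Illustrative tokens with their integer tag IDs.
tokens = ["EU", "rejects", "German", "call"]
ner_tags = [3, 0, 7, 0]

# Decode integer IDs back to human-readable entity labels.
decoded = [NER_LABELS[t] for t in ner_tags]
```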
WMT (Translation)
Workshop on Machine Translation. Annually released parallel corpora for training translation systems. Covers dozens of language pairs with millions of sentence pairs.
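WMT corpora can be loaded through Hugging Face datasets (e.g. `load_dataset("wmt14", "de-en")`); each example holds a sentence pair under a `translation` key indexed by language code. A minimal sketch of that structure; the sentence pair here is illustrative, not taken from the corpus:

```python
# A WMT-style example: one dict per sentence pair, keyed by language code.
example = {"translation": {"de": "Guten Morgen.", "en": "Good morning."}}

src = example["translation"]["de"]  # source sentence
tgt = example["translation"]["en"]  # target sentence
```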
CommonCrawl & The Pile
CommonCrawl is a massive web crawl dataset (petabytes). The Pile is a curated 800GB pretraining dataset combining 22 diverse sources (academic papers, books, code, web text). Used to pretrain large language models.
MMLU (LLM Evaluation)
Massive Multitask Language Understanding. 15,908 multiple-choice questions across 57 subjects (STEM, humanities, social sciences, etc.). The standard benchmark for evaluating LLM knowledge and reasoning.
```python
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")
```
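Each MMLU item has a `question`, four `choices`, and an integer `answer` index. A sketch of turning one into a multiple-choice prompt; the field names match the dataset schema, but this particular question is an illustrative stand-in:

```python
# Illustrative MMLU-style item (same fields as the real dataset).
item = {
    "question": "What is the time complexity of binary search?",
    "choices": ["O(n)", "O(log n)", "O(n log n)", "O(1)"],
    "answer": 1,  # index into choices
}

def format_prompt(item):
    """Render an item as a lettered multiple-choice prompt."""
    lines = [item["question"]]
    for letter, choice in zip("ABCD", item["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_prompt(item)
gold = "ABCD"[item["answer"]]  # the correct letter
```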
HumanEval (Code)
164 programming problems. OpenAI's benchmark for evaluating code generation models. Each problem has a function signature, docstring, and test cases. Measured by pass@k metric.
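pass@k estimates the probability that at least one of k sampled completions passes all tests. A minimal sketch of the unbiased estimator from the HumanEval paper, given n total samples of which c are correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: some correct sample
        # is guaranteed to appear in any draw of k.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

In practice you generate n completions per problem (n >= k), count how many pass the unit tests, and average `pass_at_k` over all 164 problems.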
NLP Datasets Summary
| Dataset | Size | Task | How to Load |
|---|---|---|---|
| GLUE (SST-2) | 67K | Sentiment | load_dataset("glue", "sst2") |
| SQuAD 2.0 | 150K | Question Answering | load_dataset("squad_v2") |
| IMDB | 50K | Sentiment | load_dataset("imdb") |
| AG News | 120K | Classification | load_dataset("ag_news") |
| CoNLL-2003 | 22K | NER | load_dataset("conll2003") |
| MMLU | 16K | LLM Evaluation | load_dataset("cais/mmlu", "all") |
| HumanEval | 164 | Code Generation | load_dataset("openai_humaneval") |
Next Up
Explore tabular and structured datasets from UCI, Kaggle, and government open data sources.
Next: Tabular Datasets →
Lilly Tech Systems