NLP Datasets (Intermediate)

Essential datasets for natural language processing, from benchmark suites and question answering to sentiment analysis, named entity recognition, and LLM evaluation.

GLUE / SuperGLUE (Benchmarks)

GLUE (General Language Understanding Evaluation) is a collection of 9 NLP tasks used to benchmark language models. SuperGLUE is a harder follow-up. Tasks include sentiment analysis, textual entailment, semantic similarity, and question answering.

Python
from datasets import load_dataset

# Load a GLUE task (e.g., SST-2 for sentiment)
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])
# {'sentence': 'hide new secretions from...', 'label': 0, 'idx': 0}

SQuAD (Question Answering)

The Stanford Question Answering Dataset: 100,000+ question-answer pairs. Given a passage from Wikipedia, answer questions by extracting spans of text. SQuAD 2.0 adds over 50,000 unanswerable questions, so models must also learn to abstain.

Python
from datasets import load_dataset
squad = load_dataset("squad_v2")
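
Each record carries the question, the context passage, and the gold answers with their character offsets. The snippet below sketches how a span is recovered from those offsets, using a hand-written record that mirrors the SQuAD v2 schema (the record and the extract_answer helper are illustrative, not part of the dataset):

```python
# Hand-written record mimicking the SQuAD v2 schema (illustrative).
record = {
    "question": "What color is the sky?",
    "context": "On a clear day the sky is blue.",
    "answers": {"text": ["blue"], "answer_start": [26]},
}

def extract_answer(rec):
    """Return the gold answer span, or None for unanswerable questions."""
    if not rec["answers"]["text"]:  # SQuAD 2.0: empty answers = unanswerable
        return None
    start = rec["answers"]["answer_start"][0]
    text = rec["answers"]["text"][0]
    # The span can be recovered directly from the context offsets.
    assert rec["context"][start:start + len(text)] == text
    return text

print(extract_answer(record))  # blue
```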

IMDB Reviews (Sentiment)

50,000 movie reviews, binary sentiment (positive/negative). The go-to dataset for sentiment analysis. 25K train + 25K test, evenly balanced.

Python
from datasets import load_dataset
imdb = load_dataset("imdb")
print(imdb["train"][0]["text"][:100])  # Review text
print(imdb["train"][0]["label"])       # 0=negative, 1=positive

AG News (Classification)

120,000 news articles, 4 classes. Categories: World, Sports, Business, Sci/Tech. A standard benchmark for text classification.

Python
from datasets import load_dataset
ag_news = load_dataset("ag_news")

CoNLL (Named Entity Recognition)

CoNLL-2003 is the standard NER benchmark. Annotates entities as Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC) in news text.

Python
from datasets import load_dataset
conll = load_dataset("conll2003")
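
Each record exposes pre-split tokens and integer ner_tags using BIO encoding (B- opens an entity, I- continues it, O is outside). A common post-processing step is grouping tagged tokens back into entity spans; the helper below is an illustrative sketch using the label order from the Hugging Face conll2003 loader:

```python
# Label list as exposed by the Hugging Face conll2003 loader.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = LABELS[tag_id]
        if tag.startswith("B-"):            # start of a new entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continuation
            current.append(token)
        else:                               # "O" closes any open span
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

# "EU rejects German call" tagged B-ORG, O, B-MISC, O
print(extract_entities(["EU", "rejects", "German", "call"], [3, 0, 7, 0]))
# [('EU', 'ORG'), ('German', 'MISC')]
```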

WMT (Translation)

The Workshop on Machine Translation (WMT) releases new parallel corpora each year for training and evaluating translation systems. It covers dozens of language pairs with millions of sentence pairs.
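
WMT records on the Hub are nested dicts keyed by language code. Since the corpora are large to download, the hand-made record below just mirrors the schema of a pair like load_dataset("wmt14", "de-en") (the example sentence is illustrative, not from the corpus):

```python
# Illustrative record mirroring the WMT schema on the Hugging Face Hub:
# each row is {"translation": {<src_lang>: ..., <tgt_lang>: ...}}.
example = {"translation": {"de": "Guten Morgen", "en": "Good morning"}}

src = example["translation"]["de"]
tgt = example["translation"]["en"]
print(f"{src} -> {tgt}")  # Guten Morgen -> Good morning
```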

Common Crawl & The Pile

Common Crawl is a massive, continuously updated web crawl dataset (petabytes of raw data). The Pile is a curated 800GB pretraining dataset combining 22 diverse sources (academic papers, books, code, web text). Both are used to pretrain large language models.

MMLU (LLM Evaluation)

Massive Multitask Language Understanding. 15,908 multiple-choice questions across 57 subjects (STEM, humanities, social sciences, etc.). The standard benchmark for evaluating LLM knowledge and reasoning.

Python
from datasets import load_dataset
mmlu = load_dataset("cais/mmlu", "all")
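
Each MMLU record holds a question, four choices, and the index of the correct one. Evaluation harnesses typically render this as an A/B/C/D prompt; the sketch below uses a hand-written record mirroring the cais/mmlu schema (the record and format_prompt helper are illustrative):

```python
# Hand-written record mirroring the cais/mmlu schema (illustrative).
record = {
    "question": "What is the chemical symbol for gold?",
    "choices": ["Ag", "Au", "Gd", "Go"],
    "answer": 1,  # index into choices
}

def format_prompt(rec):
    """Render a record as a standard A/B/C/D multiple-choice prompt."""
    lines = [rec["question"]]
    for letter, choice in zip("ABCD", rec["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

print(format_prompt(record))
print("Gold answer:", "ABCD"[record["answer"]])  # Gold answer: B
```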

HumanEval (Code)

164 hand-written programming problems. OpenAI's benchmark for evaluating code generation models. Each problem has a function signature, docstring, and test cases. Performance is measured by the pass@k metric: the probability that at least one of k sampled completions passes the tests.
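
The unbiased pass@k estimator from the HumanEval paper draws n samples per problem, counts the c that pass, and computes 1 - C(n-c, k)/C(n, k). A minimal sketch (the pass_at_k name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n completions were sampled per problem and c passed."""
    if n - c < k:        # every size-k subset contains a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passing: a random single sample passes 3/10 of the time.
print(pass_at_k(10, 3, 1))  # ~0.3
```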

NLP Datasets Summary

Dataset      | Size | Task               | How to Load
GLUE (SST-2) | 67K  | Sentiment          | load_dataset("glue", "sst2")
SQuAD 2.0    | 150K | Question Answering | load_dataset("squad_v2")
IMDB         | 50K  | Sentiment          | load_dataset("imdb")
AG News      | 120K | Classification     | load_dataset("ag_news")
CoNLL-2003   | 22K  | NER                | load_dataset("conll2003")
MMLU         | 16K  | LLM Evaluation     | load_dataset("cais/mmlu", "all")
HumanEval    | 164  | Code Generation    | load_dataset("openai_humaneval")

Next Up

Explore tabular and structured datasets from UCI, Kaggle, and government open data sources.

Next: Tabular Datasets →