NLP Datasets (Intermediate)

Essential datasets for natural language processing, from benchmark suites and question answering to sentiment analysis, named entity recognition, and LLM evaluation.

GLUE / SuperGLUE (Benchmarks)

GLUE (General Language Understanding Evaluation) is a collection of 9 NLP tasks used to benchmark language models. SuperGLUE is a harder follow-up. Tasks include sentiment analysis, textual entailment, semantic similarity, and question answering.

Python
from datasets import load_dataset

# Load a GLUE task (e.g., SST-2 for sentiment)
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])
# {'sentence': 'hide new secretions from...', 'label': 0, 'idx': 0}

SQuAD (Question Answering)

The Stanford Question Answering Dataset: 100,000+ question-answer pairs. Given a passage from Wikipedia, answer questions by extracting spans of text. SQuAD 2.0 adds over 50,000 unanswerable questions, so models must also learn to abstain.

Python
from datasets import load_dataset
squad = load_dataset("squad_v2")
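
Each record carries the question, the context passage, and the gold answers with their character offsets. The snippet below sketches how a span is recovered from those offsets, using a hand-written record that mirrors the SQuAD v2 schema (the record and the extract_answer helper are illustrative, not part of the dataset):

```python
# Hand-written record mimicking the SQuAD v2 schema (illustrative).
record = {
    "question": "What color is the sky?",
    "context": "On a clear day the sky is blue.",
    "answers": {"text": ["blue"], "answer_start": [26]},
}

def extract_answer(rec):
    """Return the gold answer span, or None for unanswerable questions."""
    if not rec["answers"]["text"]:  # SQuAD 2.0: empty answers = unanswerable
        return None
    start = rec["answers"]["answer_start"][0]
    text = rec["answers"]["text"][0]
    # The span can be recovered directly from the context offsets.
    assert rec["context"][start:start + len(text)] == text
    return text

print(extract_answer(record))  # blue
```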

IMDB Reviews (Sentiment)

50,000 movie reviews, binary sentiment (positive/negative). The go-to dataset for sentiment analysis. 25K train + 25K test, evenly balanced.

Python
from datasets import load_dataset
imdb = load_dataset("imdb")
print(imdb["train"][0]["text"][:100])  # Review text
print(imdb["train"][0]["label"])       # 0=negative, 1=positive

AG News (Classification)

120,000 news articles, 4 classes. Categories: World, Sports, Business, Sci/Tech. A standard benchmark for text classification.

Python
from datasets import load_dataset
ag_news = load_dataset("ag_news")

CoNLL (Named Entity Recognition)

CoNLL-2003 is the standard NER benchmark. Annotates entities as Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC) in news text.

Python
from datasets import load_dataset
conll = load_dataset("conll2003")
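
Each record exposes pre-split tokens and integer ner_tags using BIO encoding (B- opens an entity, I- continues it, O is outside). A common post-processing step is grouping tagged tokens back into entity spans; the helper below is an illustrative sketch using the label order from the Hugging Face conll2003 loader:

```python
# Label list as exposed by the Hugging Face conll2003 loader.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
          "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    entities, current, current_type = [], [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = LABELS[tag_id]
        if tag.startswith("B-"):            # start of a new entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continuation
            current.append(token)
        else:                               # "O" closes any open span
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

# "EU rejects German call" tagged B-ORG, O, B-MISC, O
print(extract_entities(["EU", "rejects", "German", "call"], [3, 0, 7, 0]))
# [('EU', 'ORG'), ('German', 'MISC')]
```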

WMT (Translation)

The Workshop on Machine Translation (WMT) releases new parallel corpora each year for training and evaluating translation systems. It covers dozens of language pairs with millions of sentence pairs.
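
WMT records on the Hub are nested dicts keyed by language code. Since the corpora are large to download, the hand-made record below just mirrors the schema of a pair like load_dataset("wmt14", "de-en") (the example sentence is illustrative, not from the corpus):

```python
# Illustrative record mirroring the WMT schema on the Hugging Face Hub:
# each row is {"translation": {<src_lang>: ..., <tgt_lang>: ...}}.
example = {"translation": {"de": "Guten Morgen", "en": "Good morning"}}

src = example["translation"]["de"]
tgt = example["translation"]["en"]
print(f"{src} -> {tgt}")  # Guten Morgen -> Good morning
```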

Common Crawl & The Pile

Common Crawl is a massive, continuously updated web crawl dataset (petabytes of raw data). The Pile is a curated 800GB pretraining dataset combining 22 diverse sources (academic papers, books, code, web text). Both are used to pretrain large language models.

MMLU (LLM Evaluation)

Massive Multitask Language Understanding. 15,908 multiple-choice questions across 57 subjects (STEM, humanities, social sciences, etc.). The standard benchmark for evaluating LLM knowledge and reasoning.

Python
from datasets import load_dataset
mmlu = load_dataset("cais/mmlu", "all")
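
Each MMLU record holds a question, four choices, and the index of the correct one. Evaluation harnesses typically render this as an A/B/C/D prompt; the sketch below uses a hand-written record mirroring the cais/mmlu schema (the record and format_prompt helper are illustrative):

```python
# Hand-written record mirroring the cais/mmlu schema (illustrative).
record = {
    "question": "What is the chemical symbol for gold?",
    "choices": ["Ag", "Au", "Gd", "Go"],
    "answer": 1,  # index into choices
}

def format_prompt(rec):
    """Render a record as a standard A/B/C/D multiple-choice prompt."""
    lines = [rec["question"]]
    for letter, choice in zip("ABCD", rec["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

print(format_prompt(record))
print("Gold answer:", "ABCD"[record["answer"]])  # Gold answer: B
```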

HumanEval (Code)

164 hand-written programming problems. OpenAI's benchmark for evaluating code generation models. Each problem has a function signature, docstring, and test cases. Performance is measured by the pass@k metric: the probability that at least one of k sampled completions passes the tests.
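
The unbiased pass@k estimator from the HumanEval paper draws n samples per problem, counts the c that pass, and computes 1 - C(n-c, k)/C(n, k). A minimal sketch (the pass_at_k name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n completions were sampled per problem and c passed."""
    if n - c < k:        # every size-k subset contains a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples, 3 passing: a random single sample passes 3/10 of the time.
print(pass_at_k(10, 3, 1))  # ~0.3
```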

NLP Datasets Summary

Dataset      | Size | Task               | How to Load
GLUE (SST-2) | 67K  | Sentiment          | load_dataset("glue", "sst2")
SQuAD 2.0    | 150K | Question Answering | load_dataset("squad_v2")
IMDB         | 50K  | Sentiment          | load_dataset("imdb")
AG News      | 120K | Classification     | load_dataset("ag_news")
CoNLL-2003   | 22K  | NER                | load_dataset("conll2003")
MMLU         | 16K  | LLM Evaluation     | load_dataset("cais/mmlu", "all")
HumanEval    | 164  | Code Generation    | load_dataset("openai_humaneval")

Next Up

Explore tabular and structured datasets from UCI, Kaggle, and government open data sources.

Next: Tabular Datasets →