Intermediate

NLP Tasks

This lesson covers the five core NLP tasks tested in the certification: text classification, named entity recognition, question answering, summarization, and translation. For each task, you will learn the correct model class, dataset format, and implementation pattern using Hugging Face.

Text Classification

Text classification assigns a label to an entire text sequence. Common applications include sentiment analysis, topic classification, intent detection, and spam filtering.
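
Under the hood, a sequence-classification head outputs one logit per label; the pipeline softmaxes them and reports the argmax label with its probability as the score. A minimal sketch with made-up logits (the values below are illustrative, only the SST-2 label mapping matches the model used next):

```python
import math

# Made-up logits from a two-class sequence-classification head,
# with the SST-2 label mapping used by the pipeline example below.
logits = [-2.1, 3.4]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Softmax turns raw logits into probabilities.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# The pipeline reports the argmax label and its probability as 'score'.
pred_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[pred_id], round(probs[pred_id], 4))
```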

# Text Classification with Hugging Face
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Quick inference with pipeline
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This product is wonderful!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Multi-label classification
classifier = pipeline("text-classification",
                      model="cardiffnlp/twitter-roberta-base-emotion",
                      top_k=None)  # Return all labels with scores

# Fine-tuning for custom classification
from datasets import load_dataset

dataset = load_dataset("ag_news")  # 4-class news classification
# Classes: World, Sports, Business, Sci/Tech

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4,
    id2label={0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"},
    label2id={"World": 0, "Sports": 1, "Business": 2, "Sci/Tech": 3}
)

# Key models for classification:
classification_models = {
    "distilbert-base-uncased": "Fast, good for most tasks",
    "roberta-base": "Better accuracy than BERT",
    "albert-base-v2": "Parameter-efficient BERT variant (cross-layer weight sharing)",
    "xlm-roberta-base": "Multilingual classification"
}

Named Entity Recognition (NER)

NER identifies and classifies named entities (persons, organizations, locations, etc.) in text. It is a token-level classification task.

# Named Entity Recognition with Hugging Face
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

# Quick NER with pipeline
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english",
               aggregation_strategy="simple")
entities = ner("Hugging Face was founded in New York by Clement Delangue.")
# [{'entity_group': 'ORG', 'word': 'Hugging Face', 'score': 0.99},
#  {'entity_group': 'LOC', 'word': 'New York', 'score': 0.99},
#  {'entity_group': 'PER', 'word': 'Clement Delangue', 'score': 0.98}]

# Fine-tuning NER
from datasets import load_dataset

dataset = load_dataset("conll2003")
# Labels: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC

label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list)
)

# NER tokenization requires special handling for subwords
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Ignore special tokens
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)  # Ignore subword tokens
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized
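
The alignment logic can be sanity-checked without a tokenizer by hard-coding the word_ids() a fast tokenizer would return. The sentence and tags below are a made-up example: three words, with "Delangue" split into three subwords:

```python
# Simulated word_ids() for ["Clement", "Delangue", "smiled"] tokenized as
# ["[CLS]", "Clement", "Del", "##ang", "##ue", "smiled", "[SEP]"].
word_ids = [None, 0, 1, 1, 1, 2, None]
ner_tags = [1, 2, 0]  # B-PER, I-PER, O

label_ids, previous_word_idx = [], None
for word_idx in word_ids:
    if word_idx is None:
        label_ids.append(-100)                 # special tokens: ignored by the loss
    elif word_idx != previous_word_idx:
        label_ids.append(ner_tags[word_idx])   # first subword carries the word label
    else:
        label_ids.append(-100)                 # later subwords: ignored
    previous_word_idx = word_idx

print(label_ids)
```

Only the first subword of each word keeps a real label; everything else gets -100, which PyTorch's cross-entropy loss skips.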

# aggregation_strategy options:
strategies = {
    "none": "Return raw token-level predictions",
    "simple": "Group tokens of same entity, average scores",
    "first": "Use score of first token in entity",
    "average": "Average scores of all tokens in entity",
    "max": "Use maximum score among entity tokens"
}
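
The "simple" strategy can be sketched in plain Python: merge consecutive non-O predictions that share an entity type and average their scores. This is a simplified view (the real pipeline also consults B-/I- prefixes and word boundaries), and the token scores below are made up to echo the earlier example:

```python
# Token-level predictions: (word, entity type, score) -- illustrative values.
token_preds = [
    ("Hugging", "ORG", 0.99), ("Face", "ORG", 0.98),
    ("was", "O", 0.99), ("founded", "O", 0.99), ("in", "O", 0.99),
    ("New", "LOC", 0.99), ("York", "LOC", 0.98),
]

groups, current = [], None
for word, entity, score in token_preds:
    if entity == "O":
        if current:                      # an "O" token closes the running entity
            groups.append(current)
        current = None
        continue
    if current and current["entity_group"] == entity:
        current["word"] += " " + word    # extend the running entity
        current["scores"].append(score)
    else:
        if current:
            groups.append(current)
        current = {"entity_group": entity, "word": word, "scores": [score]}
if current:
    groups.append(current)

entities = [{"entity_group": g["entity_group"], "word": g["word"],
             "score": sum(g["scores"]) / len(g["scores"])}   # average the scores
            for g in groups]
print(entities)
```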

Question Answering

Extractive QA finds the answer span within a given context passage. The model predicts start and end positions of the answer in the context.
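
The span selection can be sketched with toy scores: among all (start, end) pairs where the end does not precede the start and the span stays within a maximum answer length, pick the pair with the highest combined score. The scores below are made up:

```python
# Made-up start/end scores over a 6-token context (higher = more likely).
start_scores = [0.1, 0.2, 4.0, 0.3, 0.1, 0.2]
end_scores   = [0.1, 0.1, 0.2, 0.3, 3.5, 0.1]
max_answer_len = 4

# Search all valid spans: end >= start, span length <= max_answer_len.
best_start, best_end, best_score = 0, 0, float("-inf")
for s, s_score in enumerate(start_scores):
    for e in range(s, min(s + max_answer_len, len(end_scores))):
        if s_score + end_scores[e] > best_score:
            best_start, best_end, best_score = s, e, s_score + end_scores[e]

print(best_start, best_end)  # the answer span is then decoded back to text
```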

# Extractive Question Answering
from transformers import pipeline, AutoModelForQuestionAnswering

# Quick QA with pipeline
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="What is Hugging Face?",
    context="Hugging Face is an AI company that builds tools for machine learning. "
            "They created the Transformers library which is used by thousands of companies."
)
# {'answer': 'an AI company that builds tools for machine learning',
#  'score': 0.92, 'start': 18, 'end': 67}

# Handling long contexts (longer than max_length)
long_qa = pipeline("question-answering",
                   model="deepset/roberta-base-squad2",
                   handle_impossible_answer=True)  # SQuAD 2.0: may return an empty answer

# For contexts longer than the model max length, use a sliding window
result = long_qa(
    question="What is the capital?",
    context=very_long_text,  # any string longer than the model's max length
    max_seq_len=384,   # Max total length per chunk
    doc_stride=128     # Overlap between consecutive chunks
)

# Key QA models:
qa_models = {
    "distilbert-base-cased-distilled-squad": "Fast, SQuAD 1.1",
    "deepset/roberta-base-squad2": "Handles unanswerable questions",
    "deepset/deberta-v3-base-squad2": "Best accuracy on SQuAD 2.0"
}

Summarization

Abstractive summarization generates a new, shorter text that conveys the key points of the source rather than copying sentences verbatim. Hugging Face summarization models are sequence-to-sequence models such as BART, PEGASUS, and T5.

# Abstractive Summarization
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
Hugging Face has released a new version of the Transformers library that includes
support for over 200,000 pre-trained models. The library now supports multimodal
models, including vision transformers and audio models. The update also includes
improved documentation and tutorials for new users.
"""

summary = summarizer(article,
                     max_length=60,     # Max output length
                     min_length=20,     # Min output length
                     do_sample=False)   # Deterministic (greedy)

# Key summarization models:
summarization_models = {
    "facebook/bart-large-cnn": "Best for news articles",
    "google/pegasus-xsum": "Best for extreme summarization",
    "philschmid/bart-large-cnn-samsum": "Best for dialogue summarization",
    "google/flan-t5-base": "Good general-purpose seq2seq"
}

Translation

Machine translation converts text from a source language into a target language using sequence-to-sequence models. Model choice depends on the language pair and on how many languages you need to cover.

# Machine Translation
from transformers import pipeline

# English to French
translator = pipeline("translation_en_to_fr",
                      model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hugging Face is the best NLP library.")
# [{'translation_text': "Hugging Face est la meilleure bibliothèque NLP."}]

# For language pairs without a direct pipeline name
translator = pipeline("translation",
                      model="Helsinki-NLP/opus-mt-en-de")  # English to German

# Multilingual translation with mBART
from transformers import MBartForConditionalGeneration, MBart50Tokenizer

model = MBartForConditionalGeneration.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)
tokenizer = MBart50Tokenizer.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt"
)

tokenizer.src_lang = "en_XX"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated = model.generate(**inputs,
                           forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
translation = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Key translation models:
translation_models = {
    "Helsinki-NLP/opus-mt-*": "Language-pair specific (fast, accurate)",
    "facebook/mbart-large-50": "50-language multilingual model",
    "facebook/nllb-200-distilled-600M": "200 languages, distilled",
    "google/flan-t5-base": "Can translate with instruction prompting"
}

Practice Questions

💡
Test your knowledge of NLP tasks with Hugging Face:
Q1: What AutoModel class should you use for NER?

Answer: AutoModelForTokenClassification. NER is a token-level classification task, where each token in the input receives a label (e.g., B-PER, I-PER, O). This is different from AutoModelForSequenceClassification, which assigns one label to the entire sequence.

Q2: Why do NER models need special tokenization handling for subwords?

Answer: Subword tokenizers (WordPiece, BPE) may split a single word into multiple tokens. For example, "Delangue" might become ["Del", "##ang", "##ue"]. Since labels are per-word, you must align labels to subword tokens. The common approach is to assign the word label to the first subword token and -100 (ignore index) to subsequent subword tokens.

Q3: What is the difference between extractive and abstractive summarization?

Answer: Extractive summarization selects and concatenates the most important sentences directly from the source text. Abstractive summarization generates new sentences that may not appear in the source, paraphrasing and condensing the content. Hugging Face primarily uses abstractive summarization models (BART, PEGASUS, T5).
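
To make the contrast concrete, here is a toy extractive baseline (plain Python, not a Hugging Face API): score each sentence by the average corpus frequency of its words and keep the top sentence. The text is a made-up example:

```python
from collections import Counter

text = ("Transformers models power modern NLP. "
        "The Transformers library makes these models easy to use. "
        "Many tutorials exist online.")

def words(s):
    # Lowercase and strip sentence-final periods.
    return s.lower().replace(".", "").split()

freq = Counter(words(text))
sentences = [s.strip() for s in text.split(". ") if s.strip()]

def score(sentence):
    ws = words(sentence)
    return sum(freq[w] for w in ws) / len(ws)  # average word frequency

# "Extractive" summary: the single highest-scoring source sentence, verbatim.
summary = max(sentences, key=score)
print(summary)
```

An abstractive model like BART would instead generate a new sentence, possibly using words that never appear in the source.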

Q4: How does extractive QA work under the hood?

Answer: The model receives a concatenated input of [question, context] and predicts two probability distributions over all tokens: one for the start position and one for the end position of the answer span. The answer is extracted as the substring from the start token to the end token in the context. The score is the product of the start and end probabilities (the softmaxed start and end logits).

Q5: When would you use mBART over Helsinki-NLP/opus-mt models for translation?

Answer: Use opus-mt models when you have a specific language pair (e.g., en-fr) and want fast, accurate, lightweight translation. Use mBART when you need a single model that can translate between many language pairs (50+ languages), when the specific language pair does not have an opus-mt model, or when you need many-to-many translation capabilities.

Key Takeaways

💡
  • Each NLP task has a specific AutoModel class — use SequenceClassification for text-level, TokenClassification for NER, QuestionAnswering for extractive QA
  • NER requires special subword-to-label alignment during tokenization
  • Extractive QA predicts start and end positions in the context; use doc_stride for long contexts
  • Summarization models (BART, PEGASUS) generate abstractive summaries with configurable length
  • Choose task-specific models (opus-mt) for speed or multilingual models (mBART) for flexibility