Hugging Face Ecosystem
Hugging Face has become the central hub for NLP and AI. Its open-source libraries and model repository make state-of-the-art NLP accessible to everyone.
The Hugging Face Platform
Hugging Face provides an integrated ecosystem of tools for working with machine learning models:
- Transformers Library: Python library for downloading and using pretrained models
- Model Hub: Repository of 500,000+ pretrained models
- Datasets Library: Access to 100,000+ datasets
- Tokenizers Library: Fast tokenization implementations
- Spaces: Host ML demo applications for free
The Transformers Library
The transformers library is the core of the Hugging Face ecosystem. It provides a unified API for working with thousands of pretrained models.
```bash
# Install the transformers library with the PyTorch backend
pip install transformers torch

# Or with the TensorFlow backend
pip install transformers tensorflow
```
Model Hub: Browse, Download, and Use
The Model Hub at huggingface.co/models hosts hundreds of thousands of pretrained models. You can filter by task, framework, language, and license.
```python
from transformers import AutoTokenizer, AutoModel

# Load any model from the Hub by name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and get model output
inputs = tokenizer("Hello, Hugging Face!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```
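A common next step with `last_hidden_state` is to pool the per-token vectors into a single sentence embedding. The toy function below is a pure-Python sketch of masked mean pooling (real code would operate on torch tensors): it averages only the positions where the attention mask is 1, so padding tokens never contaminate the embedding.

```python
# Sketch: masked mean pooling over a (seq_len, hidden_size) matrix.
# Plain lists stand in for torch tensors to keep the idea visible.

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors at positions where attention_mask == 1."""
    hidden_size = len(hidden_states[0])
    totals = [0.0] * hidden_size
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask == 1:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    return [t / count for t in totals]

# Two real tokens plus one padding position that is ignored
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # [2.0, 3.0]
```

With torch, the same idea is a masked sum divided by the mask's sum along the sequence axis.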
Pipeline API for Easy Inference
The pipeline API is the simplest way to use pretrained models. It handles tokenization, inference, and post-processing automatically.
```python
from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face is amazing!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
print(ner("Hugging Face is based in New York City."))
# [{'entity_group': 'ORG', 'word': 'Hugging Face', ...},
#  {'entity_group': 'LOC', 'word': 'New York City', ...}]

# Question Answering
qa = pipeline("question-answering")
print(qa(
    question="What does HF provide?",
    context="Hugging Face provides tools for building NLP applications."
))
# {'answer': 'tools for building NLP applications', 'score': 0.95}

# Other common tasks
summarizer = pipeline("summarization")
translator = pipeline("translation_en_to_fr")
generator = pipeline("text-generation")

# Zero-Shot Classification
zero_shot = pipeline("zero-shot-classification")
print(zero_shot(
    "I want to buy a new laptop",
    candidate_labels=["technology", "food", "sports"]
))
```
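Zero-shot classification works by rephrasing each candidate label as a natural-language-inference hypothesis (for example, "This example is about technology.") and asking an NLI model how strongly the input entails it. The sketch below shows just the final step with hypothetical entailment logits (the numbers are invented for illustration): a softmax over per-label scores.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits (the pipeline's last step)."""
    exps = [math.exp(x) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["technology", "food", "sports"]
# Hypothetical entailment logits for "I want to buy a new laptop"
logits = [3.2, -1.0, -0.5]
scores = zero_shot_scores(logits)
best = labels[scores.index(max(scores))]
print(best)  # technology
```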
Tokenizers Library
The tokenizers library provides ultra-fast tokenization implementations in Rust. It supports BPE, WordPiece, and Unigram tokenization algorithms.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text
encoding = tokenizer("Hugging Face makes NLP easy!", padding=True, truncation=True)
print("Tokens:", tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print("IDs:", encoding["input_ids"])
print("Attention Mask:", encoding["attention_mask"])
```
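WordPiece, the algorithm behind BERT's tokenizer, splits a word into the longest subwords found in the vocabulary, prefixing word-internal pieces with `##`. The toy tokenizer below is a pure-Python sketch with a tiny hand-made vocabulary; the real implementation lives in Rust inside the tokenizers library, but the greedy longest-match idea is the same.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match subword split with BERT-style '##' continuation."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched at this position
        tokens.append(piece)
        start = end
    return tokens

# Tiny illustrative vocabulary
vocab = {"hug", "##ging", "##s", "face"}
print(wordpiece_tokenize("hugging", vocab))  # ['hug', '##ging']
print(wordpiece_tokenize("hugs", vocab))     # ['hug', '##s']
```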
Datasets Library
The datasets library provides access to thousands of datasets with a simple, consistent API:
```python
from datasets import load_dataset

# Load a popular NLP dataset
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 25000})
#     test: Dataset({features: ['text', 'label'], num_rows: 25000})
# })

# Access a sample
print(dataset["train"][0]["text"][:100])
print(dataset["train"][0]["label"])  # 0 = negative, 1 = positive
```
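The workhorse for transforming a dataset is `map`, which in batched mode passes your function a dict of columns (each a list) and merges the returned columns back in. The pure-Python sketch below mimics that contract without downloading anything, so you can see the shape of the data your function receives.

```python
# Sketch of datasets.map(batched=True) semantics: the function receives
# a dict of columns and returns columns to add or overwrite.

def batched_map(batch_fn, columns):
    updates = batch_fn(columns)
    return {**columns, **updates}

data = {"text": ["great movie", "terrible plot"], "label": [1, 0]}

def add_length(batch):
    return {"num_words": [len(t.split()) for t in batch["text"]]}

result = batched_map(add_length, data)
print(result["num_words"])  # [2, 2]
```

The real `map` adds caching, multiprocessing, and on-disk (Arrow) storage on top of this same interface.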
Fine-Tuning with Trainer API
The Trainer API simplifies the fine-tuning process, handling training loops, evaluation, logging, and checkpointing:
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load and tokenize dataset
dataset = load_dataset("imdb")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Define training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
)

# Create Trainer and train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```
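By default Trainer only reports loss during evaluation. To track task metrics, pass a `compute_metrics` function, which receives a `(logits, labels)` pair and returns a dict of named values. Below is a minimal accuracy metric as a NumPy sketch:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) tuple that Trainer passes in."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Tiny worked example: 3 of 4 predictions match the labels
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, 1, 1])
print(compute_metrics((logits, labels)))  # {'accuracy': 0.75}
```

Wire it in with `Trainer(..., compute_metrics=compute_metrics)` and the metric appears alongside loss at every evaluation step.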
Model Sharing
After fine-tuning, you can share your model on the Hub for others to use:
```python
# Log in to Hugging Face
from huggingface_hub import login
login(token="your_token_here")

# Push model and tokenizer to the Hub
model.push_to_hub("my-sentiment-model")
tokenizer.push_to_hub("my-sentiment-model")

# Anyone can now use your model:
# pipeline("sentiment-analysis", model="your-username/my-sentiment-model")
```
Spaces for Demos
Hugging Face Spaces lets you host interactive ML demos using Gradio or Streamlit. You can create a live demo of your model in minutes:
```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']}: {result['score']:.3f}"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Textbox(label="Sentiment"),
    title="Sentiment Analysis Demo",
)
demo.launch()
```