Hugging Face Ecosystem
Hugging Face has become the central hub for NLP and AI. Its open-source libraries and model repository make state-of-the-art NLP accessible to everyone.
The Hugging Face Platform
Hugging Face provides an integrated ecosystem of tools for working with machine learning models:
- Transformers Library: Python library for downloading and using pretrained models
- Model Hub: Repository of 500,000+ pretrained models
- Datasets Library: Access to 100,000+ datasets
- Tokenizers Library: Fast tokenization implementations
- Spaces: Host ML demo applications for free
The Transformers Library
The transformers library is the core of the Hugging Face ecosystem. It provides a unified API for working with thousands of pretrained models.
```bash
# Install the transformers library with the PyTorch backend
pip install transformers torch

# Or with the TensorFlow backend
pip install transformers tensorflow
```
Model Hub: Browse, Download, and Use
The Model Hub at huggingface.co/models hosts hundreds of thousands of pretrained models. You can filter by task, framework, language, and license.
```python
from transformers import AutoTokenizer, AutoModel

# Load any model from the Hub by name
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize and get model output
inputs = tokenizer("Hello, Hugging Face!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```
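A common next step with `last_hidden_state` is to pool the per-token vectors into a single sentence embedding. The toy function below is a pure-Python sketch of masked mean pooling (real code would operate on torch tensors): it averages only the positions where the attention mask is 1, so padding tokens never contaminate the embedding.

```python
# Sketch: masked mean pooling over a (seq_len, hidden_size) matrix.
# Plain lists stand in for torch tensors to keep the idea visible.

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors at positions where attention_mask == 1."""
    hidden_size = len(hidden_states[0])
    totals = [0.0] * hidden_size
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask == 1:
            count += 1
            for i, value in enumerate(vec):
                totals[i] += value
    return [t / count for t in totals]

# Two real tokens plus one padding position that is ignored
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # [2.0, 3.0]
```

With torch, the same idea is a masked sum divided by the mask's sum along the sequence axis.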
Pipeline API for Easy Inference
The pipeline API is the simplest way to use pretrained models. It handles tokenization, inference, and post-processing automatically.
```python
from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face is amazing!"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
print(ner("Hugging Face is based in New York City."))
# [{'entity_group': 'ORG', 'word': 'Hugging Face', ...},
#  {'entity_group': 'LOC', 'word': 'New York City', ...}]

# Question Answering
qa = pipeline("question-answering")
print(qa(
    question="What does HF provide?",
    context="Hugging Face provides tools for building NLP applications."
))
# {'answer': 'tools for building NLP applications', 'score': 0.95}

# Other common tasks
summarizer = pipeline("summarization")
translator = pipeline("translation_en_to_fr")
generator = pipeline("text-generation")

# Zero-Shot Classification
zero_shot = pipeline("zero-shot-classification")
print(zero_shot(
    "I want to buy a new laptop",
    candidate_labels=["technology", "food", "sports"]
))
```
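Zero-shot classification works by rephrasing each candidate label as a natural-language-inference hypothesis (for example, "This example is about technology.") and asking an NLI model how strongly the input entails it. The sketch below shows just the final step with hypothetical entailment logits (the numbers are invented for illustration): a softmax over per-label scores.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits (the pipeline's last step)."""
    exps = [math.exp(x) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["technology", "food", "sports"]
# Hypothetical entailment logits for "I want to buy a new laptop"
logits = [3.2, -1.0, -0.5]
scores = zero_shot_scores(logits)
best = labels[scores.index(max(scores))]
print(best)  # technology
```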
Tokenizers Library
The tokenizers library provides ultra-fast tokenization implementations in Rust. It supports BPE, WordPiece, and Unigram tokenization algorithms.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text
encoding = tokenizer("Hugging Face makes NLP easy!", padding=True, truncation=True)
print("Tokens:", tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print("IDs:", encoding["input_ids"])
print("Attention Mask:", encoding["attention_mask"])
```
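WordPiece, the algorithm behind BERT's tokenizer, splits a word into the longest subwords found in the vocabulary, prefixing word-internal pieces with `##`. The toy tokenizer below is a pure-Python sketch with a tiny hand-made vocabulary; the real implementation lives in Rust inside the tokenizers library, but the greedy longest-match idea is the same.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match subword split with BERT-style '##' continuation."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched at this position
        tokens.append(piece)
        start = end
    return tokens

# Tiny illustrative vocabulary
vocab = {"hug", "##ging", "##s", "face"}
print(wordpiece_tokenize("hugging", vocab))  # ['hug', '##ging']
print(wordpiece_tokenize("hugs", vocab))     # ['hug', '##s']
```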
Datasets Library
The datasets library provides access to thousands of datasets with a simple, consistent API:
```python
from datasets import load_dataset

# Load a popular NLP dataset
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['text', 'label'], num_rows: 25000})
#     test: Dataset({features: ['text', 'label'], num_rows: 25000})
# })

# Access a sample
print(dataset["train"][0]["text"][:100])
print(dataset["train"][0]["label"])  # 0 = negative, 1 = positive
```
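The workhorse for transforming a dataset is `map`, which in batched mode passes your function a dict of columns (each a list) and merges the returned columns back in. The pure-Python sketch below mimics that contract without downloading anything, so you can see the shape of the data your function receives.

```python
# Sketch of datasets.map(batched=True) semantics: the function receives
# a dict of columns and returns columns to add or overwrite.

def batched_map(batch_fn, columns):
    updates = batch_fn(columns)
    return {**columns, **updates}

data = {"text": ["great movie", "terrible plot"], "label": [1, 0]}

def add_length(batch):
    return {"num_words": [len(t.split()) for t in batch["text"]]}

result = batched_map(add_length, data)
print(result["num_words"])  # [2, 2]
```

The real `map` adds caching, multiprocessing, and on-disk (Arrow) storage on top of this same interface.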
Fine-Tuning with Trainer API
The Trainer API simplifies the fine-tuning process, handling training loops, evaluation, logging, and checkpointing:
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load and tokenize dataset
dataset = load_dataset("imdb")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Define training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
)

# Create Trainer and train
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```
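By default Trainer only reports loss during evaluation. To track task metrics, pass a `compute_metrics` function, which receives a `(logits, labels)` pair and returns a dict of named values. Below is a minimal accuracy metric as a NumPy sketch:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) tuple that Trainer passes in."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Tiny worked example: 3 of 4 predictions match the labels
logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([1, 0, 1, 1])
print(compute_metrics((logits, labels)))  # {'accuracy': 0.75}
```

Wire it in with `Trainer(..., compute_metrics=compute_metrics)` and the metric appears alongside loss at every evaluation step.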
Model Sharing
After fine-tuning, you can share your model on the Hub for others to use:
```python
# Log in to Hugging Face
from huggingface_hub import login
login(token="your_token_here")

# Push model and tokenizer to the Hub
model.push_to_hub("my-sentiment-model")
tokenizer.push_to_hub("my-sentiment-model")

# Anyone can now use your model:
# pipeline("sentiment-analysis", model="your-username/my-sentiment-model")
```
Spaces for Demos
Hugging Face Spaces lets you host interactive ML demos using Gradio or Streamlit. You can create a live demo of your model in minutes:
```python
import gradio as gr
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']}: {result['score']:.3f}"

demo = gr.Interface(
    fn=analyze,
    inputs=gr.Textbox(label="Enter text"),
    outputs=gr.Textbox(label="Sentiment"),
    title="Sentiment Analysis Demo",
)
demo.launch()
```