Transformers
Discover the architecture that revolutionized deep learning: self-attention, multi-head attention, positional encoding, and the models that changed everything — BERT, GPT, and beyond.
The Transformer Architecture
The Transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It replaced recurrent processing with self-attention, allowing the model to process all positions in a sequence simultaneously rather than one at a time.
This parallel processing brought two major advantages:
- Speed: Transformers can be trained much faster than RNNs because they process all tokens in parallel on GPUs.
- Long-range dependencies: Every token can directly attend to every other token, regardless of distance, solving the vanishing gradient problem that plagued RNNs.
Self-Attention Mechanism
Self-attention is the core innovation of the Transformer. For each token in the input, it computes how much attention to pay to every other token. This is done through three learned projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"

The attention output is computed as:

```
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
```

- The dot product Q * K^T measures similarity between queries and keys.
- Division by sqrt(d_k) prevents the dot products from growing too large.
- Softmax converts the scaled scores into probabilities (attention weights).
- Multiplying by V produces a weighted sum of the value vectors.
For example, in the sentence "The cat sat on the mat because it was tired," self-attention helps the model understand that "it" refers to "cat" by assigning high attention weights between those tokens.
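The formula above can be sketched directly in NumPy. This is a minimal illustration, not a library implementation: the random matrix `X` stands in for token embeddings that have already passed through learned Q/K/V projections (here Q = K = V = X for simplicity).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return output and weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity between tokens
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(w.sum(axis=-1))  # each token's attention weights sum to 1
```

Each row of `w` is one token's attention distribution over the whole sequence; in the pronoun example above, the row for "it" would place high weight on "cat".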
Multi-Head Attention
Rather than computing a single attention function, Transformers use multi-head attention: multiple attention heads running in parallel, each learning different relationships.
- One head might learn syntactic relationships (subject-verb agreement).
- Another might learn semantic relationships (pronoun-noun reference).
- Another might focus on positional patterns (adjacent words).
The outputs of all heads are concatenated and linearly projected. Typical models use 8 to 16 attention heads per layer.
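The split-attend-concatenate-project pattern can be sketched as follows. This is a simplified NumPy sketch (real implementations batch the heads with tensor reshapes rather than a Python loop, and the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X, split into heads, attend per head, concat, project out."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice
        q, k, v = Q[:, s], K[:, s], V[:, s]
        scores = q @ k.T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)        # per-head softmax
        head_outputs.append(w @ v)
    # Concatenate all heads, then apply the output projection
    return np.concatenate(head_outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, seq_len = 8, 3
X = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]  # Wq, Wk, Wv, Wo
out = multi_head_attention(X, *W, num_heads=2)
print(out.shape)  # same shape as the input: (3, 8)
```

Because each head attends over a lower-dimensional slice (d_model / num_heads), multi-head attention costs roughly the same as a single full-width head while letting each head specialize.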
Positional Encoding
Since Transformers process all tokens simultaneously (unlike RNNs which process sequentially), they have no inherent notion of position. Positional encodings are added to the input embeddings to give the model information about token order.
The original paper uses sine and cosine functions of different frequencies. Modern models often use learned positional embeddings or relative position encodings (like RoPE in LLaMA).
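The original sinusoidal scheme from the paper can be written in a few lines of NumPy. A minimal sketch: even dimensions get sin(pos / 10000^(2i/d_model)), odd dimensions get the matching cosine.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16): one encoding vector per position
print(pe[0, :4])  # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is simply added to the token embeddings. Each frequency varies at a different rate along the sequence, so nearby positions get similar vectors and any fixed offset corresponds to a consistent linear transformation.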
Encoder-Decoder Structure
The original Transformer has two main components:
- Encoder: Processes the input sequence. Each layer has multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization. The encoder produces rich contextual representations.
- Decoder: Generates the output sequence token by token. In addition to self-attention and feed-forward layers, it has cross-attention layers that attend to the encoder's output. It uses masked self-attention to prevent looking at future tokens during generation.
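The decoder's masked self-attention can be sketched by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. A minimal NumPy illustration (again using the input itself as Q, K, and V):

```python
import numpy as np

def causal_self_attention(X):
    """Masked self-attention: token i may only attend to tokens 0..i."""
    seq_len, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    # True above the diagonal = future positions, which must be hidden
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # -inf becomes 0 after softmax
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X, w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
_, w = causal_self_attention(X)
print(np.round(w, 2))  # upper triangle is all zeros: no peeking ahead
```

This mask is what lets the decoder be trained on whole sequences in parallel while still behaving, at generation time, as if it only ever saw the past.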
Foundation Models
The Transformer architecture spawned three families of foundation models:
| Model Family | Architecture | Pre-training Task | Best For |
|---|---|---|---|
| BERT (Google, 2018) | Encoder-only | Masked language modeling & next sentence prediction | Understanding: classification, NER, Q&A |
| GPT (OpenAI, 2018–) | Decoder-only | Next token prediction (autoregressive) | Generation: text, code, conversation |
| T5 (Google, 2019) | Encoder-decoder | Text-to-text (all tasks as text generation) | Versatile: translation, summarization, Q&A |
Modern Foundation Models
The Transformer architecture has been scaled to create increasingly powerful models:
- GPT-4 / GPT-4o (OpenAI): Multimodal models that understand text and images, powering ChatGPT.
- Claude (Anthropic): Safety-focused models with long context windows (200K tokens).
- Gemini (Google): Multimodal models integrated with Google's ecosystem.
- LLaMA (Meta): Open-weight models that enabled open-source LLM development.
- Vision Transformers (ViT): Applying Transformers to images, competing with CNNs.
Using Transformers with Hugging Face
```python
from transformers import pipeline, AutoTokenizer, AutoModel

# Sentiment analysis with a pre-trained model
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers are amazing for NLP!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")
text = generator(
    "Deep learning has revolutionized",
    max_length=50,
    num_return_sequences=1
)
print(text[0]['generated_text'])

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
entities = ner("Anthropic released Claude in San Francisco")
print(entities)
# [{'entity_group': 'ORG', 'word': 'Anthropic', ...},
#  {'entity_group': 'PER', 'word': 'Claude', ...},
#  {'entity_group': 'LOC', 'word': 'San Francisco', ...}]

# Load any model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)
# torch.Size([1, 5, 768])  (batch, tokens, hidden_dim)
```
Use the pipeline API for quick tasks, then move to loading models and tokenizers directly for fine-tuning and custom architectures.