Transformers
Discover the architecture that revolutionized deep learning: self-attention, multi-head attention, positional encoding, and the models that changed everything — BERT, GPT, and beyond.
The Transformer Architecture
The Transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It replaced recurrent processing with self-attention, allowing the model to process all positions in a sequence simultaneously rather than one at a time.
This parallel processing brought two major advantages:
- Speed: Transformers can be trained much faster than RNNs because they process all tokens in parallel on GPUs.
- Long-range dependencies: Every token can directly attend to every other token, regardless of distance, solving the vanishing gradient problem that plagued RNNs.
Self-Attention Mechanism
Self-attention is the core innovation of the Transformer. For each token in the input, it computes how much attention to pay to every other token. This is done through three learned projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"

The attention output is computed as:

```
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
```

- The dot product Q * K^T measures similarity between queries and keys.
- Division by sqrt(d_k) prevents the dot products from growing too large.
- Softmax converts the scaled scores into probabilities (attention weights).
- Multiplying by V produces a weighted sum of the value vectors.
For example, in the sentence "The cat sat on the mat because it was tired," self-attention helps the model understand that "it" refers to "cat" by assigning high attention weights between those tokens.
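The formula above can be sketched directly in NumPy. This is a minimal illustration, not a library implementation: the random matrix `X` stands in for token embeddings that have already passed through learned Q/K/V projections (here Q = K = V = X for simplicity).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V and return output and weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity between tokens
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(w.sum(axis=-1))  # each token's attention weights sum to 1
```

Each row of `w` is one token's attention distribution over the whole sequence; in the pronoun example above, the row for "it" would place high weight on "cat".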
Multi-Head Attention
Rather than computing a single attention function, Transformers use multi-head attention: multiple attention heads running in parallel, each learning different relationships.
- One head might learn syntactic relationships (subject-verb agreement).
- Another might learn semantic relationships (pronoun-noun reference).
- Another might focus on positional patterns (adjacent words).
The outputs of all heads are concatenated and linearly projected. Typical models use 8 to 16 attention heads per layer.
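The split-attend-concatenate-project pattern can be sketched as follows. This is a simplified NumPy sketch (real implementations batch the heads with tensor reshapes rather than a Python loop, and the projection matrices here are random stand-ins for learned weights):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X, split into heads, attend per head, concat, project out."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)   # this head's slice
        q, k, v = Q[:, s], K[:, s], V[:, s]
        scores = q @ k.T / np.sqrt(d_head)
        scores -= scores.max(axis=-1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)        # per-head softmax
        head_outputs.append(w @ v)
    # Concatenate all heads, then apply the output projection
    return np.concatenate(head_outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, seq_len = 8, 3
X = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]  # Wq, Wk, Wv, Wo
out = multi_head_attention(X, *W, num_heads=2)
print(out.shape)  # same shape as the input: (3, 8)
```

Because each head attends over a lower-dimensional slice (d_model / num_heads), multi-head attention costs roughly the same as a single full-width head while letting each head specialize.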
Positional Encoding
Since Transformers process all tokens simultaneously (unlike RNNs which process sequentially), they have no inherent notion of position. Positional encodings are added to the input embeddings to give the model information about token order.
The original paper uses sine and cosine functions of different frequencies. Modern models often use learned positional embeddings or relative position encodings (like RoPE in LLaMA).
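The original sinusoidal scheme from the paper can be written in a few lines of NumPy. A minimal sketch: even dimensions get sin(pos / 10000^(2i/d_model)), odd dimensions get the matching cosine.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)   # (50, 16): one encoding vector per position
print(pe[0, :4])  # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is simply added to the token embeddings. Each frequency varies at a different rate along the sequence, so nearby positions get similar vectors and any fixed offset corresponds to a consistent linear transformation.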
Encoder-Decoder Structure
The original Transformer has two main components:
- Encoder: Processes the input sequence. Each layer has multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization. The encoder produces rich contextual representations.
- Decoder: Generates the output sequence token by token. In addition to self-attention and feed-forward layers, it has cross-attention layers that attend to the encoder's output. It uses masked self-attention to prevent looking at future tokens during generation.
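The decoder's masked self-attention can be sketched by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. A minimal NumPy illustration (again using the input itself as Q, K, and V):

```python
import numpy as np

def causal_self_attention(X):
    """Masked self-attention: token i may only attend to tokens 0..i."""
    seq_len, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    # True above the diagonal = future positions, which must be hidden
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)   # -inf becomes 0 after softmax
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X, w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
_, w = causal_self_attention(X)
print(np.round(w, 2))  # upper triangle is all zeros: no peeking ahead
```

This mask is what lets the decoder be trained on whole sequences in parallel while still behaving, at generation time, as if it only ever saw the past.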
Foundation Models
The Transformer architecture spawned three families of foundation models:
| Model Family | Architecture | Pre-training Task | Best For |
|---|---|---|---|
| BERT (Google, 2018) | Encoder-only | Masked language modeling & next sentence prediction | Understanding: classification, NER, Q&A |
| GPT (OpenAI, 2018–) | Decoder-only | Next token prediction (autoregressive) | Generation: text, code, conversation |
| T5 (Google, 2019) | Encoder-decoder | Text-to-text (all tasks as text generation) | Versatile: translation, summarization, Q&A |
Modern Foundation Models
The Transformer architecture has been scaled to create increasingly powerful models:
- GPT-4 / GPT-4o (OpenAI): Multimodal models that understand text and images, powering ChatGPT.
- Claude (Anthropic): Safety-focused models with long context windows (200K tokens).
- Gemini (Google): Multimodal models integrated with Google's ecosystem.
- LLaMA (Meta): Open-weight models that enabled open-source LLM development.
- Vision Transformers (ViT): Applying Transformers to images, competing with CNNs.
Using Transformers with Hugging Face
```python
from transformers import pipeline, AutoTokenizer, AutoModel

# Sentiment analysis with a pre-trained model
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers are amazing for NLP!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")
text = generator(
    "Deep learning has revolutionized",
    max_length=50,
    num_return_sequences=1
)
print(text[0]['generated_text'])

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
entities = ner("Anthropic released Claude in San Francisco")
print(entities)
# [{'entity_group': 'ORG', 'word': 'Anthropic', ...},
#  {'entity_group': 'PER', 'word': 'Claude', ...},
#  {'entity_group': 'LOC', 'word': 'San Francisco', ...}]

# Load any model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**tokens)
print(outputs.last_hidden_state.shape)
# torch.Size([1, 5, 768])  (batch, tokens, hidden_dim)
```
Use the pipeline API for quick tasks, then move to loading models and tokenizers directly for fine-tuning and custom architectures.