Transformer Variants
An overview of the major transformer variants and the architectural innovations each one introduced.
The Transformer Family Tree
Since the original Transformer in 2017, hundreds of variants have been proposed, each optimizing for different objectives: longer context, faster inference, better training stability, or specialized domains. Understanding the major variants and their innovations helps architects choose the right foundation for their systems.
Encoder-Only Variants
BERT (2018)
Bidirectional Encoder Representations from Transformers introduced the masked language modeling (MLM) pre-training objective. By randomly masking 15% of input tokens and training the model to predict them, BERT learns deep bidirectional representations. BERT revolutionized NLP benchmarks and remains widely used for classification and embedding tasks.
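The masking scheme is simple enough to sketch. BERT selects 15% of tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged. The helper below is an illustrative implementation of that 80/10/10 rule, not the exact HuggingFace data collator; `mask_token_id` and `vocab_size` stand in for the tokenizer's real values.

```python
import torch

def mlm_corrupt(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Apply BERT-style MLM corruption: select ~15% of tokens, then
    80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mask_prob
    labels[~selected] = -100  # unselected positions are ignored by the loss

    corrupted = input_ids.clone()
    # 80% of selected tokens become [MASK]
    replace = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[replace] = mask_token_id
    # half of the remaining 20% become a random token; the rest stay as-is
    randomize = selected & ~replace & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomize] = torch.randint(vocab_size, input_ids.shape)[randomize]
    return corrupted, labels
```

The model is then trained to predict the original token at every position where the label is not -100.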
RoBERTa (2019)
A robustly optimized version of BERT that showed the original was significantly undertrained. Key changes: removed the next sentence prediction objective, trained on much more data, used dynamic masking, and trained with larger batches. RoBERTa improved on BERT across all benchmarks without any architectural changes.
DeBERTa (2020)
Introduced disentangled attention that separately represents content and position, plus an enhanced mask decoder for pre-training. DeBERTa achieved state-of-the-art results on SuperGLUE and remains one of the strongest encoder models.
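In standard attention, content and position information are mixed into a single vector before computing scores; DeBERTa keeps them separate. A simplified form of its attention score (omitting scaling and the position-to-position term, which the paper drops) is:

```latex
A_{ij} = H_i H_j^{\top} + H_i P_{j|i}^{\top} + P_{i|j} H_j^{\top}
```

where $H_i$ is the content vector of token $i$ and $P_{i|j}$ is its relative-position embedding with respect to token $j$. The three terms correspond to content-to-content, content-to-position, and position-to-content attention.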
# Using modern encoder models with HuggingFace
from transformers import AutoModel, AutoTokenizer
# BERT for embeddings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("AI architecture is fascinating", return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state # (1, seq_len, 768)
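For sentence-level embeddings, the per-token hidden states above are usually pooled into a single vector. A common recipe (a convention from sentence-embedding practice, not part of BERT itself) is mean pooling over the attention mask so padding tokens don't dilute the average:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over real tokens only
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts                        # (batch, hidden_dim)
```

Applied to the example above: `sentence_emb = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` yields one 768-dimensional vector per input sentence.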
Decoder-Only Variants
GPT Series
The GPT (Generative Pre-trained Transformer) series from OpenAI established decoder-only transformers as the dominant paradigm for language modeling. GPT-2 (1.5B parameters) showed emergent abilities at scale. GPT-3 (175B) demonstrated few-shot learning. GPT-4 pushed multimodal capabilities.
LLaMA Series
Meta's LLaMA models introduced several architectural improvements that became standard:
- RoPE for positional encoding instead of learned embeddings
- SwiGLU activation in the FFN instead of ReLU
- RMSNorm instead of LayerNorm for faster computation
- Grouped Query Attention for efficient KV caching
- Pre-normalization instead of post-normalization
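Two of these components are small enough to sketch directly. The definitions below follow the formulations popularized by LLaMA (RMSNorm drops LayerNorm's mean subtraction and bias; SwiGLU gates the FFN with a SiLU-activated branch); dimension names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Normalize by the root-mean-square of the features: no mean, no bias."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """FFN with a SiLU-gated linear unit in place of the ReLU MLP."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

RMSNorm is cheaper than LayerNorm because it skips the mean computation entirely, and SwiGLU's third projection is why LLaMA-style FFNs use a reduced d_ff (e.g. 8/3 × d_model) to keep parameter counts comparable.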
Mistral and Mixtral
Mistral 7B introduced sliding window attention for efficient long-context handling. Mixtral extended this with Mixture of Experts (MoE), where each token is routed to 2 of 8 expert FFN layers. This allows a model with 46B total parameters to use only 13B parameters per forward pass, achieving better quality per FLOP.
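A sliding-window attention mask restricts each query to the most recent `window` positions, on top of the usual causal constraint. A minimal sketch, with the window size as a free parameter:

```python
import torch

def sliding_window_mask(seq_len, window):
    # True where attention is allowed: causal AND within the last `window` keys
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (i - j < window)
```

The mask is applied to attention scores before the softmax. Because each layer widens the receptive field by one window, stacking L layers lets information propagate across roughly L × window tokens despite the local attention.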
Encoder-Decoder Variants
T5 (2019)
Text-to-Text Transfer Transformer frames every NLP task as text-to-text: classification becomes generating a label string, summarization becomes generating a shorter text, and translation maps between languages. This unified framework simplifies multi-task training.
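In practice this unification is just prompt formatting: every input is prefixed with a task string before tokenization, and the model emits the answer as plain text. The prefixes below match the conventions published in the T5 paper; the helper function itself is illustrative.

```python
def to_text2text(task, text):
    # Task prefixes from the T5 paper; the model generates the answer as text,
    # e.g. "acceptable"/"unacceptable" for the CoLA classification task
    prefixes = {
        "summarize": "summarize: ",
        "translate_en_de": "translate English to German: ",
        "cola": "cola sentence: ",
    }
    return prefixes[task] + text
```

For example, `to_text2text("translate_en_de", "The house is small.")` produces `"translate English to German: The house is small."`, which is fed to the model like any other sequence.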
BART (2019)
Combines BERT-style bidirectional encoding with GPT-style autoregressive decoding. Pre-trained by corrupting text with various noise functions and learning to reconstruct the original. Particularly strong for summarization and generation tasks.
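Two of BART's noise functions, token deletion and sentence permutation, can be sketched in a few lines. This is an illustrative corruption step under simplified assumptions, not the exact pre-training pipeline (which also uses text infilling, token masking, and document rotation).

```python
import random

def token_deletion(tokens, p=0.1, rng=random):
    # Drop each token with probability p; the decoder must restore them
    return [t for t in tokens if rng.random() >= p]

def sentence_permutation(sentences, rng=random):
    # Shuffle sentence order; the decoder must recover the original ordering
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled
```

Training then maximizes the likelihood of the original text given the corrupted version, which forces the decoder to learn both local reconstruction and document-level ordering.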
Efficient Transformer Variants
Several architectures address the quadratic attention complexity:
- Longformer — Combines local sliding window attention with global attention on specific tokens. Scales linearly with sequence length.
- BigBird — Uses random, window, and global attention patterns. Proven theoretically to be a universal approximator of sequence functions.
- Reformer — Uses locality-sensitive hashing to reduce attention from O(n^2) to O(n log n).
- Mamba — Not technically a transformer, but a selective state space model that processes sequences in linear time. Gaining traction as a transformer alternative for very long sequences.
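The Longformer pattern from the list above can be sketched as a boolean mask combining a symmetric local window with a few globally attending positions (e.g. a [CLS] token). This is an illustrative sketch of the attention pattern, not the library's actual implementation, which avoids materializing the full n×n mask.

```python
import torch

def longformer_mask(seq_len, window, global_idx):
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    local = (i - j).abs() <= window // 2      # symmetric local window
    g = torch.zeros(seq_len, dtype=torch.bool)
    g[list(global_idx)] = True
    # global tokens attend everywhere and are attended to by everyone
    return local | g.unsqueeze(0) | g.unsqueeze(1)
```

Because each non-global token attends to only O(window) positions, the cost is linear in sequence length rather than quadratic.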
# Mixture of Experts layer (simplified)
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.SiLU(),
                nn.Linear(d_ff, d_model),
            ) for _ in range(num_experts)
        ])
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (..., d_model); the gate scores decide which experts see each token
        gate_logits = self.gate(x)
        weights, indices = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        # Route each token to its top-k experts and sum the weighted outputs
        output = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_mask = (indices == i).any(dim=-1)  # tokens routed to expert i
            if token_mask.any():
                # gating weight of expert i per token (zero where not selected)
                w = (weights * (indices == i)).sum(dim=-1, keepdim=True)
                output[token_mask] += expert(x[token_mask]) * w[token_mask]
        return output
The final lesson in this course covers practical implementation of transformers from scratch.
Lilly Tech Systems