Intermediate

Attention Is All You Need

A comprehensive guide to the "Attention Is All You Need" paper, part of a deep dive into the transformer architecture.

The Paper That Changed Everything

Published in 2017 by Vaswani et al. at Google Brain, "Attention Is All You Need" introduced the Transformer architecture and fundamentally changed the landscape of deep learning. Before transformers, recurrent neural networks (RNNs) and their variants (LSTM, GRU) dominated sequence modeling tasks like machine translation, text generation, and speech recognition. The transformer replaced recurrence entirely with attention mechanisms, enabling dramatically better parallelization and scaling.

The key insight was that attention mechanisms alone, without any recurrence or convolution, could model dependencies between all positions in a sequence simultaneously. This parallel processing capability meant that transformers could leverage modern GPU hardware far more efficiently than RNNs, which process tokens sequentially.

Why RNNs Were Not Enough

RNNs process sequences one token at a time, maintaining a hidden state that theoretically captures information from all previous tokens. This sequential nature creates three fundamental problems:

  • Sequential computation bottleneck — Each token depends on the previous token's hidden state, preventing parallel processing across the sequence. Training on long sequences is slow because the computation cannot be distributed across GPU cores.
  • Long-range dependency degradation — Despite improvements from LSTM and GRU gates, information from early tokens still degrades as it passes through many sequential steps. A token at position 500 has difficulty attending to a token at position 10.
  • Memory constraints — The fixed-size hidden state must compress all relevant information from the entire history, creating an information bottleneck.

The Attention Solution

Attention mechanisms allow every token to directly attend to every other token in the sequence, regardless of distance. The computational path between any two tokens is O(1) instead of O(n), and the entire attention computation can be performed in parallel using matrix operations that GPUs excel at.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Core attention mechanism from the Transformer paper."""
    d_k = Q.size(-1)
    # Compute attention scores
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    # Apply mask (for decoder self-attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax to get attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
💡
Key formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. The scaling factor sqrt(d_k) prevents the dot products from growing too large, which would push the softmax into regions with vanishingly small gradients.
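To see the mechanism end to end, here is a small self-contained run of the function above (reproduced so the snippet executes on its own) on random tensors; the batch size, sequence length, and dimension are illustrative:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Core attention mechanism from the Transformer paper."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights

# Toy batch: 2 sequences, 5 tokens each, dimension 8 per token.
torch.manual_seed(0)
Q = torch.randn(2, 5, 8)
K = torch.randn(2, 5, 8)
V = torch.randn(2, 5, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)     # torch.Size([2, 5, 8]) — one context vector per token
print(weights.shape)    # torch.Size([2, 5, 5]) — each token attends to all 5 tokens
print(weights.sum(-1))  # every row sums to 1 after the softmax
```

Note that the attention-weight matrix is (seq_len × seq_len) per sequence: this is where the quadratic cost discussed later comes from.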

The Transformer Architecture Overview

The original transformer follows an encoder-decoder structure designed for sequence-to-sequence tasks like machine translation:

Encoder Stack

  1. Input embedding — Convert input tokens to dense vectors
  2. Positional encoding — Add position information (since there is no recurrence)
  3. N encoder layers, each containing:
    • Multi-head self-attention sublayer
    • Position-wise feed-forward network sublayer
    • Residual connections and layer normalization around each sublayer
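The positional encoding step above can be sketched with the sinusoidal scheme from the paper, where PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)); the sizes below are illustrative:

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Sinusoidal position table: even columns use sin, odd columns use cos."""
    position = torch.arange(max_len).unsqueeze(1).float()          # (max_len, 1)
    # 1 / 10000^(2i/d_model) for each even index 2i
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # torch.Size([50, 16])
# In the model, this table is simply added to the token embeddings:
# x = embeddings + pe[:seq_len]
```

Because each position gets a unique, smoothly varying pattern, the attention layers can recover relative order even though nothing else in the architecture is order-aware.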

Decoder Stack

  1. Output embedding + positional encoding
  2. N decoder layers, each containing:
    • Masked multi-head self-attention (prevents attending to future tokens)
    • Multi-head cross-attention (attends to encoder output)
    • Position-wise feed-forward network
    • Residual connections and layer normalization
  3. Linear layer + softmax for output probabilities
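The masked self-attention in step 2 uses a causal (lower-triangular) mask so position i can only attend to positions 0..i. A minimal sketch, compatible with the `mask` argument of the attention function shown earlier:

```python
import torch

def causal_mask(size):
    """Lower-triangular boolean mask: True where attention is allowed."""
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

mask = causal_mask(4)
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
```

Positions where the mask is 0 are filled with -inf before the softmax, so their attention weights become exactly zero and no information leaks from future tokens.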

Impact and Legacy

The transformer's impact extends far beyond machine translation. It spawned BERT (encoder-only), GPT (decoder-only), T5 (encoder-decoder), and ultimately the large language models (LLMs) that power modern AI applications. Vision Transformers (ViT) brought the architecture to computer vision. The transformer has become the universal architecture for deep learning, applicable to text, images, audio, video, protein sequences, and more.

Key Architectural Innovations

  • Self-attention — Enables modeling relationships between all positions in a sequence
  • Multi-head attention — Allows the model to attend to information from different representation subspaces
  • Positional encoding — Injects sequence order information without recurrence
  • Residual connections — Enable training of very deep networks by providing gradient shortcuts
  • Layer normalization — Stabilizes training by normalizing activations
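The multi-head idea above can be sketched as follows: project the input, split the model dimension into heads, run scaled dot-product attention per head, then recombine. This is a simplified self-attention-only version (no mask or dropout), and the class and dimension choices are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch: project, split into heads, attend per head, recombine."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = Q @ K.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = scores.softmax(dim=-1)
        # Concatenate heads back into d_model, then apply the output projection
        out = (weights @ V).transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)

mha = MultiHeadSelfAttention(d_model=32, num_heads=4)
x = torch.randn(2, 10, 32)
print(mha(x).shape)  # torch.Size([2, 10, 32])
```

Each head attends over a smaller d_head-dimensional subspace, which is what lets different heads specialize in different relationships (syntax, coreference, position, etc.).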

⚠️
Scaling challenge: Self-attention has O(n^2) time and memory complexity with respect to sequence length. For a sequence of 1000 tokens, the attention matrix has 1,000,000 entries. This quadratic scaling is the primary limitation driving research into efficient attention variants.
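The quadratic growth is easy to see by counting attention-matrix entries; assuming float32 (4 bytes) and a single head, a quadrupled sequence length costs 16x the memory:

```python
# n x n attention entries per head; memory grows quadratically with length.
for n in [1_000, 4_000, 16_000]:
    entries = n * n
    mb = entries * 4 / 1e6  # float32 bytes -> MB
    print(f"n={n:>6}: {entries:>12,} entries, ~{mb:,.0f} MB per head")
```

In a real model this is multiplied by the number of heads, layers, and batch size, which is why long-context attention quickly becomes the dominant memory cost.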

In the following lessons, we will dissect each component of the transformer in detail, starting with the self-attention mechanism that makes it all possible.