Introduction to AI Architectures

Explore the complete landscape of neural network architectures — from the 1958 perceptron to 2024's state space models. Understand how each architecture works, what problems it solves, and how to choose the right one for your project.

Why Architecture Matters

Every AI system is built on an architecture — a specific arrangement of layers, connections, and operations that determines what the model can learn and how efficiently it learns it. The architecture you choose has a greater impact on your system's capabilities than almost any other design decision.

A Transformer can process an entire document in parallel but struggles with infinite-length streams. A Recurrent Neural Network processes sequences one step at a time but can theoretically handle unbounded input. A Convolutional Neural Network excels at extracting spatial patterns from images but cannot natively model long-range text dependencies. Each architecture embodies fundamental tradeoffs between expressiveness, efficiency, and scalability.

Understanding these architectures is essential whether you are building AI systems, evaluating AI products, or reading research papers. This course gives you that understanding from the ground up.

💡
Key insight: You do not need to implement these architectures from scratch to use them effectively. But understanding how they work helps you make better decisions about which to use, how to configure them, and how to debug them when things go wrong.

Timeline of AI Architectures

The history of AI architectures is a story of increasingly powerful ways to process information. Each breakthrough architecture unlocked capabilities that were previously impossible.

1958 — Perceptron

Frank Rosenblatt's perceptron was the first trainable neural network: a single layer of weights that could learn linear decision boundaries. It sparked enormous excitement, then disappointment when Minsky and Papert showed it could not learn XOR.

1986 — Multi-Layer Perceptron

Backpropagation (Rumelhart, Hinton, Williams) made it possible to train networks with multiple hidden layers. MLPs could learn nonlinear functions, but they struggled with images, sequences, and anything requiring spatial or temporal structure.

1989 — CNN (LeNet)

Yann LeCun's Convolutional Neural Network introduced weight sharing and local connectivity for image processing. LeNet could recognize handwritten digits with remarkable accuracy, pioneering the approach that would later dominate computer vision.

1997 — LSTM

Hochreiter and Schmidhuber's Long Short-Term Memory solved the vanishing gradient problem in recurrent networks. LSTMs could remember information over hundreds of time steps, enabling breakthroughs in speech recognition and machine translation.

2012 — AlexNet (Deep CNN)

Krizhevsky, Sutskever, and Hinton's AlexNet won ImageNet by a massive margin using deep CNNs trained on GPUs. This single result reignited the entire field of deep learning and launched the modern AI era.

2014 — GAN

Goodfellow et al. introduced Generative Adversarial Networks: two networks (generator and discriminator) competing against each other to produce realistic synthetic data. GANs revolutionized image generation and remain influential today.

2014 — Seq2Seq + Attention

Bahdanau, Cho, and Bengio added attention mechanisms to encoder-decoder models, allowing the decoder to focus on relevant parts of the input. This breakthrough dramatically improved machine translation quality.

2017 — Transformer

Vaswani et al.'s "Attention Is All You Need" replaced recurrence entirely with self-attention. The Transformer enabled massive parallelization during training and became the foundation for GPT, BERT, T5, and virtually every modern language model.

2020 — Vision Transformer (ViT)

Dosovitskiy et al. proved that Transformers could match or exceed CNNs on image classification by treating images as sequences of patches. This began the Transformer's conquest of computer vision.

2020 — Diffusion Models

Ho et al.'s Denoising Diffusion Probabilistic Models showed that iteratively denoising random noise could generate high-quality images. Diffusion models now power DALL-E, Stable Diffusion, and Midjourney.

2023 — Mixture of Experts

While MoE existed earlier, models like Mixtral and (reportedly) GPT-4 brought sparse expert routing to the forefront. MoE enables trillion-parameter models that only activate a fraction of their weights per input.

2023–2024 — State Space Models

S4 and Mamba introduced structured state space models that process sequences in linear time (vs. quadratic for Transformers). SSMs are emerging as a serious alternative for long-context and streaming applications.

Complete Architecture Comparison

The following table compares all 12 architectures covered in this course across key dimensions. Use this as a quick reference when deciding which architecture to study or deploy.

| Architecture | Year | Key Innovation | Best For | Example Models |
| --- | --- | --- | --- | --- |
| Transformer | 2017 | Self-attention replaces recurrence | NLP, code, general-purpose AI | GPT-4, Claude, BERT, T5, LLaMA |
| CNN | 1989 | Convolution + weight sharing | Image classification, object detection | ResNet, EfficientNet, YOLO, VGG |
| RNN | 1986 | Recurrent connections for sequences | Time series, simple sequences | Vanilla RNN, Elman network |
| LSTM | 1997 | Gated memory cells | Speech, music, long sequences | LSTM, GRU, BiLSTM |
| Encoder-Decoder | 2014 | Compress then generate | Translation, summarization | T5, BART, mBART, Whisper |
| Attention Mechanisms | 2014 | Dynamic input weighting | Enhancing any architecture | Bahdanau, Luong, Multi-Head, Flash |
| Diffusion Models | 2020 | Iterative denoising | Image/video/audio generation | Stable Diffusion, DALL-E 3, Sora |
| GAN | 2014 | Adversarial training | Image synthesis, style transfer | StyleGAN3, CycleGAN, Pix2Pix |
| Mixture of Experts | 1991/2023 | Sparse expert routing | Scaling to massive models | Mixtral, Switch Transformer, GPT-4 |
| State Space Models | 2021 | Linear-time sequence modeling | Long sequences, streaming | Mamba, S4, Hyena, RWKV |
| Graph Neural Networks | 2005/2017 | Message passing on graphs | Molecules, social networks, KGs | GCN, GAT, GraphSAGE, GIN |
| Autoencoders / VAEs | 1986/2013 | Learned compression + generation | Dimensionality reduction, anomaly detection | VAE, VQ-VAE, Beta-VAE |

How to Choose an Architecture

Selecting the right architecture depends on your data type, task requirements, computational budget, and deployment constraints. Here is a practical decision framework:

💡
Rule of thumb: In 2025, start with a Transformer-based model for most tasks. Only switch to a specialized architecture when you have a clear reason — such as real-time image processing (CNN), streaming data (SSM), graph-structured data (GNN), or image generation (Diffusion).
| Your Data / Task | Recommended Architecture | Why |
| --- | --- | --- |
| Text generation, chatbots, code | Transformer (decoder-only) | Best autoregressive generation quality, massive pretraining available |
| Text understanding, classification | Transformer (encoder-only) | Bidirectional context captures meaning better than left-to-right models |
| Translation, summarization | Transformer (encoder-decoder) | Separate encoding and decoding stages are ideal for seq2seq tasks |
| Image classification | CNN or Vision Transformer | CNNs are faster with less data; ViTs win with large datasets |
| Object detection, real-time | CNN (YOLO, EfficientDet) | Optimized for speed and deployable on edge devices |
| Image generation | Diffusion Model | Best quality and controllability for image synthesis in 2025 |
| Style transfer, image-to-image | GAN (CycleGAN, Pix2Pix) | Fast single-pass generation for paired/unpaired image translation |
| Very long sequences (100K+ tokens) | State Space Model (Mamba) | Linear-time complexity avoids quadratic attention bottleneck |
| Molecular data, social graphs | Graph Neural Network | Native support for non-Euclidean, relational data structures |
| Anomaly detection, compression | Autoencoder / VAE | Learns compact representations, flags out-of-distribution inputs |
| Massive model, limited compute | Mixture of Experts | Only activates a subset of parameters per input, reducing FLOPs |
| Time series, sensor data | LSTM / GRU or SSM | Sequential processing with memory; SSMs are faster for long series |

The Building Blocks of Neural Architectures

Despite their diversity, all neural network architectures share a common set of fundamental building blocks. Understanding these components gives you a foundation for understanding any architecture you encounter.

Layers

Every neural network is composed of layers — functions that transform their input into an output. The most common types include:

  • Linear (Dense/Fully Connected): Multiplies input by a weight matrix and adds a bias. The most basic learnable transformation: y = Wx + b.
  • Convolutional: Applies a sliding filter across the input, extracting local patterns. Used extensively in CNNs for spatial data.
  • Recurrent: Maintains a hidden state that is updated at each time step, allowing the layer to process sequential data with memory.
  • Attention: Computes weighted relationships between all elements in a sequence, enabling each element to attend to every other element.
  • Embedding: Maps discrete tokens (words, pixel patches, graph nodes) into continuous vector spaces where similar items are close together.
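To make the attention layer above concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside Transformers. The function name and shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (n, d_k) -- one query/key/value per element.
    Each output row is a weighted mix of the value rows, with weights
    given by how strongly that query matches each key.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V
```

Note the key property: with all-zero queries, every key matches equally, so each output row is simply the average of the value rows.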

Activation Functions

Activation functions introduce nonlinearity, enabling neural networks to learn complex patterns beyond simple linear relationships.

| Activation | Formula | Range | Common Use |
| --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Default for hidden layers in CNNs and MLPs |
| GELU | x · Φ(x) | (-0.17, ∞) | Default in Transformers (GPT, BERT) |
| SiLU / Swish | x · sigmoid(x) | (-0.28, ∞) | Modern CNNs, EfficientNet, LLaMA |
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | Gates in LSTMs, binary outputs |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | RNN hidden states, value normalization |
| Softmax | e^(x_i) / Σ_j e^(x_j) | (0, 1), sums to 1 | Output layer for classification, attention weights |
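These formulas translate almost line-for-line into code. A minimal NumPy sketch (the function names are ours; deep learning frameworks ship their own tuned versions):

```python
import numpy as np
from math import erf, sqrt

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()
```

For example, `softmax` always returns positive values that sum to 1, which is exactly why it doubles as both a classification output layer and the attention-weight normalizer.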

Normalization

Normalization techniques stabilize and accelerate training by controlling the distribution of activations within the network.

  • Batch Normalization: Normalizes across the batch dimension. Dominant in CNNs. Depends on batch statistics, which can be problematic for small batches or inference.
  • Layer Normalization: Normalizes across the feature dimension for each individual sample. The standard in Transformers because it is independent of batch size.
  • RMSNorm: A simplified version of Layer Normalization that only uses the root mean square (no mean subtraction). Used in LLaMA and other efficient Transformer variants.
  • Group Normalization: Normalizes across groups of channels. Used in CNNs when batch sizes are small (e.g., object detection, segmentation).
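The difference between Layer Normalization and RMSNorm is easiest to see side by side. A minimal NumPy sketch (without the learnable scale/shift parameters that real implementations add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample across its feature dimension:
    # subtract the mean, divide by the standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm skips the mean subtraction and divides by the
    # root mean square only -- cheaper, and empirically sufficient.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms
```

After `layer_norm`, each sample has roughly zero mean and unit variance; after `rms_norm`, each sample has unit RMS but its mean is left untouched.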

Skip Connections (Residual Connections)

Introduced in ResNet (2015), skip connections add the input of a layer directly to its output: output = F(x) + x. This seemingly simple modification was revolutionary because it allows gradients to flow directly through the network during backpropagation, enabling training of networks with hundreds or even thousands of layers.

Skip connections are now ubiquitous. Every Transformer block uses them around both the attention and feed-forward layers. Every modern CNN uses them. They are the single most important architectural innovation for training deep networks.

💡
Why skip connections work: Without skip connections, gradients must pass through every layer's weights during backpropagation, leading to vanishing or exploding gradients. With skip connections, gradients have a "highway" that bypasses intermediate layers, ensuring that even the earliest layers in a deep network receive meaningful gradient signals.
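The pattern itself is a one-liner. In this toy sketch, `dead_layer` is a hypothetical inner transform that contributes nothing, standing in for F; the identity path still carries the input through unchanged, which is the intuition behind the gradient "highway":

```python
import numpy as np

def residual_block(x, f):
    """A residual connection: output = F(x) + x."""
    return f(x) + x

# Even if the inner transform F outputs all zeros (e.g. freshly
# zero-initialized weights), the block still passes x through intact.
dead_layer = lambda x: np.zeros_like(x)
x = np.array([1.0, 2.0, 3.0])
out = residual_block(x, dead_layer)
```

In calculus terms, the derivative of F(x) + x with respect to x is F′(x) + 1, so the gradient through the block never shrinks to zero even when F′(x) does.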

Dropout and Regularization

Regularization techniques prevent overfitting — when a model memorizes training data instead of learning general patterns.

  • Dropout: Randomly sets a fraction of activations to zero during training, forcing the network to learn redundant representations. Standard rate: 0.1 for Transformers, 0.5 for older CNNs.
  • Weight Decay: Adds a penalty proportional to the magnitude of weights, encouraging smaller weight values and smoother decision boundaries.
  • Data Augmentation: Artificially expands the training set by applying transformations (rotation, cropping, noise) to existing data.
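Dropout can be sketched in a few lines. This is the "inverted" variant commonly used in practice: survivors are scaled up by 1/(1-p) at training time so the expected activation is unchanged, and inference needs no adjustment. The function name and `rng` parameter are illustrative:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: zero a fraction p of activations during
    training and scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x  # at inference time, dropout is a no-op
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)
```

With p = 0.5 on an all-ones input, every surviving activation becomes 2.0 and the rest become 0.0, keeping the expected value at 1.0.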

Positional Information

Some architectures (especially Transformers) process all input elements simultaneously and therefore have no inherent notion of order. Positional encodings inject sequence position information:

  • Sinusoidal Encoding: Uses sine and cosine functions of different frequencies (original Transformer).
  • Learned Positional Embeddings: Trainable vectors for each position (BERT, GPT-2).
  • Rotary Position Embedding (RoPE): Encodes relative position by rotating query and key vectors. Used in LLaMA, Mistral, and most modern LLMs because it generalizes better to unseen sequence lengths.
  • ALiBi: Adds a linear bias to attention scores based on distance. Used in BLOOM and some efficient Transformers.
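The sinusoidal scheme from the original Transformer paper is simple enough to sketch directly: even feature indices get sines, odd indices get cosines, at geometrically spaced frequencies. A minimal NumPy version (assuming an even `d_model`):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even feature indices
    angles = pos / (10000.0 ** (i / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

The resulting matrix is added to the token embeddings; because every entry lies in [-1, 1] and each position gets a unique pattern, the model can recover both absolute and relative order.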

Architecture Complexity at a Glance

| Architecture | Time Complexity | Memory | Parallelizable | Training Difficulty |
| --- | --- | --- | --- | --- |
| MLP | O(n) | Low | Yes | Easy |
| CNN | O(n · k) | Medium | Yes | Moderate |
| RNN/LSTM | O(n) | Medium | No (sequential) | Hard (gradients) |
| Transformer | O(n²) | High | Yes | Moderate (with tricks) |
| SSM (Mamba) | O(n) | Low-Medium | Yes (with scan) | Moderate |
| GAN | O(n) | Medium | Yes | Hard (stability) |
| Diffusion | O(n · T) | High | Per step: Yes | Moderate |
| GNN | O(V + E) | Varies | Partially | Moderate |

What Is Next

With this foundation in place, you are ready to dive deep into each architecture individually. In the next lesson, we start with the most important architecture in modern AI: the Transformer. You will learn exactly how self-attention works, why it replaced recurrence, and how the same basic architecture powers everything from GPT-4 to protein folding models.