Introduction to AI Architectures

Explore the complete landscape of neural network architectures — from the 1958 perceptron to 2024's state space models. Understand how each architecture works, what problems it solves, and how to choose the right one for your project.

Why Architecture Matters

Every AI system is built on an architecture — a specific arrangement of layers, connections, and operations that determines what the model can learn and how efficiently it learns it. The architecture you choose has a greater impact on your system's capabilities than almost any other design decision.

A Transformer can process an entire document in parallel but struggles with infinite-length streams. A Recurrent Neural Network processes sequences one step at a time but can theoretically handle unbounded input. A Convolutional Neural Network excels at extracting spatial patterns from images but cannot natively model long-range text dependencies. Each architecture embodies fundamental tradeoffs between expressiveness, efficiency, and scalability.

Understanding these architectures is essential whether you are building AI systems, evaluating AI products, or reading research papers. This course gives you that understanding from the ground up.

💡
Key insight: You do not need to implement these architectures from scratch to use them effectively. But understanding how they work helps you make better decisions about which to use, how to configure them, and how to debug them when things go wrong.

Timeline of AI Architectures

The history of AI architectures is a story of increasingly powerful ways to process information. Each breakthrough architecture unlocked capabilities that were previously impossible.

1958 — Perceptron

Frank Rosenblatt's perceptron was the first trainable neural network: a single layer of weights that could learn linear decision boundaries. It sparked enormous excitement, then disappointment when Minsky and Papert showed it could not learn XOR.

1986 — Multi-Layer Perceptron

Backpropagation (Rumelhart, Hinton, Williams) made it possible to train networks with multiple hidden layers. MLPs could learn nonlinear functions, but they struggled with images, sequences, and anything requiring spatial or temporal structure.

1989 — CNN (LeNet)

Yann LeCun's Convolutional Neural Network introduced weight sharing and local connectivity for image processing. LeNet could recognize handwritten digits with remarkable accuracy, pioneering the approach that would later dominate computer vision.

1997 — LSTM

Hochreiter and Schmidhuber's Long Short-Term Memory solved the vanishing gradient problem in recurrent networks. LSTMs could remember information over hundreds of time steps, enabling breakthroughs in speech recognition and machine translation.

2012 — AlexNet (Deep CNN)

Krizhevsky, Sutskever, and Hinton's AlexNet won ImageNet by a massive margin using deep CNNs trained on GPUs. This single result reignited the entire field of deep learning and launched the modern AI era.

2014 — GAN

Goodfellow et al. introduced Generative Adversarial Networks: two networks (generator and discriminator) competing against each other to produce realistic synthetic data. GANs revolutionized image generation and remain influential today.

2014 — Seq2Seq + Attention

Bahdanau, Cho, and Bengio added attention mechanisms to encoder-decoder models, allowing the decoder to focus on relevant parts of the input. This breakthrough dramatically improved machine translation quality.

2017 — Transformer

Vaswani et al.'s "Attention Is All You Need" replaced recurrence entirely with self-attention. The Transformer enabled massive parallelization during training and became the foundation for GPT, BERT, T5, and virtually every modern language model.

2020 — Vision Transformer (ViT)

Dosovitskiy et al. proved that Transformers could match or exceed CNNs on image classification by treating images as sequences of patches. This began the Transformer's conquest of computer vision.

2020 — Diffusion Models

Ho et al.'s Denoising Diffusion Probabilistic Models showed that iteratively denoising random noise could generate high-quality images. Diffusion models now power DALL-E, Stable Diffusion, and Midjourney.

2023 — Mixture of Experts

While MoE existed earlier, models like Mixtral and (reportedly) GPT-4 brought sparse expert routing to the forefront. MoE enables trillion-parameter models that only activate a fraction of their weights per input.

2023–2024 — State Space Models

S4 and Mamba introduced structured state space models that process sequences in linear time (vs. quadratic for Transformers). SSMs are emerging as a serious alternative for long-context and streaming applications.

Complete Architecture Comparison

The following table compares all 12 architectures covered in this course across key dimensions. Use this as a quick reference when deciding which architecture to study or deploy.

| Architecture | Year | Key Innovation | Best For | Example Models |
| --- | --- | --- | --- | --- |
| Transformer | 2017 | Self-attention replaces recurrence | NLP, code, general-purpose AI | GPT-4, Claude, BERT, T5, LLaMA |
| CNN | 1989 | Convolution + weight sharing | Image classification, object detection | ResNet, EfficientNet, YOLO, VGG |
| RNN | 1986 | Recurrent connections for sequences | Time series, simple sequences | Vanilla RNN, Elman network |
| LSTM | 1997 | Gated memory cells | Speech, music, long sequences | LSTM, GRU, BiLSTM |
| Encoder-Decoder | 2014 | Compress then generate | Translation, summarization | T5, BART, mBART, Whisper |
| Attention Mechanisms | 2014 | Dynamic input weighting | Enhancing any architecture | Bahdanau, Luong, Multi-Head, Flash |
| Diffusion Models | 2020 | Iterative denoising | Image/video/audio generation | Stable Diffusion, DALL-E 3, Sora |
| GAN | 2014 | Adversarial training | Image synthesis, style transfer | StyleGAN3, CycleGAN, Pix2Pix |
| Mixture of Experts | 1991/2023 | Sparse expert routing | Scaling to massive models | Mixtral, Switch Transformer, GPT-4 |
| State Space Models | 2021 | Linear-time sequence modeling | Long sequences, streaming | Mamba, S4, Hyena, RWKV |
| Graph Neural Networks | 2005/2017 | Message passing on graphs | Molecules, social networks, KGs | GCN, GAT, GraphSAGE, GIN |
| Autoencoders / VAEs | 1986/2013 | Learned compression + generation | Dimensionality reduction, anomaly detection | VAE, VQ-VAE, Beta-VAE |

How to Choose an Architecture

Selecting the right architecture depends on your data type, task requirements, computational budget, and deployment constraints. Here is a practical decision framework:

💡
Rule of thumb: In 2025, start with a Transformer-based model for most tasks. Only switch to a specialized architecture when you have a clear reason — such as real-time image processing (CNN), streaming data (SSM), graph-structured data (GNN), or image generation (Diffusion).
| Your Data / Task | Recommended Architecture | Why |
| --- | --- | --- |
| Text generation, chatbots, code | Transformer (decoder-only) | Best autoregressive generation quality, massive pretraining available |
| Text understanding, classification | Transformer (encoder-only) | Bidirectional context captures meaning better than left-to-right models |
| Translation, summarization | Transformer (encoder-decoder) | Separate encoding and decoding stages are ideal for seq2seq tasks |
| Image classification | CNN or Vision Transformer | CNNs are faster with less data; ViTs win with large datasets |
| Object detection, real-time | CNN (YOLO, EfficientDet) | Optimized for speed and deployable on edge devices |
| Image generation | Diffusion Model | Best quality and controllability for image synthesis in 2025 |
| Style transfer, image-to-image | GAN (CycleGAN, Pix2Pix) | Fast single-pass generation for paired/unpaired image translation |
| Very long sequences (100K+ tokens) | State Space Model (Mamba) | Linear-time complexity avoids quadratic attention bottleneck |
| Molecular data, social graphs | Graph Neural Network | Native support for non-Euclidean, relational data structures |
| Anomaly detection, compression | Autoencoder / VAE | Learns compact representations, flags out-of-distribution inputs |
| Massive model, limited compute | Mixture of Experts | Only activates a subset of parameters per input, reducing FLOPs |
| Time series, sensor data | LSTM / GRU or SSM | Sequential processing with memory; SSMs are faster for long series |

The Building Blocks of Neural Architectures

Despite their diversity, all neural network architectures share a common set of fundamental building blocks. Understanding these components gives you a foundation for understanding any architecture you encounter.

Layers

Every neural network is composed of layers — functions that transform their input into an output. The most common types include:

  • Linear (Dense/Fully Connected): Multiplies input by a weight matrix and adds a bias. The most basic learnable transformation: y = Wx + b.
  • Convolutional: Applies a sliding filter across the input, extracting local patterns. Used extensively in CNNs for spatial data.
  • Recurrent: Maintains a hidden state that is updated at each time step, allowing the layer to process sequential data with memory.
  • Attention: Computes weighted relationships between all elements in a sequence, enabling each element to attend to every other element.
  • Embedding: Maps discrete tokens (words, pixel patches, graph nodes) into continuous vector spaces where similar items are close together.
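To make the attention layer above concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside Transformers. The function name and shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (n, d_k) -- one query/key/value per element.
    Each output row is a weighted mix of the value rows, with weights
    given by how strongly that query matches each key.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V
```

Note the key property: with all-zero queries, every key matches equally, so each output row is simply the average of the value rows.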

Activation Functions

Activation functions introduce nonlinearity, enabling neural networks to learn complex patterns beyond simple linear relationships.

| Activation | Formula | Range | Common Use |
| --- | --- | --- | --- |
| ReLU | max(0, x) | [0, ∞) | Default for hidden layers in CNNs and MLPs |
| GELU | x · Φ(x) | (-0.17, ∞) | Default in Transformers (GPT, BERT) |
| SiLU / Swish | x · sigmoid(x) | (-0.28, ∞) | Modern CNNs, EfficientNet, LLaMA |
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | Gates in LSTMs, binary outputs |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | RNN hidden states, value normalization |
| Softmax | e^(x_i) / Σ_j e^(x_j) | (0, 1), sums to 1 | Output layer for classification, attention weights |
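These formulas translate almost line-for-line into code. A minimal NumPy sketch (the function names are ours; deep learning frameworks ship their own tuned versions):

```python
import numpy as np
from math import erf, sqrt

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()
```

For example, `softmax` always returns positive values that sum to 1, which is exactly why it doubles as both a classification output layer and the attention-weight normalizer.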

Normalization

Normalization techniques stabilize and accelerate training by controlling the distribution of activations within the network.

  • Batch Normalization: Normalizes across the batch dimension. Dominant in CNNs. Depends on batch statistics, which can be problematic for small batches or inference.
  • Layer Normalization: Normalizes across the feature dimension for each individual sample. The standard in Transformers because it is independent of batch size.
  • RMSNorm: A simplified version of Layer Normalization that only uses the root mean square (no mean subtraction). Used in LLaMA and other efficient Transformer variants.
  • Group Normalization: Normalizes across groups of channels. Used in CNNs when batch sizes are small (e.g., object detection, segmentation).
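The difference between Layer Normalization and RMSNorm is easiest to see side by side. A minimal NumPy sketch (without the learnable scale/shift parameters that real implementations add):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample across its feature dimension:
    # subtract the mean, divide by the standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm skips the mean subtraction and divides by the
    # root mean square only -- cheaper, and empirically sufficient.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms
```

After `layer_norm`, each sample has roughly zero mean and unit variance; after `rms_norm`, each sample has unit RMS but its mean is left untouched.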

Skip Connections (Residual Connections)

Introduced in ResNet (2015), skip connections add the input of a layer directly to its output: output = F(x) + x. This seemingly simple modification was revolutionary because it allows gradients to flow directly through the network during backpropagation, enabling training of networks with hundreds or even thousands of layers.

Skip connections are now ubiquitous. Every Transformer block uses them around both the attention and feed-forward layers. Every modern CNN uses them. They are the single most important architectural innovation for training deep networks.

💡
Why skip connections work: Without skip connections, gradients must pass through every layer's weights during backpropagation, leading to vanishing or exploding gradients. With skip connections, gradients have a "highway" that bypasses intermediate layers, ensuring that even the earliest layers in a deep network receive meaningful gradient signals.
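The pattern itself is a one-liner. In this toy sketch, `dead_layer` is a hypothetical inner transform that contributes nothing, standing in for F; the identity path still carries the input through unchanged, which is the intuition behind the gradient "highway":

```python
import numpy as np

def residual_block(x, f):
    """A residual connection: output = F(x) + x."""
    return f(x) + x

# Even if the inner transform F outputs all zeros (e.g. freshly
# zero-initialized weights), the block still passes x through intact.
dead_layer = lambda x: np.zeros_like(x)
x = np.array([1.0, 2.0, 3.0])
out = residual_block(x, dead_layer)
```

In calculus terms, the derivative of F(x) + x with respect to x is F′(x) + 1, so the gradient through the block never shrinks to zero even when F′(x) does.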

Dropout and Regularization

Regularization techniques prevent overfitting — when a model memorizes training data instead of learning general patterns.

  • Dropout: Randomly sets a fraction of activations to zero during training, forcing the network to learn redundant representations. Standard rate: 0.1 for Transformers, 0.5 for older CNNs.
  • Weight Decay: Adds a penalty proportional to the magnitude of weights, encouraging smaller weight values and smoother decision boundaries.
  • Data Augmentation: Artificially expands the training set by applying transformations (rotation, cropping, noise) to existing data.
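Dropout can be sketched in a few lines. This is the "inverted" variant commonly used in practice: survivors are scaled up by 1/(1-p) at training time so the expected activation is unchanged, and inference needs no adjustment. The function name and `rng` parameter are illustrative:

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Inverted dropout: zero a fraction p of activations during
    training and scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return x  # at inference time, dropout is a no-op
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p  # keep each unit with probability 1-p
    return x * mask / (1.0 - p)
```

With p = 0.5 on an all-ones input, every surviving activation becomes 2.0 and the rest become 0.0, keeping the expected value at 1.0.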

Positional Information

Some architectures (especially Transformers) process all input elements simultaneously and therefore have no inherent notion of order. Positional encodings inject sequence position information:

  • Sinusoidal Encoding: Uses sine and cosine functions of different frequencies (original Transformer).
  • Learned Positional Embeddings: Trainable vectors for each position (BERT, GPT-2).
  • Rotary Position Embedding (RoPE): Encodes relative position by rotating query and key vectors. Used in LLaMA, Mistral, and most modern LLMs because it generalizes better to unseen sequence lengths.
  • ALiBi: Adds a linear bias to attention scores based on distance. Used in BLOOM and some efficient Transformers.
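The sinusoidal scheme from the original Transformer paper is simple enough to sketch directly: even feature indices get sines, odd indices get cosines, at geometrically spaced frequencies. A minimal NumPy version (assuming an even `d_model`):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even feature indices
    angles = pos / (10000.0 ** (i / d_model))  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

The resulting matrix is added to the token embeddings; because every entry lies in [-1, 1] and each position gets a unique pattern, the model can recover both absolute and relative order.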

Architecture Complexity at a Glance

| Architecture | Time Complexity | Memory | Parallelizable | Training Difficulty |
| --- | --- | --- | --- | --- |
| MLP | O(n) | Low | Yes | Easy |
| CNN | O(n · k) | Medium | Yes | Moderate |
| RNN/LSTM | O(n) | Medium | No (sequential) | Hard (gradients) |
| Transformer | O(n²) | High | Yes | Moderate (with tricks) |
| SSM (Mamba) | O(n) | Low-Medium | Yes (with scan) | Moderate |
| GAN | O(n) | Medium | Yes | Hard (stability) |
| Diffusion | O(n · T) | High | Per step: Yes | Moderate |
| GNN | O(V + E) | Varies | Partially | Moderate |

What Is Next

With this foundation in place, you are ready to dive deep into each architecture individually. In the next lesson, we start with the most important architecture in modern AI: the Transformer. You will learn exactly how self-attention works, why it replaced recurrence, and how the same basic architecture powers everything from GPT-4 to protein folding models.