Neural Networks & Deep Learning (14)
Layered architectures for complex pattern recognition
Neural networks are computing systems inspired by biological neural networks. Deep learning uses networks with many layers to learn hierarchical representations. These 14 architectures form the backbone of modern AI, from image recognition to language models.
Quick Reference Table
| Architecture | Data Type | Key Innovation | Primary Use |
|---|---|---|---|
| ANN | Tabular | Universal approximation | General function approximation |
| Feedforward NN | Tabular | Layer-by-layer forward pass | Classification, regression |
| MLP | Tabular | Multiple hidden layers | Structured data, embeddings |
| CNN | Images/spatial | Convolutional filters, weight sharing | Image classification, object detection |
| RNN | Sequential | Recurrent connections | Time series, text |
| LSTM | Sequential | Gate mechanism, long-term memory | Long sequences, NLP |
| GRU | Sequential | Simplified gates (vs LSTM) | Sequence modeling (lighter) |
| Transformer | Sequential | Self-attention, parallelizable | NLP, vision, multimodal |
| GNN | Graphs | Message passing on graphs | Social networks, molecules |
| GCN | Graphs | Spectral graph convolutions | Node classification |
| GAT | Graphs | Attention on graph neighbors | Node and graph classification |
| Autoencoder | Any | Encode-decode bottleneck | Compression, denoising |
| VAE | Any | Probabilistic latent space | Generation, representation learning |
| GAN | Any | Adversarial training (generator vs discriminator) | Image generation, style transfer |
1. Artificial Neural Network (ANN)
Architecture: The general term for any network of interconnected artificial neurons organized in layers. Input layer receives data, hidden layers process it through weighted connections and activation functions, output layer produces predictions.
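Key Innovation: Universal approximation theorem -- a network with a single hidden layer, a suitable non-linear activation, and enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy. The theorem guarantees that such a network exists, not that training will find it.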
Use Cases: Function approximation, classification, regression -- the foundation for all neural network architectures.
import torch
import torch.nn as nn

class ANN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.layer2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.activation(self.layer1(x))
        return self.layer2(x)

model = ANN(input_dim=10, hidden_dim=64, output_dim=1)
print(f"Parameters: {sum(p.numel() for p in model.parameters())}")
2. Feedforward Neural Network
Architecture: Data flows in one direction -- from input to output -- with no cycles or loops. Each neuron in one layer is connected to every neuron in the next layer (fully connected / dense layers).
Key Innovation: The backpropagation algorithm (popularized in 1986) made training feedforward networks practical by efficiently computing gradients with the chain rule.
Use Cases: Classification, regression, function approximation for structured data.
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42
)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.4f}")
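To make the role of backpropagation concrete, the following is a minimal sketch that computes the gradients of a tiny two-layer network by hand with the chain rule and checks them against PyTorch autograd. The layer sizes, the seed, and all variable names are illustrative assumptions, not part of the scikit-learn example above.

```python
import torch

torch.manual_seed(0)

# Tiny network: x -> W1 -> ReLU -> W2 -> mean squared error
x = torch.randn(4, 3)                        # 4 samples, 3 features
y = torch.randn(4, 1)                        # regression targets
W1 = torch.randn(3, 5, requires_grad=True)
W2 = torch.randn(5, 1, requires_grad=True)

# Forward pass
z1 = x @ W1                                  # pre-activation, shape (4, 5)
a1 = torch.relu(z1)                          # hidden activation
pred = a1 @ W2                               # output, shape (4, 1)
loss = ((pred - y) ** 2).mean()

# Backward pass via autograd
loss.backward()

# Backward pass by hand: apply the chain rule from the output back to the input
with torch.no_grad():
    d_pred = 2 * (pred - y) / pred.numel()   # dL/dpred
    grad_W2 = a1.T @ d_pred                  # dL/dW2
    d_a1 = d_pred @ W2.T                     # dL/da1
    d_z1 = d_a1 * (z1 > 0).float()           # ReLU passes gradient only where z1 > 0
    grad_W1 = x.T @ d_z1                     # dL/dW1

print(torch.allclose(grad_W1, W1.grad, atol=1e-5))
print(torch.allclose(grad_W2, W2.grad, atol=1e-5))
```

Each backward line reuses a quantity computed one step later in the network, which is why the cost of the backward pass is comparable to that of the forward pass.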
3. Multi-Layer Perceptron (MLP)
Architecture: A feedforward network with multiple hidden layers, non-linear activation functions, and full connectivity between layers. The workhorse of deep learning for tabular data.
Key Innovation: Depth -- stacking multiple layers allows learning hierarchical representations.
Use Cases: Tabular data, embeddings, as sub-components in larger architectures (e.g., the FFN in Transformers).
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes)  # Raw logits; pair with nn.CrossEntropyLoss
        )

    def forward(self, x):
        return self.model(x)

model = MLP(input_dim=30, num_classes=2)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
4. Convolutional Neural Network (CNN)
Architecture: Uses convolutional layers with learnable filters that slide across the input to detect local patterns (edges, textures, shapes). Pooling layers reduce spatial dimensions. Fully connected layers at the end combine features for classification.
Key Innovation: Weight sharing (same filter applied across entire input) and local connectivity drastically reduce parameters compared to fully connected networks.
Use Cases: Image classification, object detection, semantic segmentation, video analysis, medical imaging.
import torch
import torch.nn as nn

class CNN(nn.Module):
    """Small CNN for 1-channel 28x28 inputs (e.g., MNIST-sized images)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

model = CNN(num_classes=10)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
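The parameter savings from weight sharing can be checked with simple arithmetic. The sketch below compares the first convolution of the CNN above (1 input channel, 32 filters, 3x3 kernel) with a hypothetical fully connected layer that maps the same 28x28 input to the same 32x28x28 output. The helper names are illustrative.

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return out_ch * (in_ch * kernel * kernel) + out_ch

def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# First layer of the CNN above: 1x28x28 input -> 32x28x28 output
conv = conv2d_params(in_ch=1, out_ch=32, kernel=3)
dense = dense_params(in_features=1 * 28 * 28, out_features=32 * 28 * 28)

print(f"Conv layer:  {conv:,} parameters")    # 320
print(f"Dense layer: {dense:,} parameters")   # 19,694,080
```

The convolution's parameter count does not depend on the image size at all, only on the kernel size and channel counts.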
5. Recurrent Neural Network (RNN)
Architecture: Contains recurrent connections that form a directed cycle, allowing it to maintain a hidden state that captures information from previous time steps. At each step, the hidden state is updated based on the current input and the previous hidden state.
Key Innovation: Temporal processing -- same weights applied at each time step, enabling variable-length sequence processing.
Use Cases: Short sequences, speech recognition, language modeling (largely replaced by LSTM/Transformer for longer sequences).
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.rnn = nn.RNN(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x shape: (batch, seq_len, input_dim)
        output, hidden = self.rnn(x)
        # hidden shape: (num_layers, batch, hidden_dim); use the last layer's final state
        return self.fc(hidden[-1])

model = SimpleRNN(input_dim=10, hidden_dim=64, output_dim=2)
x = torch.randn(8, 20, 10)  # batch=8, seq=20, features=10
print(f"Output shape: {model(x).shape}")
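The hidden-state update described above can be written out by hand. As a minimal sketch, the loop below unrolls a single-layer `nn.RNN` step by step using its own weights and checks that the result matches the module's final hidden state. The sizes and seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(2, 5, 4)          # batch=2, seq_len=5, features=4

# Reference: let nn.RNN process the whole sequence
_, h_ref = rnn(x)                 # h_ref shape: (1, batch, hidden)

# Manual unrolling: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
h = torch.zeros(2, 3)             # initial hidden state
with torch.no_grad():
    for t in range(x.size(1)):
        h = torch.tanh(
            x[:, t] @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
            + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0
        )

print(torch.allclose(h, h_ref[0], atol=1e-5))
```

The same four tensors are reused at every time step, which is what lets the network handle sequences of any length.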
6. Long Short-Term Memory (LSTM)
Architecture: A special RNN with a cell state and three gates: forget gate (what to discard), input gate (what to store), and output gate (what to output). The cell state acts as a conveyor belt, allowing gradients to flow largely unchanged over long sequences.
Key Innovation: Mitigates the vanishing gradient problem that prevents standard RNNs from learning long-range dependencies.
Use Cases: Machine translation, speech recognition, time series forecasting, text generation, any long-sequence task.
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=2):
        super().__init__()
        # dropout is applied between stacked LSTM layers (requires num_layers > 1)
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        output, (hidden, cell) = self.lstm(x)
        return self.fc(hidden[-1])  # Last layer's hidden state

model = LSTMModel(input_dim=10, hidden_dim=128, output_dim=5, num_layers=2)
x = torch.randn(8, 50, 10)
print(f"Output shape: {model(x).shape}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
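To show what the three gates do numerically, here is a sketch that reproduces one step of `nn.LSTMCell` by hand. It relies on PyTorch's documented weight layout, in which the four blocks are stacked in the order input gate, forget gate, cell candidate, output gate. Sizes and seed are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=4, hidden_size=3)
x = torch.randn(2, 4)             # one time step for a batch of 2
h_prev = torch.randn(2, 3)        # previous hidden state
c_prev = torch.randn(2, 3)        # previous cell state

h_ref, c_ref = cell(x, (h_prev, c_prev))

with torch.no_grad():
    # All four blocks are computed with one matrix multiply, then split
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i, f, g, o = gates.chunk(4, dim=1)
    i = torch.sigmoid(i)          # input gate: what to store
    f = torch.sigmoid(f)          # forget gate: what to discard
    g = torch.tanh(g)             # candidate values for the cell state
    o = torch.sigmoid(o)          # output gate: what to expose
    c = f * c_prev + i * g        # additive cell-state update (the "conveyor belt")
    h = o * torch.tanh(c)

print(torch.allclose(h, h_ref, atol=1e-5), torch.allclose(c, c_ref, atol=1e-5))
```

The cell-state update is additive rather than a repeated matrix multiplication, which is the property that helps gradients survive over many steps.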
7. Gated Recurrent Unit (GRU)
Architecture: A simplified version of LSTM with only two gates: reset gate and update gate. Merges the cell state and hidden state into a single hidden state. Fewer parameters than LSTM but comparable performance on many tasks.
Key Innovation: Simpler gate mechanism achieves similar long-range dependency learning with fewer parameters and faster training.
Use Cases: The same tasks as LSTM, particularly when computational efficiency matters, such as smaller datasets and real-time applications.
import torch
import torch.nn as nn

class GRUModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, num_layers=num_layers,
                          batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        output, hidden = self.gru(x)
        return self.fc(hidden[-1])  # Last layer's hidden state

model = GRUModel(input_dim=10, hidden_dim=128, output_dim=5)
print(f"GRU params: {sum(p.numel() for p in model.parameters()):,}")
# Compare: GRU has ~25% fewer params than LSTM with the same hidden size
# (three weight blocks per layer instead of four)
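The roughly 25% figure follows directly from the number of weight blocks per layer. The sketch below counts parameters for a single recurrent layer using PyTorch's layout (an input weight matrix, a hidden weight matrix, and two bias vectors per block). The helper name is illustrative.

```python
def recurrent_layer_params(input_dim, hidden_dim, num_blocks):
    """Parameters of one recurrent layer with `num_blocks` gate/candidate blocks."""
    per_block = (hidden_dim * input_dim      # input-to-hidden weights
                 + hidden_dim * hidden_dim   # hidden-to-hidden weights
                 + 2 * hidden_dim)           # two bias vectors
    return num_blocks * per_block

lstm = recurrent_layer_params(10, 128, num_blocks=4)  # input, forget, candidate, output
gru = recurrent_layer_params(10, 128, num_blocks=3)   # reset, update, candidate

print(f"LSTM layer: {lstm:,}")        # 71,680
print(f"GRU layer:  {gru:,}")         # 53,760
print(f"Ratio: {gru / lstm:.2f}")     # 0.75
```

For full models the saving is slightly below 25%, because the final linear layer is the same size in both.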
8. Transformer
Architecture: Based on self-attention rather than recurrence or convolution. Multi-head attention computes relationships between all positions in parallel, followed by position-wise feedforward layers. Positional encodings supply token order. The original design is an encoder-decoder; many later models use only the encoder (BERT) or only the decoder (GPT).
Key Innovation: Self-attention processes entire sequences in parallel and captures long-range dependencies without the sequential bottleneck of RNNs. Powers GPT, BERT, T5, and most modern large language models.
Use Cases: NLP (GPT, BERT), computer vision (ViT), speech, protein folding, multimodal AI.
import torch
import torch.nn as nn

class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=128, nhead=8, num_layers=2, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Learned positional encoding; supports sequences of up to 512 tokens
        self.pos_encoding = nn.Parameter(torch.randn(1, 512, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=256,
            dropout=0.1, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x shape: (batch, seq_len) of token ids
        x = self.embedding(x) + self.pos_encoding[:, :x.size(1)]
        x = self.transformer(x)
        x = x.mean(dim=1)  # Global average pooling over the sequence
        return self.classifier(x)

model = TransformerClassifier(vocab_size=10000)
x = torch.randint(0, 10000, (8, 100))
print(f"Output shape: {model(x).shape}")
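The encoder layer above hides the attention computation inside `nn.TransformerEncoderLayer`. As a minimal sketch of what a single attention head does, the function below implements scaled dot-product self-attention directly. The projection matrices are random stand-ins and the dimensions are arbitrary; a real layer learns these weights and runs several heads in parallel.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    x: (batch, seq_len, d_model); W_q, W_k, W_v: (d_model, d_k).
    Returns the attended values and the attention weights.
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Similarity of every position with every other position, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)                   # each row sums to 1
    return weights @ V, weights

x = torch.randn(2, 5, 16)                                     # batch=2, seq=5, d_model=16
W_q, W_k, W_v = (torch.randn(16, 8) for _ in range(3))
out, weights = self_attention(x, W_q, W_k, W_v)

print(out.shape)       # torch.Size([2, 5, 8])
print(weights.shape)   # torch.Size([2, 5, 5])
```

Every position is compared with every other position in a single matrix product, so nothing has to be processed one step at a time.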
9. Graph Neural Network (GNN)
Architecture: Operates on graph-structured data (nodes and edges). Each node aggregates information from its neighbors through message passing, then updates its own representation. Multiple rounds of message passing allow information to flow across the graph.
Key Innovation: Extends deep learning to irregular, non-Euclidean data structures.
Use Cases: Social network analysis, molecular property prediction, recommendation systems, traffic prediction.
import torch
import torch.nn as nn

# In practice, PyTorch Geometric (PyG) provides ready-made layers:
# from torch_geometric.nn import GCNConv, global_mean_pool
# from torch_geometric.data import Data

# Conceptual GNN layer using a dense adjacency matrix:
class SimpleGNNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x, adjacency):
        # x: (num_nodes, in_features); adjacency: (num_nodes, num_nodes)
        # Message passing: aggregate neighbor features
        aggregated = torch.matmul(adjacency, x)  # Sum of neighbor features
        return torch.relu(self.linear(aggregated))

layer = SimpleGNNLayer(16, 32)
print("GNN: learns on graph-structured data via message passing")
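To see how repeated message passing spreads information, here is a small worked example on a three-node path graph (0 - 1 - 2). The feature values are chosen so that each node's contribution is easy to read off. No learned weights are involved; this isolates the aggregation step only.

```python
import torch

# Path graph 0 - 1 - 2 as a dense adjacency matrix (no self-loops)
adjacency = torch.tensor([[0., 1., 0.],
                          [1., 0., 1.],
                          [0., 1., 0.]])
x = torch.tensor([[1.], [10.], [100.]])   # one feature per node

one_hop = adjacency @ x           # each node sums its direct neighbors
two_hop = adjacency @ one_hop     # a second round reaches nodes two edges away

print(one_hop.flatten().tolist())  # [10.0, 101.0, 10.0]
print(two_hop.flatten().tolist())  # [101.0, 20.0, 101.0]
```

After one round, node 0 only sees node 1. After two rounds, its value includes the 100 that originated at node 2, even though the two nodes are not directly connected.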
10. Graph Convolutional Network (GCN)
Architecture: Applies convolution-like operations on graphs using spectral graph theory. The key operation is $H' = \sigma(\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H W)$, where $\hat{A} = A + I$ is the adjacency matrix with self-loops and $\hat{D}$ is the degree matrix of $\hat{A}$.
Key Innovation: Efficient spectral convolutions on graphs using first-order Chebyshev polynomial approximation.
Use Cases: Node classification (citation networks), semi-supervised learning on graphs, traffic networks.
import torch
import torch.nn as nn

# In practice, use GCNConv from PyTorch Geometric:
# import torch_geometric.nn as gnn

class GCNClassifier(nn.Module):
    """Conceptual GCN for node classification."""
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1_weight = nn.Linear(in_channels, hidden_channels)
        self.conv2_weight = nn.Linear(hidden_channels, out_channels)

    def forward(self, x, adj_norm):
        # adj_norm: normalized adjacency with self-loops, shape (num_nodes, num_nodes)
        # Layer 1: A_norm @ X @ W1
        x = torch.relu(adj_norm @ self.conv1_weight(x))
        # Layer 2: A_norm @ H1 @ W2
        x = adj_norm @ self.conv2_weight(x)
        return x  # Per-node logits

print("GCN: spectral convolutions on graph-structured data")
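The classifier above takes `adj_norm` as given. A minimal sketch of how that matrix could be built from a dense adjacency matrix, following the symmetric normalization in the formula, is shown below. The function name is illustrative, and large graphs would normally use sparse operations instead.

```python
import torch

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    a_hat = adj + torch.eye(adj.size(0))   # add self-loops
    deg = a_hat.sum(dim=1)                 # node degrees, counting the self-loop
    d_inv_sqrt = deg.pow(-0.5)
    # Scale row i by deg_i^{-1/2} and column j by deg_j^{-1/2}
    return d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)

# Path graph 0 - 1 - 2
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
adj_norm = normalize_adjacency(adj)
print(adj_norm)
```

With self-loops the degrees are 2, 3, and 2, so the entry for edge (0, 1) is 1/sqrt(2 * 3) and the diagonal entries are 1/2, 1/3, and 1/2.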
11. Graph Attention Network (GAT)
Architecture: Applies attention mechanisms to graphs, learning different weights for different neighbors. Each node attends to its neighbors with learned attention coefficients, allowing the model to focus on the most relevant connections. Supports multi-head attention.
Key Innovation: Attention-based aggregation replaces fixed graph convolution weights, adapting to the local structure.
Use Cases: Node and graph classification, graphs where neighbors differ in importance, heterogeneous graphs.
import torch
import torch.nn as nn

# GAT attention mechanism (conceptual)
class GATLayer(nn.Module):
    def __init__(self, in_features, out_features, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.out_features = out_features
        self.W = nn.Linear(in_features, out_features * num_heads)
        # Scoring vector a; defined for reference, not applied in this simplified forward pass
        self.attention = nn.Linear(2 * out_features, 1)

    def forward(self, x, edge_index):
        # Multi-head attention on graph neighbors
        h = self.W(x)  # Transform features: (num_nodes, out_features * num_heads)
        # A full layer would next compute attention coefficients for each edge:
        # alpha_ij = softmax_j(LeakyReLU(a^T [Wh_i || Wh_j]))
        # and return the alpha-weighted sum of neighbor features.
        return h  # Simplified: attention weighting omitted

model = GATLayer(16, 8, num_heads=4)
print("GAT: attention-based message passing on graphs")
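The conceptual layer above stops before the attention step. As a rough sketch of that step, the single-head layer below computes the coefficients alpha_ij on a dense adjacency matrix. The class name `DenseGATLayer`, the dense formulation, and the single head are simplifying assumptions for illustration; library implementations such as PyTorch Geometric work on sparse edge lists and support multiple heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseGATLayer(nn.Module):
    """Single-head graph attention on a dense adjacency matrix (illustrative)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)
        self.a = nn.Linear(2 * out_features, 1, bias=False)

    def forward(self, x, adj):
        h = self.W(x)                                   # (N, F')
        n = h.size(0)
        # Build [Wh_i || Wh_j] for every node pair (i, j)
        h_i = h.unsqueeze(1).expand(n, n, -1)
        h_j = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([h_i, h_j], dim=-1)).squeeze(-1), 0.2)
        # Mask non-edges so the softmax runs only over each node's neighbors
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)                 # (N, N), rows sum to 1
        return alpha @ h, alpha

torch.manual_seed(0)
# Path graph 0 - 1 - 2 with self-loops, so every row has at least one neighbor
adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])
x = torch.randn(3, 16)
layer = DenseGATLayer(16, 8)
out, alpha = layer(x, adj)
print(out.shape)
print(alpha)
```

Unlike the fixed normalization in a GCN, the weights in `alpha` depend on the node features, so two neighbors of the same node can contribute very differently.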
12. Autoencoder
Architecture: An encoder-decoder network trained to reconstruct its input. The encoder compresses input to a lower-dimensional bottleneck (latent representation), and the decoder reconstructs from the bottleneck. The network learns efficient data representations.
Key Innovation: Unsupervised feature learning through reconstruction, with the bottleneck forcing useful compression.
Use Cases: Dimensionality reduction, denoising, anomaly detection, pretraining, data compression.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 256),
            nn.ReLU(),
            nn.Linear(256, input_dim),
            nn.Sigmoid()  # Outputs in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

    def encode(self, x):
        return self.encoder(x)

model = Autoencoder()
x = torch.rand(8, 784)  # Inputs in [0, 1] to match the Sigmoid output range
reconstructed = model(x)
print(f"Input: {x.shape}, Latent: {model.encode(x).shape}, Output: {reconstructed.shape}")
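The model above is only run forward. Training uses the input itself as the target, as in this minimal sketch. It uses a smaller stand-in network, random data in place of real images, and mean squared error as the reconstruction loss; all of these are assumptions made to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small stand-in autoencoder (sizes are illustrative)
model = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),       # encoder -> 32-dim bottleneck
    nn.Linear(32, 784), nn.Sigmoid(),    # decoder -> values in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(64, 784)                  # stand-in batch scaled to [0, 1]
losses = []
for step in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), x)        # the input is its own target
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"Loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

No labels appear anywhere in the loop, which is what makes this an unsupervised method.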
13. Variational Autoencoder (VAE)
Architecture: A probabilistic autoencoder that learns a smooth, continuous latent space. The encoder outputs a mean and variance (not a point), and samples from this distribution using the reparameterization trick. The loss combines reconstruction error and KL divergence.
Key Innovation: Generates new data by sampling from the learned latent distribution. The KL divergence term regularizes the latent space to be smooth and continuous.
Use Cases: Image generation, drug discovery, anomaly detection, disentangled representation learning.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        # Encoder
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)  # Log-variance, for numerical stability
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid()
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and logvar
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

model = VAE()
x = torch.rand(8, 784)  # Inputs in [0, 1] to match the Sigmoid output range
recon, mu, logvar = model(x)
print(f"Latent mean shape: {mu.shape}")
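The loss described above is not shown in the model code. One common formulation, sketched here, sums a binary cross-entropy reconstruction term with the closed-form KL divergence between the encoder's Gaussian and a standard normal prior. The choice of binary cross-entropy assumes inputs scaled to [0, 1]; other reconstruction losses such as mean squared error are also used.

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, logvar):
    """Negative ELBO summed over the batch: reconstruction + KL(q(z|x) || N(0, I))."""
    recon_loss = F.binary_cross_entropy(recon, x, reduction="sum")
    # Closed-form KL divergence for a diagonal Gaussian against N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl, recon_loss, kl

x = torch.rand(8, 784)                    # stand-in inputs in [0, 1]
recon = torch.full((8, 784), 0.5)         # stand-in reconstruction
logvar = torch.zeros(8, 20)               # unit variance

# Case 1: the posterior equals the prior, so the KL term vanishes
_, recon_loss, kl_zero = vae_loss(recon, x, torch.zeros(8, 20), logvar)

# Case 2: shifting every mean to 1 costs 0.5 * mu^2 per latent dimension
_, _, kl_shifted = vae_loss(recon, x, torch.ones(8, 20), logvar)

print(f"KL (mu=0): {abs(kl_zero.item()):.1f}")    # 0.0
print(f"KL (mu=1): {kl_shifted.item():.1f}")      # 80.0
```

The KL term penalizes encodings that drift away from the prior, which is what keeps the latent space smooth enough to sample from.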
14. Generative Adversarial Network (GAN)
Architecture: Two networks trained adversarially. The Generator creates fake data from random noise, trying to fool the Discriminator. The Discriminator tries to distinguish real data from generated fakes. Through this competition, the Generator learns to produce realistic data.
Key Innovation: Adversarial training -- no explicit density estimation needed. The game-theoretic framework drives both networks to improve.
Use Cases: Image generation (faces, art), style transfer, data augmentation, super-resolution, text-to-image.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, output_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, output_dim), nn.Tanh()  # Outputs in [-1, 1]
        )

    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid()  # Probability that the input is real
        )

    def forward(self, x):
        return self.model(x)

G = Generator()
D = Discriminator()
z = torch.randn(8, 100)
fake_images = G(z)
validity = D(fake_images)
print(f"Generated: {fake_images.shape}, Validity: {validity.shape}")
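The code above builds the two networks but does not show the adversarial updates. The following is a minimal sketch of one training iteration with the standard binary cross-entropy objective. It uses small stand-in networks and random "real" data; the sizes, learning rate, and Adam betas are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim = 16, 32             # illustrative sizes

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCELoss()

real = torch.rand(8, data_dim) * 2 - 1    # stand-in "real" batch in [-1, 1]
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

# 1) Discriminator step: push D(real) toward 1 and D(fake) toward 0.
#    detach() stops this loss from sending gradients into the generator.
fake = G(torch.randn(8, latent_dim))
loss_D = bce(D(real), ones) + bce(D(fake.detach()), zeros)
opt_D.zero_grad()
loss_D.backward()
opt_D.step()

# 2) Generator step: update G so that D labels its samples as real.
loss_G = bce(D(fake), ones)
opt_G.zero_grad()
loss_G.backward()
opt_G.step()

print(f"loss_D={loss_D.item():.4f}, loss_G={loss_G.item():.4f}")
```

In a full training loop these two steps alternate over many batches, and each network's loss depends on the current state of the other.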