Advanced

Emerging Architectures

Look beyond Transformers into the future of AI: from brain-inspired spiking networks to mathematically elegant KANs, these are the architectures that could define the next era of artificial intelligence.

Kolmogorov-Arnold Networks (KAN)

KANs (Liu et al., 2024) represent a fundamental rethinking of neural network design. While standard neural networks (MLPs) have fixed activation functions on nodes and learnable linear weights on edges, KANs flip this: they place learnable activation functions on edges and use simple summation at nodes.

  • Mathematical basis: KANs are inspired by the Kolmogorov-Arnold representation theorem, which states that any continuous multivariate function can be decomposed into compositions of continuous single-variable functions and addition.
  • Learnable activations: Each edge has its own activation function, parameterized as a B-spline. This is more expressive than fixed ReLU or GELU activations.
  • Accuracy advantages: KANs can achieve the same accuracy as MLPs with far fewer parameters, especially for scientific computing and function approximation tasks.
  • Interpretability: The learned spline functions on each edge are human-inspectable, making KANs more interpretable than standard neural networks. Symbolic regression can extract closed-form mathematical expressions from trained KANs.
  • Current limitations: KANs are slower to train than MLPs (B-spline computation is more expensive than matrix multiplication), and their advantage over standard architectures for large-scale deep learning tasks (like language modeling) is not yet established.
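
The edge-activation idea can be sketched in a few lines. This is a toy illustration, not the reference implementation: real KANs parameterize each edge with B-splines, while here a sum of fixed Gaussian bumps with learnable coefficients stands in for the spline, and all names (`make_edge_activation`, `phi`) are invented for the example.

```python
import numpy as np

# Toy KAN-style learnable edge activation (illustrative assumption):
# phi(x) = sum_k c_k * B_k(x), with fixed basis functions B_k and
# learnable coefficients c_k. Real KANs use B-spline bases.

def make_edge_activation(num_basis=8, x_range=(-2.0, 2.0), rng=None):
    rng = rng or np.random.default_rng(0)
    centers = np.linspace(*x_range, num_basis)        # fixed basis centers
    width = (x_range[1] - x_range[0]) / num_basis
    coeffs = rng.normal(scale=0.1, size=num_basis)    # learnable parameters

    def phi(x):
        # Weighted sum of Gaussian bumps, applied elementwise to x.
        basis = np.exp(-((x[..., None] - centers) ** 2) / (2 * width ** 2))
        return basis @ coeffs

    return phi, coeffs

phi, coeffs = make_edge_activation()
x = np.linspace(-2, 2, 5)
print(phi(x).shape)  # one learnable scalar function per edge
```

A full KAN layer would hold one such function per (input, output) edge and sum the results at each output node; training adjusts the coefficients by gradient descent.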

Liquid Neural Networks

Liquid Neural Networks (Hasani et al., MIT, 2021) are inspired by the nervous system of the C. elegans worm (which has only 302 neurons). They use continuous-time dynamics with time-varying parameters:

  • Continuous-time ODE: Unlike standard RNNs with discrete time steps, liquid networks model neuron dynamics as continuous differential equations whose parameters change over time.
  • Extreme compactness: A liquid network with just 19 neurons can learn to steer a car, compared to thousands of neurons in conventional architectures. This makes them ideal for edge deployment.
  • Causal reasoning: Liquid networks naturally learn causal relationships in sequential data, making them excellent for time-series analysis and control tasks.
  • Adaptability: The time-varying parameters allow the network to adapt its behavior based on the input, providing a form of continuous-time attention.
  • Applications: Autonomous driving (tested in real self-driving cars), robotic control, medical time-series monitoring, and weather prediction.
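
The continuous-time dynamics can be illustrated with a simplified liquid time-constant (LTC) style update, integrated with explicit Euler. The specific form `dh/dt = -h/tau + f(x, h) * (A - h)` and all parameter values below are illustrative assumptions in the spirit of the LTC formulation, not the published implementation.

```python
import numpy as np

# Sketch of a liquid time-constant neuron update (simplified assumption):
# dh/dt = -h / tau + f(x, h) * (A - h), where the input-dependent
# nonlinearity f acts like a time-varying conductance.

def ltc_step(h, x, W_in, W_rec, tau, A, dt=0.05):
    f = np.tanh(W_in @ x + W_rec @ h)   # input-dependent gating
    dh = -h / tau + f * (A - h)         # leak plus driven dynamics
    return h + dt * dh                  # explicit Euler step

rng = np.random.default_rng(0)
n, m = 4, 3                             # 4 neurons, 3 inputs (toy sizes)
h = np.zeros(n)
W_in = rng.normal(size=(n, m)) * 0.5
W_rec = rng.normal(size=(n, n)) * 0.5
tau, A = np.full(n, 1.0), np.full(n, 1.0)

for t in range(100):                    # drive with a sinusoidal input
    x = np.sin(0.1 * t) * np.ones(m)
    h = ltc_step(h, x, W_in, W_rec, tau, A)
print(h.shape)
```

The key property to notice is that the effective time constant of each neuron depends on the input through `f`, which is what gives the network its "liquid", input-adaptive behavior.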

Neural ODEs

Neural Ordinary Differential Equations (Chen et al., 2018) treat the hidden state dynamics of a neural network as a continuous transformation defined by an ODE:

  • Continuous depth: Instead of a fixed number of discrete layers, Neural ODEs define a continuous transformation dh/dt = f(h, t, theta), where f is a neural network. The output is obtained by solving this ODE from time t=0 to t=1.
  • Adaptive computation: The ODE solver adaptively chooses the number of evaluation steps based on the complexity of the input, using more computation for harder examples.
  • Memory efficiency: The adjoint method allows training with O(1) memory regardless of the number of ODE solver steps (compared to O(L) memory for L layers in standard networks).
  • Continuous normalizing flows: Neural ODEs enable continuous-time generative models that transform a simple distribution into a complex data distribution through a smooth ODE trajectory.
  • Limitations: Training is slower due to ODE solving in both forward and backward passes, and the expressiveness of the continuous dynamics is restricted compared to discrete layers.
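
The continuous-depth idea is easy to see with a fixed-step Euler solver, which is the simplest possible stand-in for the adaptive solvers real Neural ODEs use. The two-layer dynamics network `f` and the step count below are illustrative choices; practical implementations use adaptive solvers (e.g. Dormand-Prince) and the adjoint method for gradients.

```python
import numpy as np

# Sketch of a Neural ODE forward pass: dh/dt = f(h, t, theta),
# solved from t=0 to t=1 with fixed-step Euler (an assumption; real
# implementations use adaptive solvers).

rng = np.random.default_rng(0)
d, hidden = 3, 16
W1, b1 = rng.normal(size=(hidden, d + 1)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(d)

def f(h, t):
    # The dynamics network: time t is concatenated to the state.
    z = np.tanh(W1 @ np.concatenate([h, [t]]) + b1)
    return W2 @ z + b2

def odeint_euler(h0, steps=100):
    h, dt = h0.copy(), 1.0 / steps
    for i in range(steps):
        h = h + dt * f(h, i * dt)   # one Euler step along the trajectory
    return h

h1 = odeint_euler(np.array([1.0, -0.5, 0.2]))
print(h1.shape)
```

Replacing `steps=100` with an adaptive solver is exactly where the "adaptive computation" bullet above comes from: easy inputs need few function evaluations, hard ones need many.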

Spiking Neural Networks (SNNs)

Spiking Neural Networks are the "third generation" of neural networks, more closely mimicking biological neurons. Unlike standard artificial neurons that output continuous values, spiking neurons communicate through discrete spikes (binary events in time):

  • Temporal coding: Information is encoded in the timing and frequency of spikes, not in continuous activation values. This enables rich temporal representations.
  • Event-driven computation: Neurons only compute when they receive or emit a spike. Between spikes, no computation occurs, leading to extreme energy efficiency for sparse, event-driven data.
  • Biological plausibility: SNNs model the leaky integrate-and-fire (LIF) dynamics of real neurons, including membrane potential, threshold firing, and refractory periods.
  • Energy efficiency: On neuromorphic hardware (see below), SNNs can be 100-1000x more energy efficient than equivalent ANNs for certain tasks.
  • Challenges: Training SNNs is difficult because spikes are non-differentiable (requiring surrogate gradient methods), and current deep learning frameworks are optimized for continuous-valued networks.
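
The leaky integrate-and-fire dynamics described above fit in a short simulation. The constants (time constant, threshold, input current) are illustrative; real SNN work adds refractory periods and learns the weights with surrogate gradients.

```python
import numpy as np

# Minimal leaky integrate-and-fire (LIF) simulation: the membrane
# potential leaks toward rest, integrates input current, and emits a
# spike (then resets) when it crosses the threshold.

def simulate_lif(current, tau=20.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    v, spikes = 0.0, []
    for i_t in current:
        v += dt * (-v / tau + i_t)      # leaky integration
        if v >= v_thresh:               # threshold crossing -> spike
            spikes.append(1)
            v = v_reset                 # reset after firing
        else:
            spikes.append(0)
    return np.array(spikes)

spikes = simulate_lif(np.full(100, 0.12))  # constant input current
print(spikes.sum(), "spikes in 100 steps")
```

Note that the output is a sparse binary train rather than a continuous activation: all information is in *when* the spikes occur, which is the temporal coding described above, and it is exactly this hard threshold that makes the model non-differentiable.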

Neuromorphic Computing

Neuromorphic chips are specialized hardware designed to run spiking neural networks efficiently. They mimic the brain's architecture rather than the von Neumann architecture used by GPUs and CPUs:

  • Intel Loihi 2: Intel's second-generation neuromorphic chip with 1 million neurons and 120 million synapses. Up to 100x more energy efficient than GPUs for SNN workloads.
  • IBM NorthPole: A digital inference chip inspired by neuromorphic principles, achieving 12 trillion operations per second per watt for neural network inference.
  • SpiNNaker 2: A million-core neuromorphic supercomputer designed to simulate large-scale brain models in real time.
  • Applications: Ultra-low-power edge AI (hearing aids, wearables, IoT sensors), real-time event processing (dynamic vision sensors), and brain-computer interfaces.

Test-Time Training (TTT)

Test-Time Training (Sun et al., 2024) challenges the traditional separation between training and inference. Instead of using a fixed model at test time, TTT continues to learn during inference:

  • TTT layers: Replace standard linear attention or MLP layers with layers that perform gradient-based self-supervised learning on the test input. Each token's hidden state is updated by a small inner training loop.
  • Adaptive representations: The model adapts its internal representations to each specific input, potentially handling distribution shifts and novel patterns better than fixed models.
  • Linear complexity: TTT layers can achieve the expressiveness of attention while maintaining O(n) complexity, similar to SSMs but with stronger adaptation capabilities.
  • Trade-offs: TTT adds computational overhead per token (due to the inner optimization) and requires careful design of the self-supervised objective.
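
The inner training loop can be sketched with a toy linear layer. Everything here is an illustrative assumption: the self-supervised objective (reconstructing a noise-corrupted view of the input), the corruption, and the function names are stand-ins, not the paper's exact formulation.

```python
import numpy as np

# Toy test-time-training sketch: the layer's weights W take a few
# gradient steps on a self-supervised loss computed from the test
# input itself, then produce the adapted output.

def ttt_layer(x, W, inner_steps=5, lr=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    x_corrupt = x + 0.1 * rng.normal(size=x.shape)  # self-supervised view
    for _ in range(inner_steps):
        pred = W @ x_corrupt
        grad = np.outer(pred - x, x_corrupt)        # d/dW of 0.5||Wx'-x||^2
        W = W - lr * grad                           # inner training step
    return W @ x, W                                 # adapted output + weights

x = np.array([1.0, -0.5, 0.3])
W = np.eye(3) * 0.5
y, W_adapted = ttt_layer(x, W)
print(np.linalg.norm(W_adapted @ x - x), "vs", np.linalg.norm(W @ x - x))
```

The reconstruction error after adaptation is lower than before, which is the point: the layer has specialized itself to this particular input before producing its output.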

Hyena Architecture

Hyena (Poli et al., 2023) replaces the attention mechanism with a combination of long convolutions and element-wise gating. It achieves sub-quadratic complexity while maintaining the quality of attention-based models:

  • Uses implicitly parameterized long convolutions (learned via a small neural network that generates the convolution filter)
  • Applies element-wise multiplicative gating for input-dependent processing
  • Achieves O(n log n) complexity via FFT-based convolution
  • Matches Transformer quality on language modeling benchmarks up to medium scale
  • The Hyena concept influenced StripedHyena and other sub-quadratic architectures
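
The core computation can be sketched in NumPy. The filter below is a fixed decaying kernel for brevity; in Hyena it is generated implicitly by a small neural network, and the full block stacks several such convolution-and-gate stages.

```python
import numpy as np

# Sketch of Hyena's core operation: a long causal convolution applied
# via FFT in O(n log n), followed by elementwise multiplicative gating.

def fft_causal_conv(x, h):
    n = len(x)
    L = 2 * n                         # zero-pad to avoid circular wraparound
    y = np.fft.irfft(np.fft.rfft(x, L) * np.fft.rfft(h, L), L)
    return y[:n]                      # keep the causal part

n = 256
x = np.random.default_rng(0).normal(size=n)   # input projection (toy)
gate = np.tanh(x)                             # gating branch (toy)
h = np.exp(-0.05 * np.arange(n))              # long decaying filter
y = fft_causal_conv(x, h) * gate              # convolve, then gate
print(y.shape)
```

The FFT is what buys the O(n log n) complexity quoted above: a naive length-n convolution over the whole sequence would cost O(n²), the same as attention.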

RETRO: Retrieval-Enhanced Transformers

RETRO (Borgeaud et al., DeepMind, 2022) augments Transformers with explicit retrieval from an external database during inference, reducing the need to store all knowledge in model weights:

  • Retrieval mechanism: For each chunk of input text, RETRO retrieves similar text passages from a large database (e.g., 2 trillion tokens) using nearest-neighbor search.
  • Cross-attention fusion: Retrieved passages are integrated into the model's processing through cross-attention layers, allowing the model to condition its output on relevant retrieved information.
  • Efficiency: A 7.5B parameter RETRO model matches the performance of a 280B parameter model that must store all knowledge in its weights, representing a massive efficiency gain.
  • Updatable knowledge: The retrieval database can be updated without retraining the model, solving the knowledge cutoff problem.
  • Influence: RETRO's approach influenced RAG (Retrieval-Augmented Generation) systems, which are now standard in production LLM deployments.
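
The retrieval step RETRO relies on can be sketched with a small in-memory database. The embeddings below are random stand-ins and `retrieve` is an invented helper; RETRO uses frozen BERT embeddings and an approximate nearest-neighbor index to make this work at trillion-token scale.

```python
import numpy as np

# Sketch of nearest-neighbor retrieval: embed the query chunk, rank
# database passages by cosine similarity, return the top-k indices.

def retrieve(query_emb, db_embs, k=2):
    # Cosine similarity = dot product of L2-normalized vectors.
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]      # indices of the k closest passages

rng = np.random.default_rng(0)
db_embs = rng.normal(size=(1000, 64))             # toy passage database
query = db_embs[42] + 0.01 * rng.normal(size=64)  # query near passage 42
print(retrieve(query, db_embs))
```

In RETRO proper, the retrieved passages are then re-encoded and fed into the decoder through the cross-attention layers described above, rather than simply returned as indices.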

Modular and Composable Architectures

The trend toward modularity reflects the idea that monolithic architectures may not be optimal for all tasks. Composable architectures allow combining specialized components:

  • Adapters and LoRA: Small trainable modules inserted into frozen pre-trained models, enabling efficient task-specific adaptation without modifying the base model.
  • Modular networks: Systems where different modules handle different sub-tasks (e.g., separate modules for vision, language, and reasoning), composed dynamically based on the input.
  • Tool-augmented models: LLMs that can call external tools (calculators, code interpreters, search engines) as modular capabilities, extending their competence without architectural changes.
  • Multi-expert routing: Beyond MoE, future architectures may dynamically compose specialized sub-networks from a larger model based on task requirements.
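
The LoRA idea from the first bullet is compact enough to show directly. Shapes and the `alpha / r` scaling follow the LoRA formulation; the sizes and initialization values below are illustrative.

```python
import numpy as np

# Minimal LoRA sketch: the frozen weight W is augmented with a low-rank
# update (alpha / r) * B @ A, where only A and B are trained.

d_out, d_in, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01       # trainable, small init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x):
    # With B = 0 at init, the adapted layer exactly matches the frozen one.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True at initialization
```

The modularity comes from the fact that `A` and `B` hold only `r * (d_in + d_out)` parameters per layer: many task-specific adapter pairs can be stored and swapped in against one frozen base model.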

Neural Architecture Search (NAS)

Neural Architecture Search (NAS) automates the design of neural network architectures, using AI to design AI:

  • Search space: Define the space of possible architectures (types of layers, connections, hyperparameters).
  • Search strategy: Use reinforcement learning, evolutionary algorithms, or gradient-based methods to explore the search space efficiently.
  • Notable results: NAS discovered EfficientNet (which outperformed hand-designed CNNs) and NASNet (competitive image classification with novel architectures humans had not considered).
  • Limitations: NAS is extremely compute-intensive (the original NAS paper used 500 GPUs for weeks). Hardware-aware NAS and weight-sharing strategies have reduced costs but it remains expensive.
  • Future: As AI becomes more capable, AI-designed architectures may eventually surpass human-designed ones, creating a positive feedback loop in AI development.
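
The search-space and search-strategy bullets can be made concrete with the simplest possible strategy, random search. Everything here is an illustrative assumption: the search space is tiny, and `score` is a stand-in for actually training each candidate and measuring validation accuracy.

```python
import numpy as np

# Toy random-search NAS loop: sample architectures from a small search
# space, "evaluate" each with a stand-in score, keep the best.

rng = np.random.default_rng(0)
search_space = {
    "depth": [2, 4, 8],
    "width": [64, 128, 256],
    "activation": ["relu", "gelu", "swish"],
}

def sample_arch():
    # One random choice per architectural dimension.
    return {k: rng.choice(v) for k, v in search_space.items()}

def score(arch):
    # Stand-in for validation accuracy after (proxy) training.
    return float(arch["depth"]) * 0.01 + float(arch["width"]) * 0.001

best = max((sample_arch() for _ in range(20)), key=score)
print(best)
```

Reinforcement-learning and evolutionary NAS replace the independent random sampling with a strategy that learns from previous evaluations, and weight-sharing methods amortize the cost of `score` across candidates.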

The Convergence Trend: Multimodal Architectures

Perhaps the most important architectural trend is convergence. Rather than specialized architectures for each modality (CNNs for images, RNNs for audio, Transformers for text), modern architectures are increasingly unified and multimodal:

  • Vision Transformers (ViT): Showed that the same Transformer architecture works for images (by treating image patches as tokens), eliminating the need for specialized CNN architectures.
  • GPT-4o and Gemini: Process text, images, audio, and video through a single unified architecture, with shared representations across modalities.
  • Any-to-any models: Emerging architectures that can take any combination of modalities as input and produce any modality as output, moving toward truly general-purpose AI systems.
  • Tokenization as the universal interface: The trend toward converting all data types (text, images, audio, video, actions, sensor readings) into token sequences that can be processed by a single sequence model.

What's Next: Architecture Landscape

| Architecture | Status | Potential Impact | Timeline |
| --- | --- | --- | --- |
| KAN | Early research | Scientific computing, interpretability | 2-5 years |
| Liquid Networks | Active deployment | Edge AI, robotics, autonomous systems | 1-3 years |
| Neural ODEs | Niche adoption | Physics simulation, generative models | 2-4 years |
| Spiking NNs | Hardware-dependent | Ultra-low-power AI, brain interfaces | 3-7 years |
| TTT Layers | Early research | Adaptive inference, long context | 2-4 years |
| Hyena / Sub-quadratic | Active research | Efficient long-sequence processing | 1-2 years |
| RETRO / Retrieval | Production (as RAG) | Knowledge-augmented models | Now |
| Multimodal unified | Active deployment | General-purpose AI systems | Now - 2 years |
| AI-designed architectures | Early research | Architectures beyond human design | 3-10 years |
💡 The big picture: The Transformer will likely remain dominant for the next few years, but it will be augmented and partially replaced by ideas from these emerging architectures. The future is hybrid: combining attention's recall strength with SSMs' efficiency, MoE's scalability, retrieval's knowledge access, and perhaps even neuromorphic hardware's energy efficiency. The best architectures of tomorrow will likely draw from all of these approaches.

Course Completion

Congratulations on completing the AI Architecture course!

You have journeyed from foundational architectures like CNNs and RNNs through the dominant Transformer paradigm, explored generative architectures (GANs, diffusion models, autoencoders), studied efficiency innovations (MoE, SSMs), learned about specialized architectures (GNNs), and glimpsed the future with emerging designs.

You now have a comprehensive understanding of the architectural landscape that powers modern AI. This knowledge will serve you well whether you are building AI applications, fine-tuning models, or pushing the boundaries of what these architectures can do.

Suggested next steps: Explore the AI Design Patterns course to learn how to combine these architectures into production systems, or dive deeper into Deep Learning for hands-on implementation practice.