AI Hardware

Master AI hardware end-to-end. 50 topics covering NVIDIA H100/H200/Blackwell, AMD MI300, Apple Silicon, Intel Gaudi, TPUs, Trainium, Cerebras, Groq, the GPU software stack (CUDA, Triton, ROCm, NCCL), inference engines (TensorRT-LLM, vLLM, SGLang, llama.cpp, MLX), memory and interconnect (HBM, NVLink, InfiniBand, RDMA, CXL), edge AI (Jetson, Coral, mobile NPUs), and AI data centers.

50 Topics
300 Lessons
7 Categories
100% Free

AI Hardware is the track we recommend to any engineer whose job involves making a GPU budget argument to finance or sizing an inference cluster. The hardware layer has become a competitive differentiator, not a commodity. The difference between an H100, an A100, an L4, a Trainium instance, and an Apple Neural Engine is not a number on a spec sheet but a stack of real-world tradeoffs: memory bandwidth, interconnect, power envelope, and the software that actually runs on each.

We cover the major accelerator families, the system-level considerations (cooling, power, networking, NVLink versus PCIe), and the economics that determine whether renting is still cheaper than owning at your workload scale. We also cover the small-model and on-device side: mobile NPUs, edge AI, and quantization-for-hardware, because much of the next wave of applied AI runs outside the datacenter. The framing throughout is that hardware choices should follow workload shape, not vendor marketing.
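"Workload shape" can be made concrete with back-of-envelope sizing: for transformer inference, the two numbers that dominate a hardware decision are weight memory and KV-cache memory. A minimal sketch of that arithmetic (the bytes-per-parameter and per-cache-element formulas are standard; the example model dimensions are assumptions for illustration):

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights, e.g. 2 bytes/param for FP16/BF16."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 70B-class model in FP16 with grouped-query attention
# (80 layers, 8 KV heads of dim 128 -- assumed, Llama-70B-like numbers).
weights_gb = weight_bytes(70e9, 2) / 1e9            # ~140 GB: does not fit one 80GB GPU
kv_gb = kv_cache_bytes(80, 8, 128, 8192, 16) / 1e9  # ~43 GB at batch 16, 8k context
print(f"weights ~ {weights_gb:.0f} GB, KV cache ~ {kv_gb:.0f} GB")
```

Run the same arithmetic against your own batch sizes and context lengths before comparing vendor spec sheets; it often settles the accelerator question on memory capacity alone.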

All Topics

50 topics organized into 7 categories spanning the full AI hardware stack.

GPUs

🎯

NVIDIA H100 Mastery

Master the NVIDIA H100 — the workhorse GPU of the LLM era. Learn the Hopper architecture, FP8 training, transformer engine, NVLink, and the patterns that get you peak throughput.

6 Lessons
🚀

NVIDIA H200

Master the H200 — the H100 with 141GB of HBM3e and ~1.4x faster LLM inference. Learn what changed, when it matters, and the migration patterns from H100.

6 Lessons
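The ~1.4x figure above follows almost entirely from memory bandwidth: autoregressive decode is typically bandwidth-bound, so the tokens/s ceiling scales with HBM bandwidth. A rough roofline sketch, using the published peak bandwidths (~3.35 TB/s for H100 SXM, ~4.8 TB/s for H200) and an assumed 70B FP16 model:

```python
def decode_tokens_per_sec(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Bandwidth-bound ceiling for batch-1 decode: each generated token
    must stream all weights from HBM once."""
    return bandwidth_bytes_per_s / model_bytes

model = 70e9 * 2  # 70B params in FP16 ~ 140 GB (assumed example model)
h100 = decode_tokens_per_sec(3.35e12, model)  # ~24 tok/s ceiling
h200 = decode_tokens_per_sec(4.8e12, model)   # ~34 tok/s ceiling
print(f"H100 ~ {h100:.0f} tok/s, H200 ~ {h200:.0f} tok/s, ratio ~ {h200/h100:.2f}x")
```

The ratio 4.8/3.35 ≈ 1.43 is where the "~1.4x faster LLM inference" headline comes from; real workloads land below both ceilings, but usually preserve the ratio.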
🌟

NVIDIA B200/Blackwell

Master the Blackwell B200 and GB200 — NVIDIA's frontier GPUs with FP4, second-gen transformer engine, and fifth-gen NVLink. Learn what 1 exaflop in a rack means for AI.

6 Lessons

NVIDIA A100

Master the A100 — still the backbone of many AI fleets. Learn the Ampere architecture, BF16, structured sparsity, MIG, and when the A100 still beats the H100 on $/throughput.

6 Lessons
🎥

NVIDIA L40S and L4

Master the L40S (Ada workstation/inference) and L4 (efficient inference). Learn when these beat H100/H200 for inference and graphics-AI hybrid workloads.

6 Lessons
🧠

NVIDIA Grace Hopper / Grace Blackwell

Master Grace Hopper (GH200) and Grace Blackwell (GB200) superchips — CPU+GPU sharing memory via NVLink-C2C. Learn unified memory, when it shines, and migration patterns.

6 Lessons
🔥

AMD MI300X / MI325X

Master AMD's flagship AI GPUs. Learn the CDNA architecture, 192GB of HBM3 (256GB of HBM3e on MI325X), the ROCm software stack, and when MI300X/MI325X are a credible alternative to H100/H200.

6 Lessons
📱

Apple Silicon for AI (M3/M4)

Master Apple's M3 and M4 chips for AI work. Learn unified memory, Neural Engine, MLX framework, and the patterns for running 70B models on a MacBook.

6 Lessons
🔬

Intel Gaudi 2 / Gaudi 3

Master Intel Gaudi 2 and Gaudi 3 AI accelerators. Learn the architecture, SynapseAI software, integrated networking, and when Gaudi beats GPUs on $/training.

6 Lessons
🎮

Consumer GPUs for AI (RTX 4090/5090)

Master consumer GPUs for AI: RTX 4090, RTX 5090, dual-GPU rigs. Learn what fits in 24GB, the cost-per-token math, and when consumer GPUs beat datacenter parts.

6 Lessons
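The cost-per-token math mentioned above is a short amortization formula. A sketch of it — every input here (price, lifetime, power draw, electricity rate, throughput) is a hypothetical placeholder to replace with your own measured numbers:

```python
def cost_per_million_tokens(gpu_cost_usd: float, lifetime_hours: float,
                            power_watts: float, usd_per_kwh: float,
                            tokens_per_sec: float) -> float:
    """Amortized $/1M tokens for an owned GPU running flat out.
    Ignores the host system, cooling, and utilization gaps."""
    capex_per_hour = gpu_cost_usd / lifetime_hours
    power_per_hour = (power_watts / 1000) * usd_per_kwh
    tokens_per_hour = tokens_per_sec * 3600
    return (capex_per_hour + power_per_hour) / tokens_per_hour * 1e6

# Hypothetical: $1,800 RTX 4090, 3-year life, 450 W, $0.15/kWh, 2,500 tok/s batched
print(f"${cost_per_million_tokens(1800, 3 * 365 * 24, 450, 0.15, 2500):.3f} per 1M tokens")
```

Comparing that output against datacenter rental rates, at your real utilization, is the whole consumer-versus-datacenter argument in one number.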

Specialized Accelerators

🔭

Google TPU v5p / Trillium

Master Google TPUs — purpose-built ML accelerators. Learn TPU v5p, Trillium (v6), MXU, JAX integration, and the patterns for training and serving on TPU pods.

6 Lessons

AWS Trainium 2

Master AWS Trainium 2 — purpose-built training chip. Learn Neuron SDK, NKI kernels, Trn2 instances, and when Trainium beats GPUs on $/training-hour.

6 Lessons
🔥

AWS Inferentia

Master AWS Inferentia 2 for low-cost inference. Learn Inf2 instances, model compilation, batching patterns, and when Inferentia beats GPUs for inference.

6 Lessons
🏹

Cerebras Wafer-Scale Engine

Master Cerebras WSE — the world's largest chip. Learn the wafer-scale architecture, weight streaming, and the patterns for fastest-on-earth LLM inference.

6 Lessons

Groq LPU

Master the Groq LPU — deterministic, ultra-low-latency inference. Learn the Tensor Streaming Processor, why it's so fast on LLMs, and when Groq fits your workload.

6 Lessons
📊

SambaNova RDU

Master SambaNova RDU (Reconfigurable Dataflow Unit). Learn the dataflow architecture, SambaFlow software, and when RDUs beat GPUs for specific AI workloads.

6 Lessons
🔗

Graphcore IPU

Master Graphcore IPU — Intelligence Processing Unit. Learn the MIMD architecture, Poplar SDK, BOW IPU, and the niches where IPUs excel.

6 Lessons
🔮

Tenstorrent Wormhole / Blackhole

Master Tenstorrent: open-source RISC-V AI hardware. Learn Wormhole and Blackhole chips, TT-Metal SDK, and the open-hardware approach to AI acceleration.

6 Lessons
🛡

Etched Sohu (Transformer ASIC)

Master Etched Sohu — the first transformer-only ASIC. Learn what 'transformer baked in silicon' means, performance claims vs reality, and the implications.

6 Lessons
📝

Microsoft Maia 100

Master Microsoft Maia 100 — Azure's custom AI accelerator. Learn the architecture, Azure integration, and Microsoft's silicon strategy for OpenAI workloads.

6 Lessons

GPU Software Stack

💻

CUDA Programming for AI

Master CUDA C++ for AI work. Learn the programming model, kernels, shared memory, warp-level primitives, and the patterns to write GPU code that beats vendor libraries.

6 Lessons

OpenAI Triton Compiler

Master OpenAI Triton — Python-like GPU programming that compiles to high-performance kernels. Learn block programming, autotuning, and the patterns that beat hand-written CUDA.

6 Lessons
🛡

AMD ROCm Stack

Master ROCm — AMD's open-source GPU compute platform. Learn HIP, ROCm libraries, PyTorch on ROCm, and porting CUDA code to ROCm.

6 Lessons
🧠

cuDNN Deep Learning Library

Master cuDNN — NVIDIA's deep neural network primitives. Learn convolution algorithms, attention, batchnorm, and how to call cuDNN directly when frameworks fall short.

6 Lessons
🔬

cuBLAS and cuSPARSE

Master cuBLAS (dense) and cuSPARSE (sparse) linear algebra on GPU. Learn GEMM tuning, batched ops, mixed precision, and the patterns for max FLOPS.

6 Lessons
🔗

NCCL Collectives

Master NCCL — NVIDIA's multi-GPU communication library. Learn all-reduce, all-gather, broadcast, NVLink/InfiniBand topology, and tuning collectives for distributed training.

6 Lessons
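Tuning collectives starts from the standard ring all-reduce cost model: in the bandwidth term, each of N ranks moves 2(N−1)/N times the buffer size over its slowest link. A sketch of that model (the formula is standard; the link bandwidths are illustrative round numbers — ~450 GB/s per direction for NVLink4, 50 GB/s for 400 Gb/s InfiniBand):

```python
def ring_allreduce_seconds(buffer_bytes: float, n_ranks: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth term of ring all-reduce: each rank transfers
    2*(N-1)/N * buffer over its link. Latency term ignored."""
    return 2 * (n_ranks - 1) / n_ranks * buffer_bytes / link_bytes_per_s

grad_bytes = 7e9 * 2  # gradients of an assumed 7B model in BF16
nvlink = ring_allreduce_seconds(grad_bytes, 8, 450e9)  # 8 GPUs over NVLink
ib = ring_allreduce_seconds(grad_bytes, 8, 50e9)       # same buffer over 400 Gb/s IB
print(f"NVLink ~ {nvlink * 1e3:.0f} ms, InfiniBand ~ {ib * 1e3:.0f} ms")
```

The ~9x gap between those two numbers is why NCCL's topology awareness — keeping traffic on NVLink where it can — matters so much for distributed training step time.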
📥

NVIDIA NIM Microservices

Master NVIDIA NIM — pre-optimized inference microservices. Learn how to deploy NIMs on Kubernetes, customize, and integrate with your existing AI stack.

6 Lessons
🚀

NVIDIA Dynamo

Master NVIDIA Dynamo — distributed inference framework for LLMs. Learn disaggregated prefill/decode, KV cache routing, and the patterns for max throughput at scale.

6 Lessons

Inference Engines

🎯

TensorRT-LLM

Master TensorRT-LLM — NVIDIA's optimized LLM inference engine. Learn engine compilation, FP8/INT4 quantization, in-flight batching, and the patterns for peak throughput.

6 Lessons

vLLM Internals

Go beyond using vLLM — master its internals. Learn PagedAttention, continuous batching, scheduler, KV cache, and the patterns to tune vLLM for your workload.

6 Lessons
🚀

SGLang Fast Inference

Master SGLang — fast LLM serving with structured generation. Learn RadixAttention, the SGLang frontend language, and when SGLang beats vLLM and TGI.

6 Lessons
🤗

HuggingFace TGI

Master HuggingFace Text Generation Inference. Learn deployment, quantization, multi-LoRA serving, and the patterns for production HuggingFace inference.

6 Lessons
🤗

llama.cpp Mastery

Master llama.cpp — the C++ inference engine that runs LLMs on anything. Learn GGUF, quantization formats, Metal/CUDA backends, and tuning for CPU, GPU, and edge.

6 Lessons
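The practical payoff of GGUF quantization is footprint: file size is roughly parameters × bits-per-weight / 8, plus a few percent for quantization scales. A sketch using approximate effective bits-per-weight for common GGUF quant types (the exact rates vary slightly by model architecture):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size; real GGUF files add a few percent
    for quantization scales and layers kept at higher precision."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits/weight for common GGUF quant types
for name, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"70B at {name}: ~ {gguf_size_gb(70e9, bpw):.0f} GB")
```

This is the arithmetic behind "a 70B on a laptop": Q4_K_M brings ~140 GB of FP16 weights down to roughly 42 GB, which fits in 64 GB of unified memory with room for the KV cache.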
🍏

Apple MLX

Master Apple MLX — an array framework optimized for Apple Silicon. Learn unified memory, lazy evaluation, and MLX-LM for fast on-device LLM inference.

6 Lessons
🔗

ONNX Runtime

Master ONNX Runtime — cross-platform inference engine. Learn ONNX format, graph optimization, execution providers (CUDA, TensorRT, CPU), and the deployment patterns.

6 Lessons
🔬

Intel OpenVINO

Master Intel OpenVINO — inference toolkit for CPU, GPU, NPU, FPGA. Learn model conversion, optimization, deployment to Intel hardware, and edge inference patterns.

6 Lessons

Memory & Interconnect

Edge AI Hardware

Data Center & Cloud

Why an AI Hardware Track?

AI is now the most expensive workload in the data center. Understanding the silicon is what lets you ship fast, cheap, and reliably.

🎯

Every Major Vendor

NVIDIA, AMD, Apple, Intel, Google TPU, AWS Trainium, Cerebras, Groq, SambaNova, Graphcore, Tenstorrent, Etched, Microsoft Maia.

💻

The Software Stack

CUDA, Triton, ROCm, cuDNN, cuBLAS, NCCL, NIM, Dynamo — the layers between your model and the metal.

Inference Engines

TensorRT-LLM, vLLM, SGLang, TGI, llama.cpp, MLX, ONNX Runtime, OpenVINO — pick the right engine for the job.

🏠

Edge to Data Center

Jetson, Coral, mobile NPUs all the way up to 100kW racks, liquid cooling, and hyperscale AI clusters.