AI Hardware

Master AI hardware end-to-end. 50 topics covering NVIDIA H100/H200/Blackwell, AMD MI300, Apple Silicon, Intel Gaudi, TPUs, Trainium, Cerebras, Groq, the GPU software stack (CUDA, Triton, ROCm, NCCL), inference engines (TensorRT-LLM, vLLM, SGLang, llama.cpp, MLX), memory and interconnect (HBM, NVLink, InfiniBand, RDMA, CXL), edge AI (Jetson, Coral, mobile NPUs), and AI data centers.

50 Topics
300 Lessons
7 Categories
100% Free

AI Hardware is the track we recommend to any engineer whose job involves making a GPU budget argument to finance or sizing an inference cluster. The hardware layer has become a competitive differentiator, not a commodity. The difference between an H100, an A100, an L4, a Trainium instance, and an Apple Neural Engine is not a number on a spec sheet but a stack of real-world tradeoffs: memory bandwidth, interconnect, power envelope, and the software that actually runs on each.

We cover the major accelerator families, the system-level considerations (cooling, power, networking, NVLink versus PCIe), and the economics that determine whether renting is still cheaper than owning at your workload scale. We also cover the small-model and on-device side: mobile NPUs, edge AI, and quantization-for-hardware, because much of the next wave of applied AI runs outside the datacenter. The framing throughout is that hardware choices should follow workload shape, not vendor marketing.
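"Workload shape" can be made concrete with back-of-envelope sizing: for transformer inference, the two numbers that dominate a hardware decision are weight memory and KV-cache memory. A minimal sketch of that arithmetic (the bytes-per-parameter and per-cache-element formulas are standard; the example model dimensions are assumptions for illustration):

```python
def weight_bytes(n_params: float, bytes_per_param: float) -> float:
    """Memory for model weights, e.g. 2 bytes/param for FP16/BF16."""
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Example: a 70B-class model in FP16 with grouped-query attention
# (80 layers, 8 KV heads of dim 128 -- assumed, Llama-70B-like numbers).
weights_gb = weight_bytes(70e9, 2) / 1e9            # ~140 GB: does not fit one 80GB GPU
kv_gb = kv_cache_bytes(80, 8, 128, 8192, 16) / 1e9  # ~43 GB at batch 16, 8k context
print(f"weights ~ {weights_gb:.0f} GB, KV cache ~ {kv_gb:.0f} GB")
```

Run the same arithmetic against your own batch sizes and context lengths before comparing vendor spec sheets; it often settles the accelerator question on memory capacity alone.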

All Topics

50 topics organized into 7 categories spanning the full AI hardware stack.

GPUs

🎯

NVIDIA H100 Mastery

Master the NVIDIA H100 — the workhorse GPU of the LLM era. Learn the Hopper architecture, FP8 training, transformer engine, NVLink, and the patterns that get you peak throughput.

6 Lessons
🚀

NVIDIA H200

Master the H200 — the H100 with 141GB of HBM3e and ~1.4x faster LLM inference. Learn what changed, when it matters, and the migration patterns from H100.

6 Lessons
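The ~1.4x figure above follows almost entirely from memory bandwidth: autoregressive decode is typically bandwidth-bound, so the tokens/s ceiling scales with HBM bandwidth. A rough roofline sketch, using the published peak bandwidths (~3.35 TB/s for H100 SXM, ~4.8 TB/s for H200) and an assumed 70B FP16 model:

```python
def decode_tokens_per_sec(bandwidth_bytes_per_s: float, model_bytes: float) -> float:
    """Bandwidth-bound ceiling for batch-1 decode: each generated token
    must stream all weights from HBM once."""
    return bandwidth_bytes_per_s / model_bytes

model = 70e9 * 2  # 70B params in FP16 ~ 140 GB (assumed example model)
h100 = decode_tokens_per_sec(3.35e12, model)  # ~24 tok/s ceiling
h200 = decode_tokens_per_sec(4.8e12, model)   # ~34 tok/s ceiling
print(f"H100 ~ {h100:.0f} tok/s, H200 ~ {h200:.0f} tok/s, ratio ~ {h200/h100:.2f}x")
```

The ratio 4.8/3.35 ≈ 1.43 is where the "~1.4x faster LLM inference" headline comes from; real workloads land below both ceilings, but usually preserve the ratio.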
🌟

NVIDIA B200/Blackwell

Master the Blackwell B200 and GB200 — NVIDIA's frontier GPUs with FP4, second-gen transformer engine, and fifth-gen NVLink. Learn what 1 exaflop in a rack means for AI.

6 Lessons

NVIDIA A100

Master the A100 — still the backbone of many AI fleets. Learn the Ampere architecture, BF16, structured sparsity, MIG, and when the A100 still beats the H100 on $/throughput.

6 Lessons
🎥

NVIDIA L40S and L4

Master the L40S (Ada workstation/inference) and L4 (efficient inference). Learn when these beat H100/H200 for inference and graphics-AI hybrid workloads.

6 Lessons
🧠

NVIDIA Grace Hopper / Grace Blackwell

Master Grace Hopper (GH200) and Grace Blackwell (GB200) superchips — CPU+GPU sharing memory via NVLink-C2C. Learn unified memory, when it shines, and migration patterns.

6 Lessons
🔥

AMD MI300X / MI325X

Master AMD's flagship AI GPUs. Learn the CDNA architecture, 192GB of HBM3 (256GB of HBM3e on MI325X), the ROCm software stack, and when MI300X/MI325X are a credible alternative to H100/H200.

6 Lessons
📱

Apple Silicon for AI (M3/M4)

Master Apple's M3 and M4 chips for AI work. Learn unified memory, Neural Engine, MLX framework, and the patterns for running 70B models on a MacBook.

6 Lessons
🔬

Intel Gaudi 2 / Gaudi 3

Master Intel Gaudi 2 and Gaudi 3 AI accelerators. Learn the architecture, SynapseAI software, integrated networking, and when Gaudi beats GPUs on $/training.

6 Lessons
🎮

Consumer GPUs for AI (RTX 4090/5090)

Master consumer GPUs for AI: RTX 4090, RTX 5090, dual-GPU rigs. Learn what fits in 24GB, the cost-per-token math, and when consumer GPUs beat datacenter parts.

6 Lessons
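The cost-per-token math mentioned above is a short amortization formula. A sketch of it — every input here (price, lifetime, power draw, electricity rate, throughput) is a hypothetical placeholder to replace with your own measured numbers:

```python
def cost_per_million_tokens(gpu_cost_usd: float, lifetime_hours: float,
                            power_watts: float, usd_per_kwh: float,
                            tokens_per_sec: float) -> float:
    """Amortized $/1M tokens for an owned GPU running flat out.
    Ignores the host system, cooling, and utilization gaps."""
    capex_per_hour = gpu_cost_usd / lifetime_hours
    power_per_hour = (power_watts / 1000) * usd_per_kwh
    tokens_per_hour = tokens_per_sec * 3600
    return (capex_per_hour + power_per_hour) / tokens_per_hour * 1e6

# Hypothetical: $1,800 RTX 4090, 3-year life, 450 W, $0.15/kWh, 2,500 tok/s batched
print(f"${cost_per_million_tokens(1800, 3 * 365 * 24, 450, 0.15, 2500):.3f} per 1M tokens")
```

Comparing that output against datacenter rental rates, at your real utilization, is the whole consumer-versus-datacenter argument in one number.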

Specialized Accelerators

🔭

Google TPU v5p / Trillium

Master Google TPUs — purpose-built ML accelerators. Learn TPU v5p, Trillium (v6), MXU, JAX integration, and the patterns for training and serving on TPU pods.

6 Lessons

AWS Trainium 2

Master AWS Trainium 2 — purpose-built training chip. Learn Neuron SDK, NKI kernels, Trn2 instances, and when Trainium beats GPUs on $/training-hour.

6 Lessons
🔥

AWS Inferentia

Master AWS Inferentia 2 for low-cost inference. Learn Inf2 instances, model compilation, batching patterns, and when Inferentia beats GPUs for inference.

6 Lessons
🏹

Cerebras Wafer-Scale Engine

Master Cerebras WSE — the world's largest chip. Learn the wafer-scale architecture, weight streaming, and the patterns for fastest-on-earth LLM inference.

6 Lessons

Groq LPU

Master the Groq LPU — deterministic, ultra-low-latency inference. Learn the Tensor Streaming Processor, why it's so fast on LLMs, and when Groq fits your workload.

6 Lessons
📊

SambaNova RDU

Master SambaNova RDU (Reconfigurable Dataflow Unit). Learn the dataflow architecture, SambaFlow software, and when RDUs beat GPUs for specific AI workloads.

6 Lessons
🔗

Graphcore IPU

Master Graphcore IPU — Intelligence Processing Unit. Learn the MIMD architecture, Poplar SDK, BOW IPU, and the niches where IPUs excel.

6 Lessons
🔮

Tenstorrent Wormhole / Blackhole

Master Tenstorrent: open-source RISC-V AI hardware. Learn Wormhole and Blackhole chips, TT-Metal SDK, and the open-hardware approach to AI acceleration.

6 Lessons
🛡

Etched Sohu (Transformer ASIC)

Master Etched Sohu — the first transformer-only ASIC. Learn what 'transformer baked in silicon' means, performance claims vs reality, and the implications.

6 Lessons
📝

Microsoft Maia 100

Master Microsoft Maia 100 — Azure's custom AI accelerator. Learn the architecture, Azure integration, and Microsoft's silicon strategy for OpenAI workloads.

6 Lessons

GPU Software Stack

💻

CUDA Programming for AI

Master CUDA C++ for AI work. Learn the programming model, kernels, shared memory, warp-level primitives, and the patterns to write GPU code that beats vendor libraries.

6 Lessons

OpenAI Triton Compiler

Master OpenAI Triton — Python-like GPU programming that compiles to high-performance kernels. Learn block programming, autotuning, and the patterns that beat hand-written CUDA.

6 Lessons
🛡

AMD ROCm Stack

Master ROCm — AMD's open-source GPU compute platform. Learn HIP, ROCm libraries, PyTorch on ROCm, and porting CUDA code to ROCm.

6 Lessons
🧠

cuDNN Deep Learning Library

Master cuDNN — NVIDIA's deep neural network primitives. Learn convolution algorithms, attention, batchnorm, and how to call cuDNN directly when frameworks fall short.

6 Lessons
🔬

cuBLAS and cuSPARSE

Master cuBLAS (dense) and cuSPARSE (sparse) linear algebra on GPU. Learn GEMM tuning, batched ops, mixed precision, and the patterns for max FLOPS.

6 Lessons
🔗

NCCL Collectives

Master NCCL — NVIDIA's multi-GPU communication library. Learn all-reduce, all-gather, broadcast, NVLink/InfiniBand topology, and tuning collectives for distributed training.

6 Lessons
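Tuning collectives starts from the standard ring all-reduce cost model: in the bandwidth term, each of N ranks moves 2(N−1)/N times the buffer size over its slowest link. A sketch of that model (the formula is standard; the link bandwidths are illustrative round numbers — ~450 GB/s per direction for NVLink4, 50 GB/s for 400 Gb/s InfiniBand):

```python
def ring_allreduce_seconds(buffer_bytes: float, n_ranks: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth term of ring all-reduce: each rank transfers
    2*(N-1)/N * buffer over its link. Latency term ignored."""
    return 2 * (n_ranks - 1) / n_ranks * buffer_bytes / link_bytes_per_s

grad_bytes = 7e9 * 2  # gradients of an assumed 7B model in BF16
nvlink = ring_allreduce_seconds(grad_bytes, 8, 450e9)  # 8 GPUs over NVLink
ib = ring_allreduce_seconds(grad_bytes, 8, 50e9)       # same buffer over 400 Gb/s IB
print(f"NVLink ~ {nvlink * 1e3:.0f} ms, InfiniBand ~ {ib * 1e3:.0f} ms")
```

The ~9x gap between those two numbers is why NCCL's topology awareness — keeping traffic on NVLink where it can — matters so much for distributed training step time.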
📥

NVIDIA NIM Microservices

Master NVIDIA NIM — pre-optimized inference microservices. Learn how to deploy NIMs on Kubernetes, customize, and integrate with your existing AI stack.

6 Lessons
🚀

NVIDIA Dynamo

Master NVIDIA Dynamo — distributed inference framework for LLMs. Learn disaggregated prefill/decode, KV cache routing, and the patterns for max throughput at scale.

6 Lessons

Inference Engines

🎯

TensorRT-LLM

Master TensorRT-LLM — NVIDIA's optimized LLM inference engine. Learn engine compilation, FP8/INT4 quantization, in-flight batching, and the patterns for peak throughput.

6 Lessons

vLLM Internals

Go beyond using vLLM — master its internals. Learn PagedAttention, continuous batching, scheduler, KV cache, and the patterns to tune vLLM for your workload.

6 Lessons
🚀

SGLang Fast Inference

Master SGLang — fast LLM serving with structured generation. Learn RadixAttention, the SGLang frontend language, and when SGLang beats vLLM and TGI.

6 Lessons
🤗

HuggingFace TGI

Master HuggingFace Text Generation Inference. Learn deployment, quantization, multi-LoRA serving, and the patterns for production HuggingFace inference.

6 Lessons
🤗

llama.cpp Mastery

Master llama.cpp — the C++ inference engine that runs LLMs on anything. Learn GGUF, quantization formats, Metal/CUDA backends, and tuning for CPU, GPU, and edge.

6 Lessons
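The practical payoff of GGUF quantization is footprint: file size is roughly parameters × bits-per-weight / 8, plus a few percent for quantization scales. A sketch using approximate effective bits-per-weight for common GGUF quant types (the exact rates vary slightly by model architecture):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate model file size; real GGUF files add a few percent
    for quantization scales and layers kept at higher precision."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits/weight for common GGUF quant types
for name, bpw in [("F16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8), ("Q2_K", 2.6)]:
    print(f"70B at {name}: ~ {gguf_size_gb(70e9, bpw):.0f} GB")
```

This is the arithmetic behind "a 70B on a laptop": Q4_K_M brings ~140 GB of FP16 weights down to roughly 42 GB, which fits in 64 GB of unified memory with room for the KV cache.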
🍏

Apple MLX

Master Apple MLX — an array framework optimized for Apple Silicon. Learn unified memory, lazy evaluation, and MLX-LM for fast on-device LLM inference.

6 Lessons
🔗

ONNX Runtime

Master ONNX Runtime — cross-platform inference engine. Learn ONNX format, graph optimization, execution providers (CUDA, TensorRT, CPU), and the deployment patterns.

6 Lessons
🔬

Intel OpenVINO

Master Intel OpenVINO — inference toolkit for CPU, GPU, NPU, FPGA. Learn model conversion, optimization, deployment to Intel hardware, and edge inference patterns.

6 Lessons

Memory & Interconnect

Edge AI Hardware

Data Center & Cloud

Why an AI Hardware Track?

AI is now the most expensive workload in the data center. Understanding the silicon is what lets you ship fast, cheap, and reliably.

🎯

Every Major Vendor

NVIDIA, AMD, Apple, Intel, Google TPU, AWS Trainium, Cerebras, Groq, SambaNova, Graphcore, Tenstorrent, Etched, Microsoft Maia.

💻

The Software Stack

CUDA, Triton, ROCm, cuDNN, cuBLAS, NCCL, NIM, Dynamo — the layers between your model and the metal.

Inference Engines

TensorRT-LLM, vLLM, SGLang, TGI, llama.cpp, MLX, ONNX Runtime, OpenVINO — pick the right engine for the job.

🏠

Edge to Data Center

Jetson, Coral, mobile NPUs all the way up to 100kW racks, liquid cooling, and hyperscale AI clusters.