Intermediate

GPU & Compute Questions

GPU knowledge is foundational for any AI infrastructure role. These 12 questions cover what interviewers actually ask — from GPU architecture fundamentals to practical cost optimization. Each answer is written at the depth expected in a 45-minute technical interview.

Q1: Explain the GPU architecture and why GPUs are better than CPUs for deep learning.


Answer: GPUs are massively parallel processors optimized for throughput, while CPUs are optimized for latency on sequential tasks. A modern GPU like the NVIDIA H100 has 16,896 CUDA cores organized into 132 Streaming Multiprocessors (SMs), compared to a CPU with 64–128 cores.

Deep learning workloads are dominated by matrix multiplications and convolutions — operations that are embarrassingly parallel. A single forward pass through a transformer layer involves multiplying matrices with millions of elements, where each element computation is independent. GPUs exploit this by executing thousands of threads simultaneously.

Key architectural differences that matter for ML:

  • Memory bandwidth: H100 HBM3 delivers 3.35 TB/s vs ~100 GB/s for DDR5 on CPUs. This is critical because ML workloads are often memory-bandwidth bound.
  • Tensor Cores: Specialized hardware for mixed-precision matrix multiply-accumulate operations. An H100 delivers 989 TFLOPS in FP16, vs ~2 TFLOPS for a high-end CPU.
  • SIMT execution: Single Instruction Multiple Threads — thousands of threads execute the same instruction on different data simultaneously.
  • Warp scheduling: GPUs hide memory latency by switching between warps (groups of 32 threads) rather than using large caches like CPUs.
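The "memory-bandwidth bound" claim above can be checked with roofline arithmetic. The sketch below uses the H100 numbers quoted in this answer (989 TFLOPS FP16, 3.35 TB/s); the function name and matrix sizes are illustrative:

```python
# Back-of-envelope roofline check: is a matmul compute-bound or
# memory-bound on an H100? Specs are the ones quoted in the text.
PEAK_FP16_FLOPS = 989e12   # H100 FP16 Tensor Core throughput, dense
HBM_BANDWIDTH = 3.35e12    # bytes/s (3.35 TB/s HBM3)

# Ridge point: FLOPs per byte needed to saturate compute.
RIDGE = PEAK_FP16_FLOPS / HBM_BANDWIDTH  # ~295 FLOPs/byte

def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul in FP16."""
    flops = 2 * m * k * n                              # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

# Large square matmul (training): far above the ridge -> compute-bound.
print(arithmetic_intensity(4096, 4096, 4096))  # ~1365 FLOPs/byte

# Batch-1 matrix-vector (autoregressive decoding): ~1 -> memory-bound.
print(arithmetic_intensity(1, 4096, 4096))
```

This is why large-batch training saturates Tensor Cores while small-batch inference is limited by HBM bandwidth.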

Q2: What is the GPU memory hierarchy and why does it matter for training large models?


Answer: The GPU memory hierarchy from fastest to slowest:

  • Registers: ~256 KB per SM, fastest access (~1 cycle). Each thread has its own register file. Limited quantity means register pressure can reduce occupancy.
  • Shared Memory / L1 Cache: 228 KB per SM on H100, configurable split. Shared among threads in a thread block. ~30 cycles latency. Used for inter-thread communication and data reuse within a block.
  • L2 Cache: 50 MB on H100. Shared across all SMs. ~200 cycles. Caches global memory accesses.
  • HBM (Global Memory): 80 GB on H100 SXM. ~400 cycles latency but massive bandwidth (3.35 TB/s). This is where model weights, activations, and gradients live.

Why this matters for large models: A 70B parameter model in FP16 requires ~140 GB just for weights, exceeding the 80 GB HBM on a single H100. You must use model parallelism, offloading, or quantization. During training, you also need memory for optimizer states (Adam keeps two FP32 states per parameter — momentum and variance), gradients (same size as the weights), and activations (proportional to batch size and sequence length). This is why techniques like gradient checkpointing (recompute activations instead of storing them), mixed precision, and ZeRO optimizer sharding exist.
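The accounting above can be written out as a quick estimator. This sketch assumes FP16 weights and gradients plus FP32 Adam states (8 bytes per parameter), and deliberately excludes activations, which depend on batch size and sequence length:

```python
# Rough per-component training-memory estimate (decimal GB).
# Assumes FP16 weights/gradients and FP32 Adam momentum + variance;
# activation memory is workload-dependent and excluded.
def training_memory_gb(n_params):
    weights = n_params * 2 / 1e9     # FP16 weights: 2 bytes/param
    gradients = n_params * 2 / 1e9   # FP16 gradients: 2 bytes/param
    optimizer = n_params * 8 / 1e9   # Adam m + v in FP32: 8 bytes/param
    return weights, gradients, optimizer

w, g, o = training_memory_gb(70e9)
print(w, g, o, w + g + o)  # 140.0 140.0 560.0 -> 840.0 GB before activations
```

Even before activations, a 70B model needs over ten H100s' worth of HBM for training state, which is exactly what ZeRO sharding addresses.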

Q3: What is CUDA and how does the CUDA programming model work?


Answer: CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and API that allows developers to use GPUs for general-purpose computing. The programming model has three key abstractions:

  • Kernels: Functions that execute on the GPU. Launched with a grid configuration specifying how many thread blocks and threads per block to use. Example: matmul_kernel<<<grid_dim, block_dim>>>(A, B, C)
  • Thread hierarchy: Threads are organized into blocks (up to 1,024 threads), blocks into grids. Threads within a block can synchronize and share memory. Threads across blocks cannot synchronize during kernel execution.
  • Memory model: Each thread has private registers, each block has shared memory, and all threads access global memory (HBM). The programmer manages data movement between host (CPU) and device (GPU) memory.
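The thread hierarchy can be illustrated without a GPU. This pure-Python sketch (function names are mine, not a real API) mirrors the standard 1-D CUDA indexing pattern blockIdx.x * blockDim.x + threadIdx.x, including the bounds check kernels use for the tail block:

```python
# Simulate a 1-D CUDA kernel launch: run `kernel` once per thread,
# deriving each thread's global index from its block and thread IDs.
def launch_1d(grid_dim, block_dim, kernel, *args):
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx * block_dim + thread_idx, *args)

def add_kernel(i, a, b, out):
    if i < len(out):          # bounds check: tail block has idle threads
        out[i] = a[i] + b[i]

a = list(range(10))
b = list(range(10))
out = [0] * 10
launch_1d(3, 4, add_kernel, a, b, out)  # 3 blocks x 4 threads = 12 threads
print(out)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

On a real GPU the 12 threads run concurrently and the launch configuration is the `<<<grid_dim, block_dim>>>` syntax shown above.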

For AI infrastructure engineers, you rarely write raw CUDA kernels. But you need to understand CUDA concepts because: (1) debugging GPU errors requires understanding of CUDA error codes and memory models, (2) profiling tools like Nsight Systems report metrics in CUDA terminology, (3) understanding kernel launch overhead and memory transfers helps diagnose performance issues, and (4) CUDA driver and toolkit versions must be compatible with your ML framework versions.

Q4: How do you diagnose and resolve GPU out-of-memory (OOM) errors during training?


Answer: GPU OOM is the most common training failure. Systematic diagnosis:

Step 1 — Understand what consumes GPU memory during training:

  • Model parameters: 2 bytes per parameter in FP16. A 7B model = 14 GB.
  • Optimizer states: Adam stores momentum and variance = 2x parameters in FP32 = 56 GB for 7B model.
  • Gradients: Same size as parameters = 14 GB in FP16.
  • Activations: Proportional to batch_size x sequence_length x hidden_dim x num_layers. This is often the largest consumer and the first thing to optimize.
  • CUDA context and fragmentation: ~500 MB–1 GB overhead. Memory fragmentation can waste 10–20% of available memory.

Step 2 — Solutions in order of impact:

  1. Reduce batch size: Directly reduces activation memory. Use gradient accumulation to maintain effective batch size.
  2. Enable gradient checkpointing: Trades compute for memory by recomputing activations during backward pass instead of storing them. Reduces activation memory by ~60% with ~30% compute overhead.
  3. Use mixed precision (FP16/BF16): Halves memory for weights, gradients, and activations. BF16 preferred over FP16 because it has the same exponent range as FP32, avoiding overflow issues.
  4. Use ZeRO optimizer sharding: DeepSpeed ZeRO-1 shards optimizer states across GPUs, ZeRO-2 adds gradient sharding, ZeRO-3 adds parameter sharding. Can reduce per-GPU memory by 4–8x.
  5. CPU offloading: Move optimizer states or parameters to CPU memory. Increases training time due to PCIe transfers but enables training models that would otherwise not fit.
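Solution 1's claim that gradient accumulation preserves the effective batch size can be verified with a toy scalar model (loss = mean squared error; no framework needed): averaging equal-sized micro-batch gradients reproduces the full-batch gradient exactly.

```python
# Gradient accumulation: summing scaled micro-batch gradients gives
# the same update as one large batch, at a fraction of the activation memory.
def grad(w, xs, ys):
    """d/dw of mean squared error for model y_hat = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full_grad = grad(w, xs, ys)           # one big batch of 4

accum = 0.0                           # two micro-batches of 2
for mb_x, mb_y in [(xs[:2], ys[:2]), (xs[2:], ys[2:])]:
    accum += grad(w, mb_x, mb_y) / 2  # divide by number of micro-batches

print(full_grad, accum)  # identical: -22.5 -22.5
```

In a real framework this corresponds to calling backward() per micro-batch and stepping the optimizer only every N micro-batches.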

Q5: Compare NVIDIA GPU generations for AI training: V100, A100, H100, and B200.

| Feature | V100 (2017) | A100 (2020) | H100 (2022) | B200 (2024) |
| --- | --- | --- | --- | --- |
| HBM | 32 GB HBM2 | 80 GB HBM2e | 80 GB HBM3 | 192 GB HBM3e |
| Memory BW | 900 GB/s | 2.0 TB/s | 3.35 TB/s | 8.0 TB/s |
| FP16 TFLOPS | 125 | 312 | 989 | 2,250 |
| Interconnect | NVLink 2.0 (300 GB/s) | NVLink 3.0 (600 GB/s) | NVLink 4.0 (900 GB/s) | NVLink 5.0 (1.8 TB/s) |
| Tensor Cores | 1st gen | 3rd gen (TF32, BF16) | 4th gen (FP8) | 5th gen (FP4, FP6) |
| Cloud Cost/hr | ~$1.50 | ~$3.00 | ~$4.50 | ~$7.50 |

Key insight for interviews: Raw TFLOPS are not the only metric. Memory bandwidth often matters more because large model training is memory-bandwidth bound, not compute bound. The A100-to-H100 jump was significant because HBM3 provided 67% more bandwidth, and FP8 Tensor Cores enabled further speedups. For inference workloads, the B200's 192 GB HBM3e is transformative because it can fit a 70B model in FP16 on a single GPU without sharding.
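One way to make the "raw TFLOPS are not the only metric" point concrete is to normalize by price. Using the approximate cloud costs and FP16 figures from the comparison above (ballpark numbers, not quotes):

```python
# FP16 TFLOPS per $/hr, using the approximate figures cited in the text.
specs = {  # gpu: (FP16 TFLOPS, approx cloud $/hr)
    "V100": (125, 1.50),
    "A100": (312, 3.00),
    "H100": (989, 4.50),
    "B200": (2250, 7.50),
}
perf_per_dollar = {gpu: tflops / price for gpu, (tflops, price) in specs.items()}
print(perf_per_dollar)  # newer GPUs cost more per hour but less per FLOP
```

Each generation is cheaper per unit of compute despite the higher hourly rate, which is why training jobs migrate to new hardware quickly.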

Q6: What is mixed precision training and how does it work?


Answer: Mixed precision training uses lower-precision formats (FP16 or BF16) for most operations while keeping a master copy of weights in FP32. This provides three benefits: (1) 2x memory reduction for weights, gradients, and activations, (2) 2–3x speedup from Tensor Core acceleration, and (3) reduced memory bandwidth pressure.

How it works (AMP — Automatic Mixed Precision):

  1. Maintain FP32 master weights
  2. Cast weights to FP16/BF16 for forward and backward pass
  3. Compute gradients in FP16/BF16
  4. Scale loss before backward pass to prevent gradient underflow (loss scaling)
  5. Unscale gradients and update FP32 master weights

FP16 vs BF16: FP16 has 5 exponent bits and 10 mantissa bits — higher precision but smaller dynamic range. BF16 has 8 exponent bits (same as FP32) and 7 mantissa bits — larger dynamic range but lower precision. BF16 is preferred for training because it eliminates the need for loss scaling (no overflow/underflow issues). FP16 requires careful loss scaling management.
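The dynamic ranges follow directly from the bit layouts. A short derivation using the standard IEEE-style formulas (bias = 2^(e-1) − 1; largest normal = (2 − 2^−m) × 2^bias):

```python
# Dynamic range of FP16 vs BF16, derived from exponent/mantissa bits.
def max_normal(exp_bits, mantissa_bits):
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -mantissa_bits) * 2 ** bias

def min_normal(exp_bits):
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias)

print(max_normal(5, 10))  # FP16 max: 65504.0 -- large gradients overflow
print(max_normal(8, 7))   # BF16 max: ~3.4e38, same range as FP32
print(min_normal(5))      # FP16 smallest normal: ~6.1e-05 -- underflow risk
```

The FP16 ceiling of 65,504 and floor of ~6e-5 are exactly why FP16 training needs loss scaling, while BF16's FP32-sized exponent does not.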

FP8 (H100+): Two variants: E4M3 (4 exponent, 3 mantissa) for forward pass and E5M2 (5 exponent, 2 mantissa) for backward pass. Provides 2x speedup over FP16 on Tensor Cores but requires careful calibration of scaling factors per tensor.

Q7: What is NVLink and why is it important for multi-GPU training?


Answer: NVLink is NVIDIA's proprietary high-bandwidth, low-latency GPU-to-GPU interconnect. It bypasses the PCIe bus, which is the bottleneck for multi-GPU communication.

Bandwidth comparison:

  • PCIe Gen4 x16: 32 GB/s per direction (64 GB/s bidirectional)
  • PCIe Gen5 x16: 64 GB/s per direction (128 GB/s bidirectional)
  • NVLink 4.0 (H100): 900 GB/s total (18 links x 25 GB/s each direction)

Why it matters: Distributed training requires GPUs to exchange gradients (data parallelism) or activations (model/pipeline parallelism) every iteration. For a 7B parameter model in FP16, an AllReduce operation exchanges ~28 GB of gradient data. On PCIe Gen4, this takes ~875 ms. On NVLink 4.0, it takes ~31 ms — a 28x speedup that directly translates to faster training.
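The timing estimate above is simple division, reproduced here so the assumptions are explicit: ring AllReduce moves roughly 2x the gradient payload over the wire, and the link rates are the per-direction PCIe Gen4 and aggregate NVLink 4.0 figures from the bandwidth comparison.

```python
# Estimated AllReduce time for a 7B-parameter FP16 model.
def allreduce_ms(payload_gb, link_gb_per_s):
    return payload_gb / link_gb_per_s * 1000

grads_gb = 7e9 * 2 / 1e9   # 14 GB of FP16 gradients
wire_gb = 2 * grads_gb     # ring AllReduce moves ~2x the payload: ~28 GB

print(allreduce_ms(wire_gb, 32))   # PCIe Gen4 x16: 875.0 ms
print(allreduce_ms(wire_gb, 900))  # NVLink 4.0: ~31 ms
```

This ignores latency and algorithm overhead, so real numbers are somewhat worse, but the ~28x gap between the interconnects holds.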

NVSwitch: In DGX systems, NVSwitch provides all-to-all NVLink connectivity. Without NVSwitch, NVLink uses point-to-point connections, so GPU 0 cannot directly communicate with GPU 7 in an 8-GPU system. NVSwitch acts as a crossbar switch, giving every GPU full NVLink bandwidth to every other GPU. The DGX H100 has 4 NVSwitches connecting 8 H100 GPUs with 900 GB/s all-to-all bandwidth.

Q8: How do you configure multi-GPU setups within a single node? What topologies exist?


Answer: Multi-GPU topologies within a node determine communication patterns and performance:

  • PCIe-only: GPUs connected via PCIe switches. Cheapest but slowest inter-GPU communication. Common in consumer and entry-level servers. Bandwidth limited to PCIe gen speed.
  • NVLink pairs: Some GPUs connected via NVLink, others via PCIe. Common in workstation configurations. Need topology-aware placement to ensure communicating GPUs share NVLink.
  • NVLink + NVSwitch (DGX): All GPUs connected via NVSwitch for full-bisection bandwidth. The gold standard for training. DGX H100: 8x H100 GPUs, each with 900 GB/s to every other GPU.
  • NVLink Domain (GH200 / GB200 NVL72): NVIDIA's newest architecture connects up to 72 GPUs via NVLink in a single domain. Treats 72 GPUs as a single memory space with 13.5 TB of unified HBM. Designed for trillion-parameter models.

How to check topology: nvidia-smi topo -m shows the interconnect between GPUs (NVLink, PHB, SYS, etc.). Understanding the output is essential for debugging communication bottlenecks. In interviews, be prepared to read this output and explain why GPU placement matters for training performance.

Q9: Compare GPUs vs TPUs vs custom AI accelerators. When would you choose each?

| Feature | NVIDIA GPUs | Google TPUs | Custom (Trainium, Gaudi) |
| --- | --- | --- | --- |
| Flexibility | Runs any framework, any model, CUDA ecosystem | Optimized for TensorFlow/JAX, limited PyTorch support | Framework-specific, growing ecosystem |
| Performance | Highest for most workloads, best for research | Excellent for large-batch training, matmul-heavy models | Competitive for supported model architectures |
| Cost | Most expensive, high demand drives premium pricing | Competitive, available on-demand on GCP | Significantly cheaper (Trainium ~40% less than GPU equivalent) |
| Ecosystem | Largest: CUDA, cuDNN, TensorRT, Triton, NCCL | GCP-only, XLA compiler, limited tooling | Cloud-specific, limited third-party support |
| Best for | Research, production, any workload | Large-scale training on GCP, TPU pods | Cost-optimized training for supported models |

Interview answer framework: Choose GPUs when you need maximum flexibility, bleeding-edge model support, or your team has CUDA expertise. Choose TPUs when you are on GCP, training large transformer models, and want cost-effective scale (TPU v5p pods provide 8,960 chips in a single cluster). Choose custom accelerators (AWS Trainium, Intel Gaudi) when cost optimization is the primary goal and your model architecture is well-supported.

Q10: How do you monitor GPU health and utilization in production?


Answer: GPU monitoring involves three layers:

1. Hardware health (NVIDIA DCGM — Data Center GPU Manager):

  • GPU temperature, power draw, fan speed
  • ECC memory errors (correctable and uncorrectable). Correctable errors are normal; uncorrectable errors indicate failing memory.
  • XID errors: GPU hardware/driver errors. XID 79 (GPU fallen off the bus) and XID 48 (double-bit ECC error) require GPU replacement.
  • NVLink errors: CRC errors, replay counts. High replay counts indicate link degradation.

2. Utilization metrics:

  • GPU utilization %: Percentage of time the GPU has active kernels. Low utilization (<70%) during training usually indicates a data loading bottleneck or CPU preprocessing bottleneck.
  • GPU memory utilization: How much HBM is allocated. Near-100% is normal during training.
  • Tensor Core utilization: Available via Nsight or DCGM. Low Tensor Core utilization despite high GPU utilization means your workload is not using mixed precision or has non-matmul bottlenecks.
  • SM occupancy: Percentage of warps active per SM. Low occupancy wastes GPU potential.

3. Infrastructure tooling:

  • DCGM-exporter + Prometheus + Grafana: Standard stack. DCGM-exporter exposes GPU metrics as Prometheus endpoints. Build dashboards showing per-GPU utilization, memory, temperature, and error rates.
  • nvidia-smi: CLI tool for quick checks. nvidia-smi dmon for continuous monitoring. Not suitable for production — use DCGM.
  • Alert thresholds: Temperature >85C, uncorrectable ECC errors >0, GPU utilization <50% for >10 min during training, NVLink errors increasing.

Q11: How do you estimate the cost of a GPU training run?


Answer: Cost estimation is a critical interview skill. Here is the framework:

Formula: Cost = (Number of GPUs) x (Hours per GPU) x (Cost per GPU-hour)

Example: Training a 7B parameter LLM on 1 trillion tokens

  1. Compute estimate: Use the Chinchilla scaling law approximation: ~6 x N x D FLOPs, where N = parameters, D = tokens. 6 x 7B x 1T = 42e21 FLOPs = 42 ZettaFLOPs.
  2. GPU throughput: H100 achieves ~500 TFLOPS sustained (out of 989 theoretical FP16) for LLM training. Per GPU per second: 500e12 FLOPs.
  3. GPU-hours: 42e21 / 500e12 = 84e6 seconds = 23,333 GPU-hours.
  4. With 64 GPUs: 23,333 / 64 = 365 hours = ~15 days wall-clock time.
  5. Cost at $4.50/GPU-hour (on-demand): 23,333 x $4.50 = ~$105,000.
  6. Cost at $1.50/GPU-hour (spot/reserved): ~$35,000.
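The six steps above fold into one reusable function. The sustained throughput (500 TFLOPS per H100) and prices are the assumptions stated in the example, not fixed constants:

```python
# Training cost estimate via the Chinchilla approximation: 6 * N * D FLOPs.
def training_cost(n_params, n_tokens, sustained_flops=500e12,
                  n_gpus=64, price_per_gpu_hour=4.50):
    total_flops = 6 * n_params * n_tokens        # compute budget
    gpu_hours = total_flops / sustained_flops / 3600
    wall_clock_days = gpu_hours / n_gpus / 24    # perfect-scaling assumption
    dollars = gpu_hours * price_per_gpu_hour
    return gpu_hours, wall_clock_days, dollars

hours, days, cost = training_cost(7e9, 1e12)     # 7B params, 1T tokens
print(round(hours), round(days, 1), round(cost)) # ~23333 h, ~15.2 days, ~$105000
```

Note the wall-clock step assumes linear scaling across 64 GPUs; communication overhead makes real runs somewhat longer.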

Cost optimization strategies:

  • Use spot/preemptible instances (60–70% savings) with checkpointing every 15–30 minutes
  • Reserved instances for predictable workloads (30–40% savings)
  • Right-size GPU selection: do not use H100s for workloads that fit on A10s
  • Mixed precision training reduces compute time by 2–3x
  • Efficient data loading to maximize GPU utilization above 90%

Q12: What is Multi-Instance GPU (MIG) and when would you use it?


Answer: MIG (Multi-Instance GPU) is an A100/H100 feature that partitions a single GPU into up to 7 isolated instances, each with dedicated compute, memory, and memory bandwidth. Each instance appears as a separate GPU to the application.

MIG profiles on H100 (80 GB):

  • 1g.10gb: 1/7 of GPU, 10 GB HBM — for small inference models
  • 2g.20gb: 2/7 of GPU, 20 GB HBM — for medium inference or development
  • 3g.40gb: 3/7 of GPU, 40 GB HBM — for larger inference workloads
  • 7g.80gb: Full GPU, 80 GB HBM — equivalent to non-MIG mode
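A common planning task is picking the smallest profile that fits a model. This sketch uses the H100 profile sizes listed above; the 1.2x headroom factor (for KV cache and CUDA context) and the function name are my assumptions, not an NVIDIA rule:

```python
# Pick the smallest MIG profile whose HBM fits a model plus headroom.
PROFILES = [("1g.10gb", 10), ("2g.20gb", 20), ("3g.40gb", 40), ("7g.80gb", 80)]

def smallest_profile(model_gb, headroom=1.2):
    """headroom covers KV cache / CUDA context; 1.2x is an assumption."""
    need = model_gb * headroom
    for name, hbm_gb in PROFILES:   # profiles sorted smallest-first
        if hbm_gb >= need:
            return name
    return None                     # does not fit on one GPU

print(smallest_profile(14))  # 7B model in FP16 (~14 GB) -> '2g.20gb'
print(smallest_profile(70))  # 70B in INT8 (~70 GB) -> None, needs sharding
```

The same logic generalizes to bin-packing many models across a fleet of MIG-enabled GPUs.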

When to use MIG:

  • Multi-tenant inference: Run multiple small models on one GPU with hardware isolation. Each tenant gets guaranteed resources and cannot interfere with others.
  • Development environments: Give each developer a GPU slice instead of a full GPU. More efficient than time-sharing because MIG provides memory isolation.
  • Mixed workloads: Run inference and light training on the same GPU without contention.

When NOT to use MIG: Training large models (you want the full GPU), workloads that need all the HBM, or when NVLink is required (MIG instances cannot use NVLink to other GPUs).