Intermediate

GPU Computing Fundamentals

Master the foundations of GPU computing: CUDA programming model, GPU architecture, memory hierarchy, and parallel execution patterns. These concepts are essential for understanding how deep learning workloads are accelerated on NVIDIA hardware.

GPU Architecture Overview

NVIDIA GPUs are designed for massively parallel computation. Understanding the hardware architecture helps you write efficient deep learning code and debug performance bottlenecks.

# NVIDIA GPU Architecture - Key Components

gpu_architecture = {
    "Streaming Multiprocessors (SMs)": {
        "description": "The fundamental compute unit of NVIDIA GPUs",
        "contains": [
            "CUDA Cores - handle FP32/INT32 operations",
            "Tensor Cores - specialized for matrix multiply-accumulate (MMA)",
            "RT Cores - ray tracing (not used for DL training)",
            "Shared Memory / L1 Cache - fast on-chip storage",
            "Warp Schedulers - manage thread execution"
        ],
        "example": "A100 has 108 SMs, each with 64 CUDA cores = 6,912 total"
    },
    "Memory Hierarchy": {
        "registers": "Fastest, per-thread, ~256KB per SM",
        "shared_memory": "Fast, per-block, 48-164KB per SM (configurable)",
        "l1_cache": "On-chip, combined with shared memory",
        "l2_cache": "On-chip, shared across all SMs (e.g., 40MB on A100)",
        "global_memory": "HBM2/HBM2e, high bandwidth (e.g., 80GB on A100)",
        "bandwidth": "A100: 2TB/s HBM2e bandwidth"
    },
    "Key GPU Families for Deep Learning": {
        "Data Center": "A100, H100, H200 (training & inference)",
        "Professional": "RTX 6000 Ada, A6000 (workstations)",
        "Consumer": "RTX 4090, RTX 3090 (development & small training)",
        "Edge": "Jetson Orin (embedded inference)"
    }
}
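The SM and core counts above translate directly into peak throughput. As a back-of-the-envelope sketch (the 1.41 GHz boost clock is an assumed A100 spec, not stated above):

```python
# Derive A100 peak FP32 throughput from the SM figures in this section.
sms = 108                      # A100 SM count
fp32_cores_per_sm = 64         # CUDA cores per SM
boost_clock_hz = 1.41e9        # assumed A100 boost clock
flops_per_core_per_cycle = 2   # one fused multiply-add = 2 FLOPs

cuda_cores = sms * fp32_cores_per_sm
peak_tflops = cuda_cores * flops_per_core_per_cycle * boost_clock_hz / 1e12
print(cuda_cores)             # 6912
print(round(peak_tflops, 1))  # 19.5
```

This reproduces the 6,912-core figure above and the A100's 19.5 TFLOPS FP32 number quoted later in this section.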

CUDA Programming Model

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform. While deep learning practitioners rarely write raw CUDA kernels, understanding the model is critical for the certification and for optimizing performance.

# CUDA Thread Hierarchy

cuda_model = {
    "Thread": {
        "description": "Smallest unit of execution",
        "identified_by": "threadIdx.x, threadIdx.y, threadIdx.z",
        "note": "Each thread has its own registers and local memory"
    },
    "Block (Thread Block)": {
        "description": "Group of threads that execute on a single SM",
        "identified_by": "blockIdx.x, blockIdx.y, blockIdx.z",
        "max_threads": "1024 threads per block",
        "shared_memory": "Threads in same block can share data via shared memory",
        "synchronization": "Threads in same block can synchronize with __syncthreads()"
    },
    "Grid": {
        "description": "Collection of blocks that execute a kernel",
        "dimensions": "Can be 1D, 2D, or 3D",
        "note": "Blocks in a grid execute independently (no cross-block sync)"
    },
    "Warp": {
        "description": "Group of 32 threads executed in lockstep (SIMT)",
        "key_concept": "All 32 threads in a warp execute the same instruction",
        "divergence": "If threads in a warp take different branches, both paths execute serially",
        "importance": "Warp divergence is a major performance concern"
    }
}
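The index arithmetic implied by this hierarchy can be sketched in plain Python (function names like `global_thread_id` are illustrative, not CUDA API):

```python
WARP_SIZE = 32  # fixed warp width on NVIDIA GPUs

def global_thread_id(block_idx, block_dim, thread_idx):
    # mirrors the 1D CUDA pattern: idx = blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def warp_of(thread_idx):
    # threads 0-31 of a block form warp 0, threads 32-63 form warp 1, ...
    return thread_idx // WARP_SIZE

print(global_thread_id(block_idx=2, block_dim=256, thread_idx=5))  # 517
print(warp_of(5), warp_of(40))  # 0 1
```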

# Example: Launching a CUDA Kernel
"""
// C++ CUDA kernel for vector addition
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Launch with 256 threads per block
int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
"""

Memory Hierarchy Deep Dive

Memory management is one of the most important aspects of GPU performance. Choosing the right memory type can make an orders-of-magnitude difference in performance.

# GPU Memory Types and When to Use Them

memory_hierarchy = {
    "Registers (Fastest)": {
        "speed": "~0 cycle latency",
        "scope": "Per-thread",
        "size": "~256KB per SM",
        "use_case": "Local variables, loop counters",
        "tip": "Compiler manages automatically; avoid register spilling"
    },
    "Shared Memory (Very Fast)": {
        "speed": "~5 cycle latency",
        "scope": "Per-block (all threads in block can access)",
        "size": "48-164KB per SM (configurable)",
        "use_case": "Data shared between threads in a block",
        "tip": "Use for tiled matrix multiplication, reduction operations"
    },
    "L1 / L2 Cache": {
        "speed": "L1: ~30 cycles, L2: ~200 cycles",
        "scope": "L1 per-SM, L2 shared across GPU",
        "size": "L1: combined with shared mem, L2: 6-40MB",
        "use_case": "Automatic caching of global memory accesses",
        "tip": "Coalesced memory access patterns maximize cache efficiency"
    },
    "Global Memory (HBM)": {
        "speed": "~400-600 cycle latency",
        "scope": "All threads on GPU",
        "size": "16-80GB (depending on GPU model)",
        "use_case": "Model weights, input data, output buffers",
        "tip": "Coalesced access critical; avoid random access patterns"
    },
    "Host (CPU) Memory": {
        "speed": "~10,000+ cycles (PCIe transfer)",
        "scope": "CPU only (must transfer to GPU)",
        "use_case": "Data loading, preprocessing before GPU transfer",
        "tip": "Minimize host-device transfers; use pinned memory for faster copies"
    }
}
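The bandwidth gap at the bottom of this hierarchy is why the "minimize host-device transfers" tip matters. A rough sketch (the 2 TB/s HBM2e figure is from this section; the ~32 GB/s PCIe 4.0 x16 figure is an assumed nominal value):

```python
def transfer_ms(gigabytes, bandwidth_gb_per_s):
    # time = bytes / bandwidth, reported in milliseconds
    return gigabytes / bandwidth_gb_per_s * 1000

print(transfer_ms(1, 2000))  # 0.5   -> 1 GB through A100 HBM2e (~2 TB/s)
print(transfer_ms(1, 32))    # 31.25 -> 1 GB over PCIe 4.0 x16 (~32 GB/s, assumed)
```

Moving the same gigabyte over PCIe costs roughly 60x more wall-clock time than reading it from HBM, before any latency overheads.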

# Key Concept: Memory Coalescing
"""
GOOD: Threads access consecutive memory addresses (coalesced)
  Thread 0 reads addr[0], Thread 1 reads addr[1], ...
  -> Single memory transaction for entire warp

BAD: Threads access scattered addresses (uncoalesced)
  Thread 0 reads addr[0], Thread 1 reads addr[100], ...
  -> Multiple memory transactions, wastes bandwidth
"""

GPU vs CPU: When GPUs Win

GPU Strengths

Massively parallel tasks: Matrix multiplications, convolutions, element-wise operations. Deep learning training and inference are ideal GPU workloads because they involve millions of independent arithmetic operations.

CPU Strengths

Sequential, branching logic: Data preprocessing, file I/O, complex control flow. CPUs have larger caches, higher clock speeds per core, and better branch prediction for serial tasks.

Tensor Cores

Matrix multiply-accumulate: Tensor Cores perform small matrix multiply-accumulate operations in hardware (e.g., a 4x4 FP16 tile per clock on Volta). They are the key to mixed precision training (FP16 compute, FP32 accumulate) and deliver roughly 10-20x higher throughput than CUDA cores for DL matrix math.

Key Metric: FLOPS

A100: 312 TFLOPS (FP16 Tensor), 19.5 TFLOPS (FP32). H100: 989 TFLOPS (FP16 Tensor), 67 TFLOPS (FP32). Compare to CPU: ~1 TFLOPS FP32. GPUs are 20-1000x faster for parallel math.
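The figures above are where the "10-20x" Tensor Core claim comes from:

```python
# Ratio of Tensor Core FP16 throughput to CUDA-core FP32 throughput (figures above)
print(312 / 19.5)          # 16.0 -> A100
print(round(989 / 67, 1))  # 14.8 -> H100
```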

Practical GPU Commands

# Essential GPU commands for certification prep

# 1. Check GPU status
# $ nvidia-smi
"""
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03   Driver Version: 535.129.03   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  NVIDIA A100-SXM4   On    | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0    52W / 400W |   2048MiB / 81920MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
"""

# 2. Check CUDA availability in PyTorch
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
print(f"GPU memory: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")

# 3. Check CUDA availability in TensorFlow
import tensorflow as tf
print(f"GPUs: {tf.config.list_physical_devices('GPU')}")
print(f"Built with CUDA: {tf.test.is_built_with_cuda()}")

# 4. Monitor GPU in real-time
# $ watch -n 1 nvidia-smi

# 5. Check CUDA toolkit version
# $ nvcc --version

Practice Questions

💡
Test your understanding:
  1. What is the difference between a CUDA core and a Tensor Core? When is each used?
  2. Explain the thread hierarchy: thread, warp, block, grid. What is the maximum number of threads per block?
  3. Why is memory coalescing important for GPU performance? Give an example of coalesced vs uncoalesced access.
  4. What is warp divergence and why does it reduce performance?
  5. Compare shared memory and global memory in terms of latency, scope, and typical use cases.
  6. An A100 has 80GB of HBM2e. If your model weights are 20GB and your batch requires 30GB of activations, will it fit? What strategies can reduce memory usage?
  7. What command would you use to check GPU utilization and memory usage from the terminal?

Key Takeaways

💡
  • GPUs achieve massive parallelism through thousands of CUDA cores organized into Streaming Multiprocessors (SMs)
  • The CUDA thread hierarchy (thread, warp, block, grid) maps software parallelism to hardware
  • Memory hierarchy matters: registers are fastest, global memory is slowest — coalesced access is critical
  • Tensor Cores provide 10-20x speedup for matrix operations used in deep learning
  • Use nvidia-smi to monitor GPU utilization and torch.cuda or tf.config to verify GPU access in code
  • Warp divergence (threads in a warp taking different branches) is a major performance concern