What every AI engineer needs to know about GPUs — Charles Frye, Modal

Why AI Engineers Need to Understand GPUs 00:01

  • Most AI applications have been built on top of model APIs (such as those from OpenAI and Anthropic), with developers staying on the application side of the API boundary
  • API boundaries are important for managing system complexity by isolating different expert domains
  • Databases are a useful analogy: most developers do not build or operate database engines, but they must understand how to use them efficiently (e.g., proper use of indices); the same applies to GPUs for AI engineers
  • As open weights models and open source software improve, it becomes more feasible for teams to run their own models and infrastructure, making GPU understanding increasingly relevant

Key GPU Principles for AI Workloads 04:45

  • The most important GPU feature for AI workloads is the ability to deliver high bandwidth, not low latency, which distinguishes GPUs from most other hardware
  • GPUs prioritize arithmetic (math) bandwidth over memory bandwidth, excelling at raw computational throughput
  • The critical operation is low-precision matrix-matrix multiplication, which runs on specialized hardware (tensor cores on NVIDIA GPUs)
  • The phrase "use the tensor cores" is a core theme: tensor cores far outperform the general-purpose CUDA cores for these computations (see the sketch after this list)
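A minimal sketch of the "use the tensor cores" point, assuming PyTorch and a CUDA-capable NVIDIA GPU (neither is specified in the talk): the same large matmul in fp32 versus bf16. The bf16 path runs on the tensor cores and is typically several times faster; exact numbers depend on the GPU and matrix shapes.

```python
import torch

device = torch.device("cuda")
a32 = torch.randn(8192, 8192, device=device)   # fp32 inputs: general-purpose CUDA cores
b32 = torch.randn(8192, 8192, device=device)
a16, b16 = a32.bfloat16(), b32.bfloat16()       # bf16 inputs: tensor cores

torch.backends.cuda.matmul.allow_tf32 = False   # keep the fp32 case off the tensor cores for contrast

for label, a, b in (("fp32", a32, b32), ("bf16", a16, b16)):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    a @ b                                       # warm-up launch (kernel selection, caches)
    start.record()
    a @ b
    end.record()
    torch.cuda.synchronize()
    print(label, start.elapsed_time(end), "ms")
```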

Bandwidth vs. Latency in Computing 05:42

  • Historically, computer performance improvements were driven by increasing clock speed and reducing latency, but latency improvements stalled in the early 2000s
  • With latency no longer scaling, modern performance gains depend on increasing bandwidth, chiefly through parallelism and concurrency (see the sketch after this list)
  • Parallelism (doing multiple calculations simultaneously each clock cycle) and concurrency (overlapping tasks to keep pipelines busy) are used at all levels, from hardware to programming models
  • GPUs drastically exceed CPUs in parallelism (e.g., NVIDIA H100 can run over 16,000 parallel threads at much lower wattage per thread)
  • GPUs also excel at fast context switching, maintaining high concurrency at the hardware level
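A rough illustration of bandwidth over latency, again assuming PyTorch on a CUDA GPU: batching more independent matrix multiplies barely changes the wall-clock time, because the extra work fills otherwise idle parallel hardware rather than waiting in line.

```python
import time
import torch

device = torch.device("cuda")

def timed_bmm(batch: int, n: int = 1024) -> float:
    """Time a batch of independent (n x n) matrix multiplies."""
    a = torch.randn(batch, n, n, device=device)
    b = torch.randn(batch, n, n, device=device)
    torch.bmm(a, b)                    # warm-up launch
    torch.cuda.synchronize()
    start = time.perf_counter()
    torch.bmm(a, b)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for batch in (1, 8, 64):
    # 64x the work takes far less than 64x the time: latency stays roughly fixed,
    # while delivered bandwidth (work per second) goes up.
    print(f"batch={batch:3d}  {timed_bmm(batch):.4f}s")
```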

Arithmetic Intensity and Practical Implications 12:49

  • The major advantage of GPUs is arithmetic (compute) bandwidth rather than memory bandwidth: they are efficient when the number of calculations per memory load is high, i.e., when arithmetic intensity is high (see the back-of-the-envelope sketch after this list)
  • Language model inference, especially during prompt processing, is well-suited because a large number of operations are performed per memory load
  • Token-by-token decoding is less efficient because each step loads all the model weights for relatively little arithmetic; batching or running many generations at once makes better use of the hardware
  • Smaller models run repeatedly (generating multiple outputs and selecting among them) can match the quality of larger models for certain tasks while using the hardware more efficiently
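A back-of-the-envelope sketch of arithmetic intensity (floating-point operations per byte moved), using made-up but representative layer sizes rather than figures from the talk. Token-by-token decoding looks like a matrix-vector product and is memory-bound; prompt processing or batched generation looks like a matrix-matrix product and can keep the tensor cores fed.

```python
def arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul stored in 16-bit precision."""
    flops = 2 * m * n * k                                       # one multiply and one add per term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)   # read A and B, write C
    return flops / bytes_moved

# Decoding one token at a time is effectively matrix-vector (m = 1):
print(arithmetic_intensity(1, 4096, 4096))      # ~1 FLOP per byte: memory-bound
# A 2048-token prompt (or a large batch) is matrix-matrix (m = 2048):
print(arithmetic_intensity(2048, 4096, 4096))   # ~1000 FLOPs per byte: compute-bound
```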

Modern GPU Features and Optimization Strategies 16:09

  • Latest GPU tensor cores are optimized for low-precision matrix-matrix multiplication, making certain AI tasks almost "free" if formulated correctly (e.g., multi-token prediction, multi-sample queries)
  • Matrix-matrix operations maximize GPU throughput; performance drops dramatically when the same work is expressed as matrix-vector operations (see the timing sketch after this list)
  • Taking advantage of GPU hardware often involves restructuring problems to increase the number of batched or parallel calculations
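A rough timing sketch of the matrix-matrix vs. matrix-vector gap, assuming PyTorch on a CUDA GPU: the same arithmetic expressed as one matrix-matrix multiply versus a loop of matrix-vector multiplies.

```python
import time
import torch

device = torch.device("cuda")
W = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)  # a weight matrix
X = torch.randn(4096, 512, device=device, dtype=torch.bfloat16)   # 512 activation vectors

def timed(fn) -> float:
    fn()                               # warm-up launch
    torch.cuda.synchronize()
    start = time.perf_counter()
    fn()
    torch.cuda.synchronize()
    return time.perf_counter() - start

# Same math, two formulations: one matrix-matrix multiply, or 512 matrix-vector multiplies.
t_mm = timed(lambda: W @ X)
t_mv = timed(lambda: [W @ X[:, i] for i in range(X.shape[1])])
print(f"matrix-matrix: {t_mm:.4f}s   matrix-vector loop: {t_mv:.4f}s")
```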

Recommendations and Further Resources 18:05

  • Practical advice: running smaller models locally and sampling many outputs in a batch with a verification step (quality selection) can deliver both high throughput and high quality (see the best-of-n sketch after this list)
  • Open models and improved tools make self-hosted, hardware-aware AI inference increasingly accessible for engineers
  • A "GPU glossary" resource is available at modal.com/GPU-glossery, offering explanations and links related to GPU architecture and programming
  • Modal provides a serverless platform for data- and compute-intensive workloads, with infrastructure designed to facilitate efficient AI (language model) inference workflows
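A minimal sketch of the "sample widely, then select" recipe mentioned above. The generate and score callables are placeholders for whatever model server and verifier you use; they are not part of Modal's API or anything shown in the talk.

```python
from typing import Callable, List

def best_of_n(
    generate: Callable[[str, int], List[str]],   # prompt, n -> n candidate completions
    score: Callable[[str, str], float],          # prompt, completion -> quality score
    prompt: str,
    n: int = 8,
) -> str:
    """Generate n completions in one batched call and keep the best-scoring one."""
    candidates = generate(prompt, n)             # one batched pass amortizes each weight load over n samples
    return max(candidates, key=lambda completion: score(prompt, completion))
```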