Most AI applications have been built on top of model APIs (e.g., OpenAI, Anthropic), with developers staying on the application side of the API boundary
API boundaries are important for managing system complexity by isolating different expert domains
Databases offer an analogy: most developers do not build or manage databases, but they must understand how to use them efficiently (e.g., proper use of indices); similar principles apply to GPUs for AI engineers
As open weights models and open source software improve, it becomes more feasible for teams to run their own models and infrastructure, making GPU understanding increasingly relevant
The most important GPU feature for AI workloads is the ability to deliver high bandwidth, not low latency, which distinguishes GPUs from most other hardware
GPUs prioritize arithmetic (math) bandwidth over memory bandwidth, excelling at computational throughput, especially for low-precision matrix-matrix multiplication
The critical operation is low-precision matrix-matrix multiplication, which runs on specialized hardware (tensor cores in NVIDIA GPUs)
The phrase "use the tensor cores" is a core theme: tensor cores outperform general-purpose CUDA cores for these computations
Historically, computer performance improvements were driven by increasing clock speed and reducing latency, but latency improvements stalled in the early 2000s
With latency no longer scaling, modern performance gains depend on increasing bandwidth, notably through parallelism and concurrency
Parallelism (doing multiple calculations simultaneously each clock cycle) and concurrency (overlapping tasks to keep pipelines busy) are used at all levels, from hardware to programming models
GPUs drastically exceed CPUs in parallelism (e.g., an NVIDIA H100 can run over 16,000 parallel threads, at much lower wattage per thread than a CPU)
GPUs also excel at fast context switching, maintaining high concurrency at the hardware level
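To make the two terms concrete, here is a small PyTorch-flavored sketch (shapes and names are my own, purely illustrative): parallelism shows up as one batched kernel covering many independent products per launch, and concurrency shows up as independent work enqueued on separate CUDA streams so the device can overlap it.

```python
import torch

x = torch.randn(64, 1024, 1024, device="cuda")  # 64 independent matrices
w = torch.randn(64, 1024, 1024, device="cuda")

# Parallelism: one batched matmul launches enough threads to work on all
# 64 products at once, rather than handling them one after another.
y = torch.bmm(x, w)

# Concurrency: independent work placed on separate streams may overlap on
# the device, keeping its pipelines busy.
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.cuda.stream(s1):
    y1 = x[0] @ w[0]
with torch.cuda.stream(s2):
    y2 = x[1] @ w[1]
torch.cuda.synchronize()  # wait for both streams before using y1 and y2
```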
Arithmetic Intensity and Practical Implications 12:49
The major advantage of GPUs is arithmetic (compute) bandwidth rather than memory bandwidth: efficient when the number of calculations per memory load is high (high arithmetic intensity)
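A back-of-the-envelope sketch of arithmetic intensity (the formula is the standard FLOPs-per-byte ratio; the shapes below are made up for illustration): a matmul does roughly 2·M·K·N floating-point operations while moving each of the three matrices once, so intensity grows with the shared dimension, whereas a matrix-vector product stays near one operation per byte.

```python
def arithmetic_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte moved for an (m, k) x (k, n) matmul in a 2-byte dtype (fp16/bf16)."""
    flops = 2 * m * k * n  # one multiply and one add per output term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

print(arithmetic_intensity(4096, 4096, 4096))  # matrix-matrix: ~1365 FLOPs/byte
print(arithmetic_intensity(1, 4096, 4096))     # matrix-vector: ~1 FLOP/byte
```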
Language model inference, especially during prompt processing, is well suited to GPUs because a large number of operations are performed per memory load
Decoding in language models is less efficient because each generated token reloads the model weights for relatively little math; batching or running many generations at once better utilizes the GPU hardware
Smaller models run repeatedly (with multiple output generations and a selection mechanism) can match the quality of larger models and are more hardware-efficient for certain tasks
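A sketch of the "many generations plus a selection mechanism" idea using Hugging Face transformers (the model name, sampling settings, and scoring function are placeholders of mine, not from the talk): one batched generate call samples several completions, which keeps the decode-time matmuls batched, and a cheap verifier picks the best candidate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

prompt = "Write a one-sentence summary of why GPUs favor batched matrix math."
inputs = tok(prompt, return_tensors="pt").to("cuda")

# One batched call samples 8 candidates; the per-step matmuls now carry a
# batch dimension, which raises arithmetic intensity during decoding.
out = model.generate(**inputs, do_sample=True, num_return_sequences=8,
                     temperature=0.8, max_new_tokens=64)
candidates = tok.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)

def score(text: str) -> float:
    # Placeholder verifier: in practice this would be a reward model,
    # unit tests, or another cheap quality check.
    return -abs(len(text) - 200)

print(max(candidates, key=score))
```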
Modern GPU Features and Optimization Strategies 16:09
The latest GPU tensor cores are optimized for low-precision matrix-matrix multiplication, making certain AI tasks almost "free" when formulated correctly (e.g., multi-token prediction, multi-sample queries)
Matrix-matrix operations maximize GPU throughput; performance drops dramatically if only matrix-vector operations are used
Taking advantage of GPU hardware often involves restructuring problems to increase the number of batched or parallel calculations
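One concrete form of that restructuring (sizes and the timing harness are illustrative, not from the talk): a loop of matrix-vector products and a single matmul over the stacked inputs compute the same result, but the batched form hands the GPU matrix-matrix work.

```python
import torch

W = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
xs = torch.randn(256, 4096, device="cuda", dtype=torch.bfloat16)  # 256 independent inputs

def timed(fn, iters=10):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds

matvec_loop = lambda: [x @ W for x in xs]  # one matrix-vector product per input
batched_mm = lambda: xs @ W                # the same 256 products as one matmul

print("looped matvec :", timed(matvec_loop), "ms")
print("batched matmul:", timed(batched_mm), "ms")
```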
Practical advice: running smaller models locally and using batching/sampling with appropriate verification (quality selection) can provide high performance and quality
Open models and improved tools make self-hosted, hardware-aware AI inference increasingly accessible for engineers
A "GPU glossary" resource is available at modal.com/GPU-glossery, offering explanations and links related to GPU architecture and programming
Modal provides a serverless platform for data- and compute-intensive workloads, with infrastructure designed to facilitate efficient AI (language model) inference workflows
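For context on the Modal point, a minimal sketch of a serverless GPU function (the app name, image, GPU type, and function body are illustrative choices of mine; consult Modal's documentation for current APIs):

```python
import modal

app = modal.App("gpu-sketch")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image)
def bf16_matmul_checksum() -> float:
    import torch  # imported inside the function; it only needs to exist in the remote image
    a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
    return (a @ b).float().sum().item()  # runs on the remote GPU

@app.local_entrypoint()
def main():
    print(bf16_matmul_checksum.remote())  # invoke with `modal run this_file.py`
```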