How fast are LLM inference engines anyway? — Charles Frye, Modal

The Evolution of Open Models and Inference Engines 00:19

  • The AI Engineer Summit has shifted from focusing on OpenAI wrappers to open models like Llama, Qwen, and DeepSeek, which have significantly improved in quality, making self-hosting viable.
  • The software stack for running language models has also advanced rapidly, incorporating complex techniques like KV caching, multi-token prediction, and speculative decoding, which are difficult to implement by hand (see the KV-cache sketch after this list).
  • The combination of high-quality open models and advanced inference engines (vLLM, SGLang, TensorRT-LLM) means there is less need for proprietary models unless specific requirements like air-gapped systems or government use cases exist.
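
To make the first of those techniques concrete, here is a minimal sketch of a greedy decode loop that reuses a KV cache, written against the Hugging Face transformers API with gpt2 as a small stand-in model. This is only an illustration of the idea; it is not how vLLM, SGLang, or TensorRT-LLM implement caching internally.

```python
# Minimal greedy decode loop with a KV cache (Hugging Face transformers API).
# gpt2 is a stand-in; any causal LM behaves the same way here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

input_ids = tok("The benchmark shows", return_tensors="pt").input_ids
past = None  # the KV cache: keys/values for every token seen so far

with torch.no_grad():
    for _ in range(16):
        if past is None:
            # prefill: process the whole prompt in one forward pass
            out = model(input_ids, use_cache=True)
        else:
            # decode: feed only the newest token, reuse cached K/V for the rest
            out = model(input_ids[:, -1:], past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tok.decode(input_ids[0]))
```

Without the cache, every decode step would recompute attention keys and values for the entire sequence; production engines add paged cache management, batching, and speculative decoding on top of this basic pattern.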

Past Predictions and Current Reality 03:06

  • A prediction from 2023 held that once capability requirements saturate, open models would catch up to and dominate proprietary ones, much as open source did with operating systems and databases.
  • This prediction has proven true, with open models reaching a "good enough" capability level.
  • Most early LLM inference libraries have disappeared, but vLLM has remained a consistently good option.

Benchmarking LLM Performance with LLM Almanac 04:52

  • Modal developed a benchmarking tool and the LLM Almanac (modal.com/llmalmanac) to answer common questions about the performance of open models across engines and context lengths.
  • The almanac provides detailed benchmarking methodology, open-source code, and an executive summary of results.
  • The tool allows users to select models and engines to view performance metrics like requests per second and time to first token, with initial results reflecting out-of-the-box (default configuration) engine performance.
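
The Almanac's benchmark code is open source; as a much smaller illustration of the two metrics it reports, here is a sketch that measures time-to-first-token and sequential requests per second against an OpenAI-compatible streaming endpoint (vLLM and SGLang both serve one). The URL, model name, and prompt are placeholders, not the Almanac's actual harness.

```python
# Sketch: measure time-to-first-token (TTFT) and sequential requests/second
# against an OpenAI-compatible endpoint. Endpoint URL and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request(prompt: str, max_tokens: int) -> float:
    """Send one streaming completion and return time-to-first-token in seconds."""
    start = time.perf_counter()
    stream = client.completions.create(
        model="placeholder-model",
        prompt=prompt,
        max_tokens=max_tokens,
        stream=True,
    )
    for _ in stream:  # the first streamed chunk marks the first token
        return time.perf_counter() - start
    return float("nan")

n = 32
t0 = time.perf_counter()
ttfts = [one_request("Summarize the benchmark results.", max_tokens=128) for _ in range(n)]
elapsed = time.perf_counter() - t0

print(f"median TTFT: {sorted(ttfts)[n // 2]:.3f} s")
print(f"requests/second (sequential): {n / elapsed:.2f}")
```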

Key Performance Insights 10:46

  • A significant difference in throughput is observed between "reasoning workloads" (more tokens generated) and "RAG workloads" (more tokens in context).
  • Workloads dominated by context (e.g., 1024 tokens in, 128 tokens out) show roughly 4x higher throughput than generation-heavy workloads (e.g., 128 tokens in, 1024 tokens out); see the sketch after this list.
  • This performance difference follows from the Transformer architecture: prefill processes the whole context in large, GPU-friendly matrix multiplications, while autoregressive decoding generates one token at a time.
  • Time-to-first-token latency remains almost identical even with a 10x increase in input tokens, suggesting that pushing work into the context rather than the generation can be a "free lunch" for improving throughput while still meeting latency SLAs.
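
A rough way to reproduce the context-heavy vs. generation-heavy comparison is to run the two workload shapes from the talk (≈1024 in / 128 out and ≈128 in / 1024 out) through vLLM's offline API and compare requests per second. The model name, prompt construction, and request counts below are placeholder assumptions, and "word" repetitions only approximate the target token counts; this is not the Almanac's serving benchmark.

```python
# Rough sketch: same pair of workload shapes as above, measured with vLLM's
# offline API. Model name and prompt construction are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # stand-in model

def run(prompt_tokens: int, output_tokens: int, n_requests: int = 64) -> float:
    prompts = ["word " * prompt_tokens] * n_requests   # ~1 token per repetition (approximate)
    params = SamplingParams(max_tokens=output_tokens, ignore_eos=True)
    start = time.perf_counter()
    llm.generate(prompts, params)                      # the engine batches requests internally
    return n_requests / (time.perf_counter() - start)  # requests per second

rps_context_heavy = run(prompt_tokens=1024, output_tokens=128)
rps_generation_heavy = run(prompt_tokens=128, output_tokens=1024)
print(f"context-heavy:    {rps_context_heavy:.2f} req/s")
print(f"generation-heavy: {rps_generation_heavy:.2f} req/s")
```

Under the talk's observation, the context-heavy run should come out several times faster, because its tokens are mostly consumed in prefill rather than produced one at a time in decode.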

Benchmarking Methodology and Scaling 14:39

  • The presented throughput numbers are per replica (per GPU, e.g., one H100), meaning total throughput is achieved by scaling out (adding more GPUs) rather than scaling up.
  • The benchmarking methodology involves measuring maximum throughput by sending a large batch of requests and allowing the engine to process them with maximum parallelism.
  • Minimum latency is determined by sending one request at a time and waiting for its completion, with the reported numbers sweeping between these two extremes.
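
The two extremes of that sweep can be sketched with an async client against an OpenAI-compatible endpoint: fire a large batch of concurrent requests to estimate maximum throughput per replica, and send requests strictly one at a time to estimate minimum latency. The URL, model name, request shapes, and counts below are placeholders, not Modal's actual harness, which sweeps concurrency levels between these endpoints of the curve.

```python
# Sketch of the two benchmarking extremes: max-parallelism throughput vs.
# one-at-a-time latency, against an OpenAI-compatible endpoint (placeholders).
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request() -> None:
    await client.completions.create(
        model="placeholder-model",
        prompt="word " * 128,
        max_tokens=128,
    )

async def max_throughput(n: int = 256) -> float:
    """Fire all requests at once; the engine batches them for maximum parallelism."""
    start = time.perf_counter()
    await asyncio.gather(*(one_request() for _ in range(n)))
    return n / (time.perf_counter() - start)   # requests per second, per replica

async def min_latency(n: int = 16) -> float:
    """One request at a time; approximates the best-case single-request latency."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        await one_request()
        latencies.append(time.perf_counter() - start)
    return sorted(latencies)[n // 2]           # median seconds per request

async def main() -> None:
    print(f"max throughput: {await max_throughput():.2f} req/s")
    print(f"min latency:    {await min_latency():.3f} s")

asyncio.run(main())
```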