How fast are LLM inference engines anyway? — Charles Frye, Modal
The Evolution of Open Models and Inference Engines 00:19
The AI Engineer Summit has shifted from focusing on OpenAI wrappers to open models like Llama, Qwen, and DeepSeek, which have significantly improved in quality, making self-hosting viable.
The software stack for running language models has also advanced rapidly, incorporating complex techniques like KV caching, multi-token prediction, and speculative decoding, which are difficult to implement manually.
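To give a rough sense of why these techniques are hard to hand-roll, here is a minimal toy sketch of the idea behind KV caching for a single attention head. This is an assumption-laden illustration only (NumPy stand-ins for projected activations, no batching); production engines add paged memory management, continuous batching, and fused GPU kernels.

```python
# Toy illustration of KV caching for single-head attention (illustrative only;
# real engines use paged memory, batching, and fused CUDA kernels).
import numpy as np

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []  # one entry per prefilled/generated token

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Attend the new query against cached keys/values instead of
        # recomputing K and V for the whole sequence at every decode step.
        K = np.stack(self.keys)              # (seq_len, d)
        V = np.stack(self.values)            # (seq_len, d)
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                   # (d,)

d = 64
cache = KVCache()
for step in range(8):                        # autoregressive decode loop
    q = k = v = np.random.randn(d)           # stand-ins for projected activations
    cache.append(k, v)
    out = cache.attend(q)
```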
The combination of high-quality open-source models and advanced inference engines (vLLM, SGLang, TensorRT-LLM) means there's less need for proprietary models unless specific requirements like air-gapped systems or government use cases exist.
A prediction from 2023 stated that if capability requirements saturate, open models would catch up to and dominate proprietary ones, similar to operating systems or databases.
This prediction has proven true, with open models reaching a "good enough" capability level.
Early LLM inference libraries have mostly disappeared, but vLLM has remained a consistently strong option.
Benchmarking LLM Performance with LLM Almanac 04:52
Modal developed a benchmarking tool and the LLM Almanac (modal.com/llmalmanac) to answer common questions about the performance of open models on various engines and context lengths.
The almanac provides detailed benchmarking methodology, open-source code, and an executive summary of results.
The tool allows users to select models and engines to view performance metrics like requests per second and first token latency, with initial results based on out-of-the-box performance.
A significant difference in throughput is observed between "reasoning workloads" (more tokens generated) and "RAG workloads" (more tokens in context).
Workloads dominated by context (e.g., 1024 tokens in, 128 tokens out) show much higher throughput (roughly a 4x improvement) than generation-heavy workloads (e.g., 128 tokens in, 1024 tokens out).
This performance difference is attributed to the nature of the Transformer architecture, where the large matrix multiplications of prefill are more efficient than token-by-token autoregressive decoding.
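A toy sketch of this asymmetry (NumPy on CPU with made-up shapes; real engines run many layers on GPUs, but the pattern is the same): prefill pushes the whole prompt through the weights in one large matrix multiply, while decode re-reads the weights once per generated token.

```python
# Rough sketch of why prefill outpaces decode. Toy shapes, single weight matrix;
# not a real model, just the compute pattern.
import numpy as np, time

d_model, prompt_len, gen_len = 4096, 1024, 128
W = np.random.randn(d_model, d_model).astype(np.float32)

# Prefill: one (prompt_len x d_model) @ (d_model x d_model) matmul.
prompt = np.random.randn(prompt_len, d_model).astype(np.float32)
t0 = time.perf_counter()
_ = prompt @ W
prefill_s = time.perf_counter() - t0

# Decode: gen_len sequential (1 x d_model) @ (d_model x d_model) matmuls,
# each one re-reading the full weight matrix for a single token.
t0 = time.perf_counter()
for _ in range(gen_len):
    x = np.random.randn(1, d_model).astype(np.float32)
    _ = x @ W
decode_s = time.perf_counter() - t0

print(f"prefill: {prompt_len / prefill_s:,.0f} tokens/s")
print(f"decode:  {gen_len / decode_s:,.0f} tokens/s")
```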
Time to first token latency remains almost identical even with a 10x increase in input tokens, suggesting that maximizing context over generation can be a "free lunch" for improving performance and meeting latency SLAs.
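A minimal sketch of how time to first token can be measured with a streaming request against an OpenAI-compatible endpoint such as one served by vLLM or SGLang; the base URL and model name below are placeholders, not Modal's actual benchmark harness.

```python
# Sketch: measure time-to-first-token (TTFT) with a streaming request.
# base_url and model are placeholder assumptions for a local OpenAI-compatible server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def time_to_first_token(prompt: str, model: str = "meta-llama/Llama-3.1-8B-Instruct") -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying content marks the end of prefill.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start

print(f"TTFT: {time_to_first_token('Summarize this document: ...'):.3f}s")
```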
The presented throughput numbers are per replica (per GPU, e.g., one H100), meaning total throughput is achieved by scaling out (adding more GPUs) rather than scaling up.
The benchmarking methodology involves measuring maximum throughput by sending a large batch of requests and allowing the engine to process them with maximum parallelism.
Minimum latency is determined by sending one request at a time and waiting for each to complete, with the reported numbers sweeping the range between these two extremes.
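A hedged sketch of the two endpoints of that sweep, assuming an OpenAI-compatible server; the URL, model name, and prompt shapes are placeholder assumptions rather than Modal's actual benchmarking code.

```python
# Sketch of the two extremes: maximum throughput (all requests in flight at once)
# and minimum latency (strictly one request at a time).
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model name

async def one_request(prompt: str) -> None:
    await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )

async def max_throughput(prompts: list[str]) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(one_request(p) for p in prompts))  # full parallelism
    return len(prompts) / (time.perf_counter() - start)       # requests per second

async def min_latency(prompts: list[str]) -> float:
    latencies = []
    for p in prompts:                                         # strictly sequential
        start = time.perf_counter()
        await one_request(p)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)                    # mean seconds per request

async def main():
    prompts = ["word " * 1024] * 32  # placeholder context-heavy prompts
    print(f"throughput: {await max_throughput(prompts):.2f} req/s")
    print(f"latency:    {await min_latency(prompts):.2f} s/req")

asyncio.run(main())
```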