Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten
Workshop Introduction and Overview 00:00
- Introduction of presenters and brief personal backgrounds
- Workshop intended to be interactive and tailored to audience questions and needs
- Main topics: Introduction to SGLang, setup, project history, deploying a model, performance optimization, community involvement, and codebase tour
What is SGLang and Why Use It? 02:04
- SGLang is an open-source, fast serving framework for large language and vision models
- Performs well across a variety of GPUs and is production-ready out of the box
- Day-zero support for new model releases from labs like Qwen and DeepSeek
- Actively maintained, open to community contributions and bug fixes
SGLang Adoption and History 03:07
- Used at Baseten as part of their inference stack
- Adopted by organizations such as xAI (for Grok models), multiple inference and cloud providers, research labs, and product companies
- Project began with a research paper in December 2023 and achieved nearly 15,000 GitHub stars in 18 months
- International user base and vibrant community
Project Background and Team 04:13
- Core maintainers came to the project from roles in internal model optimization and inference
- SGLang uses FlashInfer for attention and sampling kernels
- Project is affiliated with LMSYS, the recently funded organization behind Chatbot Arena
Setting Up and Deploying Your First Model 06:42
- SGLang is typically run as a server command within a Docker container
- In this workshop, models are packaged using Truss and deployed via Baseten to access GPUs (L4s chosen for their low cost and FP8 support)
- Allows configuration for different hardware types (H100, H200, and Blackwell coming soon)
- Emphasis on understanding server command flags and how configurations like quantization and batch size interact (see the sketch after this list)
- Hands-on activity: Participants deploy their first model; need to retrieve and use a unique model ID
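A minimal sketch of what launching a model with SGLang looks like, using the offline Engine API, which accepts the same arguments as the flags on the python -m sglang.launch_server command. The model path and flag values are illustrative, not the workshop's exact configuration:

```python
# Sketch only: model path and flag values are illustrative, not the workshop's config.
import sglang as sgl

# The Engine constructor accepts the same arguments as the launch_server flags,
# e.g. --quantization fp8 becomes quantization="fp8".
llm = sgl.Engine(
    model_path="Qwen/Qwen2.5-7B-Instruct",  # any Hugging Face model SGLang supports
    quantization="fp8",                     # L4s (Ada Lovelace) support FP8
)

output = llm.generate(
    "The capital of France is",
    {"temperature": 0, "max_new_tokens": 16},
)
print(output["text"])
```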
Basic Usage and Troubleshooting 10:06
- Assistance provided for setup issues (e.g., Baseten waiting room, deployment errors)
- Once deployed, models can be queried using the sample code or Jupyter notebooks, which require the model ID (see the sketch after this list)
- Repository with all resources is available on GitHub and will remain accessible for follow-up
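A running SGLang server exposes an OpenAI-compatible API, so the deployed model can be queried with the standard openai client. A sketch, assuming a local server on port 30000; a Baseten deployment would substitute the endpoint built from your model ID for base_url:

```python
# Sketch: querying an SGLang server through its OpenAI-compatible endpoint.
# Assumption: server running locally on port 30000; for a Baseten deployment,
# replace base_url with the endpoint derived from your model ID.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # typically the model path used at launch
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```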
Performance Optimization: CUDA Graphs 12:56
- Demonstration of using CUDA graphs to improve performance by adjusting the CUDA graph max batch size
- The default CUDA graph max batch size on an L4 is eight; raising it allows larger concurrent batches to run under CUDA graphs
- Benchmarking tools used to monitor request throughput and the running decode batch size
- Setting an appropriate max batch size is important for keeping CUDA graphs enabled during inference; parameter tuning explained (see the sketch after this list)
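A sketch of raising the CUDA graph capture range via the Engine API; on the server command this corresponds to the --cuda-graph-max-bs flag. The value 16 is an illustrative assumption, not a recommendation:

```python
# Sketch: capture CUDA graphs for larger decode batches than the L4 default of 8.
# The value 16 is illustrative; tune it against representative traffic.
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen2.5-7B-Instruct",
    cuda_graph_max_bs=16,  # equivalent to --cuda-graph-max-bs 16 on the server command
)

# Decode batches larger than cuda_graph_max_bs fall back to regular kernel launches,
# so set it at or above expected peak concurrency, memory permitting.
prompts = [f"What is {i} + {i}?" for i in range(16)]
outputs = llm.generate(prompts, {"temperature": 0, "max_new_tokens": 8})
print(outputs[0]["text"])
```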
Performance Optimization: EAGLE-3 Speculative Decoding 24:17
- EAGLE-3 is a recently released speculative decoding algorithm supported by SGLang
- Allows configuration of parameters such as the number of tokens to speculate and the speculation depth (see the sketch after this list)
- EAGLE's draft model is derived from the target model itself, not a separate smaller model
- Examples and scripts provided in the repo for benchmarking and tuning EAGLE-3 parameters for optimal production settings
- Importance of using representative prompts for benchmarking emphasized
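A sketch of enabling EAGLE-3, assuming a Llama 3.1 8B target with a published EAGLE-3 draft head; the tuning values are placeholders to benchmark, not recommendations:

```python
# Sketch: EAGLE-3 speculative decoding. Model paths and tuning values are
# placeholders; benchmark with representative prompts before settling on them.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE3",
    speculative_draft_model_path="yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    speculative_num_steps=3,         # speculation depth
    speculative_eagle_topk=4,        # branches explored per step
    speculative_num_draft_tokens=8,  # draft tokens verified by the target model
)

out = llm.generate("Explain speculative decoding in one sentence.", {"max_new_tokens": 48})
print(out["text"])
```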
Community Involvement and Codebase Introduction 29:40
- SGLang has an active community presence on GitHub, Twitter, and Slack
- Contributions encouraged: filing issues, tackling good first issues, participating in meetups
- Codebase contains the runtime, a domain-specific front-end language, and optimized kernel modules (see the sketch after this list)
- Documentation and codebase tours available for onboarding new contributors
- Specific areas to contribute: kernel development, router/caching logic, runtime features, and support for custom models
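For a taste of the front-end language mentioned above, a sketch of an SGLang program: a decorated Python function that interleaves prompt text with generation calls, assuming a server is already running on port 30000:

```python
# Sketch of the SGLang frontend DSL; assumes a server running on localhost:30000.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.system("You are a concise assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is SGLang?")
print(state["answer"])
```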
Wrap-Up: Invitations and Q&A 35:08
- Mention of a happy hour event with Oxen AI, highlighting fine-tuning and benchmarking against major models
- Baseten has open roles for those interested in model performance and infrastructure
- Q&A covers:
  - Reasons for choosing SGLang: deep configuration options, extensibility, and ease of contribution (user-submitted fixes can be merged quickly)
  - Security protocols are largely engine-agnostic; containerization and runtime isolation are best practices
  - SGLang is suitable for secure and air-gapped environments because it is open source and fully inspectable
  - Integration with decentralized (blockchain) protocols was not addressed; SGLang is primarily used in traditional client-server deployments
  - Baseten uses multiple inference runtimes (including vLLM and TensorRT-LLM), choosing based on compatibility and other technical requirements
Closing Remarks 43:10
- Presenters available at the Baseten booth for further questions and discussions about SGLang and job opportunities
- Appreciation expressed to participants for attending and engaging