Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

Workshop Introduction and Overview 00:00

  • Introduction of presenters and brief personal backgrounds
  • Workshop intended to be interactive and tailored to audience questions and needs
  • Main topics: Introduction to SGLang, setup, project history, deploying a model, performance optimization, community involvement, and codebase tour

What is SGLang and Why Use It? 02:04

  • SGLang is an open-source, fast serving framework for large language and vision models
  • Performs well across a variety of GPUs and is production-ready out of the box
  • Day-zero support for new model releases from labs like Qwen and DeepSeek
  • Actively maintained, open to community contributions and bug fixes

SGLang Adoption and History 03:07

  • Used at Baseten as part of their inference stack
  • Adopted by organizations such as xAI (for Grok models), multiple inference and cloud providers, research labs, and product companies
  • Project began with a research paper in December 2023 and achieved nearly 15,000 GitHub stars in 18 months
  • International user base and vibrant community

Project Background and Team 04:13

  • Core maintainers came to the project from roles in internal model optimization and inference
  • SGLang uses FlashInfer for attention and sampling kernels
  • Project is affiliated with LMSYS, whose Chatbot Arena project was recently funded

Setting Up and Deploying Your First Model 06:42

  • SGLang is typically run as a server command within a Docker container
  • In this workshop, models are packaged using Truss and deployed via Baseten to access GPUs (L4s used for their cost and FP8 support)
  • The same setup can target other hardware types (H100, H200, and soon Blackwell)
  • Emphasis on understanding server command flags and how configurations like quantization and batch size interact (a representative command is sketched after this list)
  • Hands-on activity: Participants deploy their first model; need to retrieve and use a unique model ID
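
A representative launch command, sketched under assumptions: the workshop deploys via Truss on Baseten rather than raw Docker, and the model name, port, and image tag here are illustrative only (check the SGLang docs for your version's exact flags):

```bash
# Minimal sketch: run SGLang's server in the official Docker image.
# Model, port, and fp8 quantization are illustrative choices; fp8 matches
# the L4 hardware support mentioned above.
docker run --gpus all --shm-size 16g -p 30000:30000 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --quantization fp8 \
    --host 0.0.0.0 --port 30000
```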

Basic Usage and Troubleshooting 10:06

  • Assistance provided for setup issues (e.g., the Baseten waiting room, deployment errors)
  • Once deployed, models can be called from the provided sample code or Jupyter notebooks, which require the model ID (a minimal request is sketched below)
  • Repository with all resources is available on GitHub and will remain accessible for follow-up
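
Once the server is up, any OpenAI-compatible client can call it. A minimal sketch using curl against a local SGLang server; for a Baseten-hosted model, swap in the endpoint URL and API key from the dashboard along with your model ID:

```bash
# Hedged example: SGLang exposes an OpenAI-compatible /v1/chat/completions
# endpoint. URL, model name, and token budget are illustrative.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```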

Performance Optimization: CUDA Graphs 12:56

  • Demonstration of using CUDA graphs to improve performance by adjusting the CUDA graph max batch size
  • The default CUDA graph max batch size on L4 is eight; raising it allows larger concurrent batches to run under CUDA graphs
  • Benchmarking tools are used to monitor request throughput and the running decode batch size
  • Setting an appropriate max batch size matters because decode batches larger than the captured size fall back to slower, graph-free execution; the tuning process is walked through (see the sketch after this list)
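
A sketch of the tuning loop described above, with illustrative values (flag names follow the SGLang docs; defaults vary by version and GPU):

```bash
# Raise the CUDA graph capture limit so larger decode batches stay on the
# graph-accelerated path (the workshop cites a default of 8 on L4).
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --cuda-graph-max-bs 16

# In a second terminal, measure throughput with SGLang's serving benchmark
# and watch the decode batch size in the server logs while it runs.
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random --num-prompts 200 \
  --random-input-len 1024 --random-output-len 256
```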

Performance Optimization: Eagle 3 Speculative Decoding 24:17

  • Eagle 3 is a recently released speculative decoding algorithm supported by SGLang
  • Allows configuration of parameters such as the number of tokens to speculate and the draft depth (example flags after this list)
  • Eagle's draft model is derived from the target model, not a separate smaller model
  • Examples and scripts provided in the repo for benchmarking and tuning Eagle 3 parameters for optimal production settings
  • Importance of using representative prompts for benchmarking emphasized
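
A hedged sketch of enabling EAGLE-3, with illustrative parameter values; the draft-model path is a placeholder for the EAGLE-3 head published for your target model, and the right numbers come from benchmarking with representative prompts as noted above:

```bash
# --speculative-num-steps sets the draft depth, --speculative-eagle-topk the
# branching per step, and --speculative-num-draft-tokens the total draft
# tokens verified per round. Values here are starting points to tune,
# not recommendations.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <eagle3-draft-for-target-model> \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 32
```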

Community Involvement and Codebase Introduction 29:40

  • SGLang has an active community presence on GitHub, Twitter, and Slack
  • Contributions encouraged: filing issues, tackling good first issues, participating in meetups
  • Codebase contains runtime, a domain-specific front-end language, and optimized kernel modules
  • Documentation and codebase tours available for onboarding new contributors
  • Specific areas to contribute: kernel development, router/caching logic, runtime features, and support for custom models

Wrap-Up: Invitations and Q&A 35:08

  • Mention of a happy hour event with Oxen AI, highlighting fine-tuning and benchmarking against major models
  • Baseten has open roles for those interested in model performance and infrastructure
  • Q&A covers:
    • Reasons for choosing SGLang: deep configurability, extensibility, ease of contribution, and the speed with which user-contributed fixes get merged
    • Security protocols are largely engine-agnostic; containerization and runtime isolation are best practices
    • SGLang suits secure and air-gapped environments because it is open source and fully inspectable
    • Integration with decentralized (blockchain) protocols is not addressed; SGLang is primarily used in traditional client-server deployments
    • Baseten uses multiple inference runtimes (including vLLM and TensorRT-LLM), choosing based on compatibility and other technical requirements

Closing Remarks 43:10

  • Presenters available at the Baseten booth for further questions and discussions about SGLang and job opportunities
  • Appreciation expressed to participants for attending and engaging