Introduction to LLM serving with SGLang - Philip Kiely and Yineng Zhang, Baseten

Workshop Introduction and Overview 00:00

  • Introduction of presenters and brief personal backgrounds
  • Workshop intended to be interactive and tailored to audience questions and needs
  • Main topics: Introduction to SGLang, setup, project history, deploying a model, performance optimization, community involvement, and codebase tour

What is SGLang and Why Use It? 02:04

  • SGLang is an open-source, fast serving framework for large language and vision models
  • Performs well across a variety of GPUs and is production-ready out of the box
  • Day-zero support for new model releases from labs like Qwen and DeepSeek
  • Actively maintained, open to community contributions and bug fixes

SGLang Adoption and History 03:07

  • Used at Baseten as part of their inference stack
  • Adopted by organizations such as xAI (for Grok models), multiple inference and cloud providers, research labs, and product companies
  • Project began with a research paper in December 2023 and achieved nearly 15,000 GitHub stars in 18 months
  • International user base and vibrant community

Project Background and Team 04:13

  • Core maintainers came to the project from roles in internal model optimization and inference
  • SGLang uses FlashInfer for attention and sampling kernels
  • Project is affiliated with LMSYS, whose Chatbot Arena project was recently funded

Setting Up and Deploying Your First Model 06:42

  • SGLang is typically run as a server command within a Docker container
  • In this workshop, models are packaged using Truss and deployed via Baseten to access GPUs (L4s used for their cost and FP8 support)
  • The same setup can target other hardware types (H100, H200, and soon Blackwell)
  • Emphasis on understanding server command flags and how configurations like quantization and batch size interact (a representative command is sketched after this list)
  • Hands-on activity: Participants deploy their first model; need to retrieve and use a unique model ID
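
A representative launch command, sketched under assumptions: the workshop deploys via Truss on Baseten rather than raw Docker, and the model name, port, and image tag here are illustrative only (check the SGLang docs for your version's exact flags):

```bash
# Minimal sketch: run SGLang's server in the official Docker image.
# Model, port, and fp8 quantization are illustrative choices; fp8 matches
# the L4 hardware support mentioned above.
docker run --gpus all --shm-size 16g -p 30000:30000 \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-7B-Instruct \
    --quantization fp8 \
    --host 0.0.0.0 --port 30000
```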

Basic Usage and Troubleshooting 10:06

  • Assistance provided for setup issues (e.g., the Baseten waiting room, deployment errors)
  • Once deployed, models can be called from the provided sample code or Jupyter notebooks, which require the model ID (a minimal request is sketched below)
  • Repository with all resources is available on GitHub and will remain accessible for follow-up
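
Once the server is up, any OpenAI-compatible client can call it. A minimal sketch using curl against a local SGLang server; for a Baseten-hosted model, swap in the endpoint URL and API key from the dashboard along with your model ID:

```bash
# Hedged example: SGLang exposes an OpenAI-compatible /v1/chat/completions
# endpoint. URL, model name, and token budget are illustrative.
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```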

Performance Optimization: CUDA Graphs 12:56

  • Demonstration of using CUDA graphs to improve performance by adjusting the CUDA graph max batch size
  • The default CUDA graph max batch size on L4 is eight; raising it allows larger concurrent batches to run under CUDA graphs
  • Benchmarking tools are used to monitor request throughput and the running decode batch size
  • Setting an appropriate max batch size matters because decode batches larger than the captured size fall back to slower, graph-free execution; the tuning process is walked through (see the sketch after this list)
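
A sketch of the tuning loop described above, with illustrative values (flag names follow the SGLang docs; defaults vary by version and GPU):

```bash
# Raise the CUDA graph capture limit so larger decode batches stay on the
# graph-accelerated path (the workshop cites a default of 8 on L4).
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-7B-Instruct \
  --cuda-graph-max-bs 16

# In a second terminal, measure throughput with SGLang's serving benchmark
# and watch the decode batch size in the server logs while it runs.
python3 -m sglang.bench_serving --backend sglang \
  --dataset-name random --num-prompts 200 \
  --random-input-len 1024 --random-output-len 256
```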

Performance Optimization: Eagle 3 Speculative Decoding 24:17

  • Eagle 3 is a recently released speculative decoding algorithm supported by SGLang
  • Allows configuration of parameters such as the number of tokens to speculate and the draft depth (example flags after this list)
  • Eagle's draft model is derived from the target model, not a separate smaller model
  • Examples and scripts provided in the repo for benchmarking and tuning Eagle 3 parameters for optimal production settings
  • Importance of using representative prompts for benchmarking emphasized
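
A hedged sketch of enabling EAGLE-3, with illustrative parameter values; the draft-model path is a placeholder for the EAGLE-3 head published for your target model, and the right numbers come from benchmarking with representative prompts as noted above:

```bash
# --speculative-num-steps sets the draft depth, --speculative-eagle-topk the
# branching per step, and --speculative-num-draft-tokens the total draft
# tokens verified per round. Values here are starting points to tune,
# not recommendations.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path <eagle3-draft-for-target-model> \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 32
```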

Community Involvement and Codebase Introduction 29:40

  • SGLang has an active community presence on GitHub, Twitter, and Slack
  • Contributions encouraged: filing issues, tackling good first issues, participating in meetups
  • Codebase contains runtime, a domain-specific front-end language, and optimized kernel modules
  • Documentation and codebase tours available for onboarding new contributors
  • Specific areas to contribute: kernel development, router/caching logic, runtime features, and support for custom models

Wrap-Up: Invitations and Q&A 35:08

  • Mention of a happy hour event with Oxen AI, highlighting fine-tuning and benchmarking against major models
  • Baseten has open roles for those interested in model performance and infrastructure
  • Q&A covers:
    • Reasons for choosing SGLang: deep configurability, extensibility, ease of contribution, and the speed with which user-contributed fixes get merged
    • Security protocols are largely engine-agnostic; containerization and runtime isolation are best practices
    • SGLang suits secure and air-gapped environments because it is open source and fully inspectable
    • Integration with decentralized (blockchain) protocols is not addressed; SGLang is primarily used in traditional client-server deployments
    • Baseten uses multiple inference runtimes (including vLLM and TensorRT-LLM), choosing based on compatibility and other technical requirements

Closing Remarks 43:10

  • Presenters available at the Baseten booth for further questions and discussions about SGLang and job opportunities
  • Appreciation expressed to participants for attending and engaging