Luma launched Dream Machine, their first video model, on June 11th, 2024, expecting significant user interest
They initially allocated 500 H100 GPUs, but demand quickly exceeded expectations, creating a queue of nearly 100,000 requests
Manually scaled up to 5,000 H100 GPUs in six hours, with request queues draining by around 2 p.m. that day
After a public statement about improved speed, demand surged again, forcing them to repurpose their entire training cluster (another 4,000 H100 GPUs), which still wasn't enough to clear the queues
Managed to reach one million users within four days, surpassing ChatGPT's five-day record for this milestone
Processed around half a million videos in the initial 12 hours
Initial serving stack relied on tightly coupled containers, which enabled quick deployment without external dependencies but posed scaling and reliability issues
Attempted to use the Triton Inference Server, but faced brittleness, lack of support for non-Nvidia chipsets, and developer inconvenience
Migrated to a custom serving stack built on vanilla PyTorch, maximizing hardware compatibility and flexibility for different chipsets
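A minimal illustration of the hardware-agnostic idea behind serving on vanilla PyTorch: device selection becomes the only vendor-specific branch, and the rest of the serving code stays the same. The function name is hypothetical, not part of Luma's stack.

```python
import torch

def pick_device() -> torch.device:
    """Pick whatever accelerator the local PyTorch build supports."""
    if torch.cuda.is_available():          # covers both NVIDIA (CUDA) and AMD (ROCm) builds
        return torch.device("cuda")
    if torch.backends.mps.is_available():  # Apple silicon fallback, mostly for local dev
        return torch.device("mps")
    return torch.device("cpu")

# The same serving code runs regardless of vendor.
model = torch.nn.Linear(8, 8).to(pick_device())
```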
Decoupled architecture allowed CPU workers to queue tasks separately from GPUs, reducing GPU idle time and enabling easy scaling across various providers
With tools like Tailscale and SeaweedFS, GPUs could be added from varied sources without complex provisioning
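A sketch of the decoupled, pull-based pattern described above, with Redis standing in for the shared queue; the queue backend, key name, hostname, and function names are assumptions for illustration, not Luma's actual stack.

```python
import json
import time

import redis  # assumed queue backend; any shared queue would do

QUEUE_KEY = "video_jobs"                      # hypothetical key name
r = redis.Redis(host="queue.internal", port=6379)  # placeholder host

def enqueue_job(prompt: str, user_tier: str) -> None:
    """CPU-side worker: prepare the request and push it onto the shared queue."""
    job = {"prompt": prompt, "tier": user_tier, "enqueued_at": time.time()}
    r.rpush(QUEUE_KEY, json.dumps(job))

def gpu_worker_loop(run_inference) -> None:
    """GPU-side worker: pull work only when free, so GPUs aren't tied to any one dispatcher."""
    while True:
        item = r.blpop(QUEUE_KEY, timeout=5)  # block until a job is available
        if item is None:
            continue                          # idle; frequent idling signals the pool can shrink
        job = json.loads(item[1])
        run_inference(job["prompt"])
```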
Encountered issues with back pressure: too many CPU workers pulling jobs into one cluster could block workloads
Implemented dispatch limitations to avoid cluster congestion
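One way to express such a dispatch limit is a per-cluster cap on in-flight jobs, so CPU workers stop pulling work for a cluster that is already saturated. This is a hedged sketch of the idea; the class, method names, and cap value are assumptions.

```python
import threading

class DispatchLimiter:
    """Caps how many jobs may be in flight per GPU cluster at once (illustrative)."""

    def __init__(self, max_in_flight_per_cluster: int = 64):
        self._semaphores: dict[str, threading.BoundedSemaphore] = {}
        self._max = max_in_flight_per_cluster
        self._lock = threading.Lock()

    def _sem(self, cluster: str) -> threading.BoundedSemaphore:
        with self._lock:
            if cluster not in self._semaphores:
                self._semaphores[cluster] = threading.BoundedSemaphore(self._max)
            return self._semaphores[cluster]

    def try_dispatch(self, cluster: str) -> bool:
        # Non-blocking acquire: if the cluster is saturated, leave the job on the queue.
        return self._sem(cluster).acquire(blocking=False)

    def complete(self, cluster: str) -> None:
        # Release the slot once the cluster finishes the job.
        self._sem(cluster).release()
```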
Developed a fair scheduling system for prioritizing jobs among various user tiers (API, enterprise, unlimited, light/free) to address work starvation
Video models are large, comprising many sub-models, and require significant warmup, making traditional autoscaling wasteful
Built an architecture leveraging Slurm-based scheduling and pull-based systems for dynamic GPU pool expansion and contraction, suiting different load scenarios
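A rough sketch of how a Slurm-backed pool might expand and contract with load: each Slurm job launches one pull-based GPU worker, and a small controller submits or cancels jobs based on queue depth. The batch script name, thresholds, and scaling policy are assumptions, not Luma's implementation.

```python
import subprocess

SCALE_UP_THRESHOLD = 1000   # queued jobs above which we add a worker (illustrative)
SCALE_DOWN_THRESHOLD = 50   # queued jobs below which we retire a worker (illustrative)

def scale_pool(queue_depth: int, active_slurm_job_ids: list[str]) -> None:
    if queue_depth > SCALE_UP_THRESHOLD:
        # Ask Slurm for one more GPU worker; on startup it joins the pull-based pool.
        # worker.sbatch is a hypothetical batch script; --parsable prints just the job id.
        out = subprocess.run(
            ["sbatch", "--parsable", "worker.sbatch"],
            check=True, capture_output=True, text=True,
        )
        active_slurm_job_ids.append(out.stdout.strip())
    elif queue_depth < SCALE_DOWN_THRESHOLD and active_slurm_job_ids:
        # Retire one worker; in practice it should drain in-flight work before exiting.
        subprocess.run(["scancel", active_slurm_job_ids.pop()], check=True)
```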
Scaling up for larger models was limited by resource constraints, necessitating smarter scheduling
Naive priority-based methods led to lower-tier users experiencing excessive delays (up to 9 hours)
Introduced a service level objective (SLO) system: each job tier has a maximum acceptable wait time, and jobs move forward in the queue as their waiting time approaches the SLO
Avoided starvation by ranking jobs by the fraction of their SLO already consumed, equalizing urgency across tiers
Result was more intuitive and equitable scheduling, balancing user expectations with limited compute
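A minimal sketch of that SLO-based ordering: each tier gets a maximum acceptable wait, and the scheduler always picks the job that has consumed the largest fraction of its SLO, so a free-tier job nearing its (longer) deadline can outrank a fresh enterprise job. The tier names mirror the ones above, but the SLO values and function names are illustrative.

```python
import time
from dataclasses import dataclass

# Illustrative SLOs (max acceptable wait, in seconds) per tier; not Luma's real numbers.
TIER_SLO_SECONDS = {
    "api": 60,
    "enterprise": 120,
    "unlimited": 600,
    "free": 3600,
}

@dataclass
class Job:
    job_id: str
    tier: str
    enqueued_at: float

def urgency(job: Job, now: float) -> float:
    """Fraction of the job's SLO already consumed; directly comparable across tiers."""
    waited = now - job.enqueued_at
    return waited / TIER_SLO_SECONDS[job.tier]

def next_job(queue: list[Job]) -> Job:
    """Pick the job closest to (or furthest past) its SLO, regardless of tier."""
    now = time.time()
    return max(queue, key=lambda j: urgency(j, now))
```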
Leveraged a model repository paradigm, where each model is stored as a folder with subfolders for immutable versions and a YAML file designating the active one
Each model version includes code, environment dependencies, and checkpoints, ensuring reproducibility and easy rollback
Automated rollout system allows workers to upgrade models in place by updating the YAML file, enabling smooth, scalable deployments across thousands of GPUs
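A sketch of what such a repository layout and rollout step could look like; the directory names, YAML schema, and field names are assumptions for illustration, not Luma's actual format.

```python
import pathlib

import yaml  # PyYAML, assumed available

# Hypothetical layout (names are illustrative):
#   models/dream_machine/
#     active.yaml        <- names the version workers should serve
#     v1/  code/  env/  checkpoints/
#     v2/  code/  env/  checkpoints/

def resolve_active_version(model_root: str) -> pathlib.Path:
    """Return the immutable version directory that active.yaml currently points at."""
    root = pathlib.Path(model_root)
    config = yaml.safe_load((root / "active.yaml").read_text())
    return root / config["active_version"]

def rollout(model_root: str, new_version: str) -> None:
    """Rollout (or rollback) is a one-line change: repoint active.yaml and let
    workers pick up the new version on their next poll."""
    (pathlib.Path(model_root) / "active.yaml").write_text(
        yaml.safe_dump({"active_version": new_version})
    )
```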
Collaboration with Chipset Providers & Hardware Agnosticism 16:08
Close partnerships with major chip vendors (Nvidia, AMD) ensure PyTorch compatibility and optimized performance across chipsets
Optimization team at Luma works on low-level PyTorch operations and coordinates with vendors for efficiency improvements
Exploring additional hardware options like Groq and Amazon custom chips, with new partnerships expanding these efforts
Use cloud providers’ managed Kubernetes clusters, not direct hardware (bare metal) provisioning, though some variability exists per provider
The application pipeline incorporates multimodal components—image and video quality assurance and captioning—but not full visual question answering (VQA) yet
Hinted at potential future advances in multimodal model capabilities