General Intelligence is Multimodal — Keegan McCallum, Luma AI

Dream Machine Launch & Initial Scaling Challenges 00:03

  • Luma launched Dream Machine, their first video model, on June 11th, 2024, expecting significant user interest
  • They initially allocated 500 H100 GPUs, but demand quickly outstripped that capacity, creating a queue of nearly 100,000 requests
  • Manually scaled up to 5,000 H100 GPUs in six hours, with request queues draining by around 2 p.m. that day
  • After a public statement about improved speed, demand surged again, forcing them to commandeer their entire training cluster (another 4,000 H100 GPUs), which still wasn't enough to clear the queues
  • Managed to reach one million users within four days, surpassing ChatGPT's five-day record for this milestone
  • Processed around half a million videos in the initial 12 hours

Luma Overview & Features 03:09

  • Luma is a foundation model lab focused on building general multimodal intelligence capable of understanding and operating in the physical world
  • Showcased a new "modify video" feature: users upload an iPhone-shot video, describe the desired change in a text prompt, and the model transforms the footage accordingly
  • The API is designed for easy integration: no complex prompt engineering is required, and developers can use the SDK to access generative media features (a hypothetical call is sketched below)
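
The talk doesn't show the SDK surface itself, but the integration pattern it describes is submit-then-poll. A minimal sketch in plain Python, where the endpoint, field names, and response schema are all assumptions rather than Luma's documented API:

```python
import time
import requests

API_URL = "https://api.example.com/v1/generations"  # hypothetical endpoint
API_KEY = "YOUR_KEY"

# Submit a generation job: a plain-language prompt, no prompt engineering
resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"prompt": "a golden retriever surfing at sunset"},
)
resp.raise_for_status()
job = resp.json()

# Generation is asynchronous, so poll until the video is ready
while job["state"] not in ("completed", "failed"):
    time.sleep(5)
    job = requests.get(
        f"{API_URL}/{job['id']}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    ).json()

print(job.get("video_url"))
```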

Infrastructure Evolution & Serving Stack 04:30

  • The initial serving stack relied on tightly coupled containers, which enabled quick, dependency-free deployment but created scaling and reliability problems
  • Attempted to use the Triton inference server, but it proved brittle, lacked support for non-Nvidia chipsets, and was inconvenient for developers
  • Migrated to a custom serving stack built on vanilla PyTorch, maximizing hardware compatibility and flexibility for different chipsets
  • Decoupled architecture allowed CPU workers to queue tasks separately from GPUs, reducing GPU idle time and enabling easy scaling across providers (see the worker-loop sketch after this list)
  • With tools like Tailscale and SeaweedFS, GPUs could be added from varied sources without complex provisioning
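
A minimal sketch of the pull-based worker pattern described above, assuming Redis as the queue broker (the talk doesn't name the queue technology, and the model-loading and inference functions are placeholders):

```python
import json
import redis  # assumed broker; the talk doesn't specify one

queue = redis.Redis(host="queue.internal", port=6379)

def load_model():
    # placeholder: loading a large, multi-part video model is expensive,
    # which is why workers load once and stay warm
    return object()

def run_inference(model, request: dict) -> dict:
    # placeholder for the actual generation step
    return {"id": request["id"], "status": "done"}

# GPU worker loop: pull a job only when this worker has free capacity.
# Because work is pulled rather than pushed, a GPU from any provider can
# join the pool by starting this loop -- no per-machine provisioning.
model = load_model()
while True:
    item = queue.blpop("video-jobs", timeout=5)
    if item is None:
        continue  # queue empty; keep polling
    _, payload = item
    request = json.loads(payload)
    result = run_inference(model, request)
    queue.rpush(f"results:{request['id']}", json.dumps(result))
```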

Scaling Obstacles & Scheduler Adjustments 07:43

  • Encountered back-pressure issues: too many CPU workers pulling jobs into a single cluster could block workloads
  • Implemented dispatch limits to avoid cluster congestion (a cap-based sketch follows this list)
  • Developed a fair scheduling system for prioritizing jobs among various user tiers (API, enterprise, unlimited, light/free) to address work starvation
  • Video models are large, comprising many sub-models, and require significant warmup, making traditional autoscaling wasteful
  • Built an architecture combining Slurm-based scheduling with pull-based workers, letting GPU pools expand and contract dynamically under different load scenarios
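
One simple way to realize the dispatch limit above is a per-cluster cap on in-flight jobs; everything here (the cap value, the helper names) is illustrative rather than Luma's actual implementation:

```python
import threading

MAX_IN_FLIGHT = 64  # illustrative cap; tuned per cluster in practice

_limits: dict[str, threading.BoundedSemaphore] = {}

def _limiter(cluster: str) -> threading.BoundedSemaphore:
    return _limits.setdefault(cluster, threading.BoundedSemaphore(MAX_IN_FLIGHT))

def send_to_cluster(cluster: str, job: dict) -> None:
    # placeholder for the actual transport into the cluster's local queue
    print(f"dispatched {job['id']} to {cluster}")

def try_dispatch(cluster: str, job: dict) -> bool:
    # Non-blocking acquire: if the cluster is saturated, leave the job in
    # the global queue so another cluster can take it (back pressure)
    if not _limiter(cluster).acquire(blocking=False):
        return False
    send_to_cluster(cluster, job)
    return True

def on_job_complete(cluster: str) -> None:
    # called when a GPU finishes a job, freeing one dispatch slot
    _limiter(cluster).release()
```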

Fair Scheduling System & SLO Management 10:15

  • Scaling up for larger models was limited by resource constraints, necessitating smarter scheduling
  • Naive priority-based scheduling left lower-tier users waiting excessively (up to 9 hours)
  • Introduced a service level objective (SLO) system: each job tier has a maximum acceptable wait time, and jobs move forward in the queue as their waiting time approaches the SLO
  • Solved possible starvation by ranking jobs by the percentage of their SLO already consumed, equalizing urgency across tiers (sketched below)
  • Result was more intuitive and equitable scheduling, balancing user expectations with limited compute
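
The ranking rule is simple enough to sketch directly: order jobs by the fraction of their SLO already consumed, so a free-tier job near its deadline outranks a premium job that just arrived. Tier names and SLO values below are illustrative; the talk doesn't give the actual numbers:

```python
import time
from dataclasses import dataclass, field

# Maximum acceptable queue wait per tier, in seconds (illustrative values)
SLO_SECONDS = {"api": 60, "enterprise": 120, "unlimited": 600, "free": 1800}

@dataclass
class Job:
    id: str
    tier: str
    enqueued_at: float = field(default_factory=time.time)

    def slo_fraction_used(self, now: float) -> float:
        # Fraction of this job's wait budget already spent. A free job at
        # 90% of its 30-minute SLO outranks an API job at 10% of its
        # 1-minute SLO, so no tier starves
        return (now - self.enqueued_at) / SLO_SECONDS[self.tier]

def next_job(queue: list[Job]) -> Job:
    # Pop whichever job is closest to blowing its SLO
    now = time.time()
    job = max(queue, key=lambda j: j.slo_fraction_used(now))
    queue.remove(job)
    return job
```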

Model Management & Deployment 13:27

  • Leveraged a model repository paradigm, where each model is stored as a folder with subfolders for immutable versions and a YAML file designating the active one
  • Each model version includes code, environment dependencies, and checkpoints, ensuring reproducibility and easy rollback
  • Automated rollout system allows workers to upgrade models in place by updating the YAML file, enabling smooth, scalable deployments across thousands of GPUs (a layout and rollout-check sketch follows)
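
A sketch of what that repository layout and the worker-side rollout check might look like; the folder names, YAML schema, and mount path are assumptions:

```python
# Assumed layout on a shared mount (e.g., served via SeaweedFS):
#   models/dream_machine/
#     active.yaml        # e.g., {"active": "v3"}
#     v1/ v2/ v3/        # immutable: code, env spec, and checkpoints
from pathlib import Path
import yaml  # PyYAML

REPO = Path("/mnt/models")  # hypothetical mount point

def active_version(model: str) -> str:
    # Workers poll this file; flipping "active" rolls the whole fleet
    # forward (or back) without redeploying containers
    cfg = yaml.safe_load((REPO / model / "active.yaml").read_text())
    return cfg["active"]

def maybe_reload(model: str, loaded: str):
    # Upgrade in place when the active version changes
    target = active_version(model)
    if target == loaded:
        return None, loaded
    version_dir = REPO / model / target
    return load_checkpoints(version_dir), target

def load_checkpoints(version_dir: Path):
    # placeholder for loading the version's code, env, and checkpoints
    return version_dir
```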

Collaboration with Chipset Providers & Hardware Agnosticism 16:08

  • Close partnerships with major chip vendors (Nvidia, AMD) ensure PyTorch compatibility and optimized performance across chipsets
  • Optimization team at Luma works on low-level PyTorch operations and coordinates with vendors for efficiency improvements
  • Exploring additional hardware options like Groq and Amazon custom chips, with new partnerships expanding these efforts

Additional Technical Details & Q&A 17:33

  • Use cloud providers’ managed Kubernetes clusters rather than provisioning bare-metal hardware directly, though setups vary by provider
  • The application pipeline incorporates multimodal components—image and video quality assurance and captioning—but not full visual question answering (VQA) yet
  • Hinted at potential future advances in multimodal model capabilities