The workshop focuses on explaining Mixture of Experts (MoE) architecture and then building a Mixture of Agents (MoA) application, replacing experts with agents.
Attendees will build their own app in a hands-on session.
Daria Soboleva, Head Research Scientist at Cerebras for 4.5 years, focuses on MoE and hardware-efficient LLM training, and created the SlimPajama data set.
Daniel Kim, Head of Growth at Cerebras, handles developer activations, marketing, and startup sales for Cerebras tokens and research projects.
Cerebras is a hardware company that makes custom silicon for running AI models "super duper fast," holding world records in public model hosting.
For Llama 3.3 70B, Cerebras is 15.5 times faster than the fastest GPU inference provider.
The evolution of LLMs progressed from scaling model size (GPT-3) to improving data quality (Llama) and now to architectural improvements like Mixture of Experts (DeepSeek-V3).
MoE addresses the challenge of scaling models further (e.g., from 13B to 600B parameters) by improving efficiency and inference infrastructure.
In a transformer architecture, the feed-forward network is a bottleneck because it must disentangle all information and activate specific neurons for diverse tasks (languages, domains).
MoE solves this by replacing one monolithic feed-forward network with multiple specialized "experts," each handling a specific task (e.g., math problems, biology).
An additional "router" network decides which expert to select for a particular token, allowing models to increase parameters and capacity without increasing inference time, leading to better quality.
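As a concrete picture of the router-plus-experts idea, here is a minimal sketch of a top-1 MoE layer, assuming PyTorch; the expert count, hidden sizes, and top-1 routing are illustrative assumptions, not the configuration of any particular production model.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Replaces one monolithic feed-forward block with several expert FFNs plus a router."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(d_model, num_experts)  # scores every expert for each token

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)      # routing probabilities per token
        top_prob, top_idx = scores.max(dim=-1)       # pick the single best expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                      # tokens routed to this expert
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because each token only passes through one expert, total parameters (and capacity) grow with the number of experts while the compute per token stays roughly that of a single feed-forward block.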
MoE is an industry standard used by companies like OpenAI (GPT-4) and Anthropic to scale models and gain better skills efficiently.
Ilya Sutskever's idea of "inference-time compute" suggests that after hitting a data wall in pre-training, spending more compute after training (at inference) can yield better results.
Complex math problems often require multiple steps of sequential thought and reasoning, making them hard for single, non-reasoning models.
For an AIME math competition problem, GPT-4o (non-reasoning) took 45 seconds and gave a wrong answer, while OpenAI's o1 reasoning model took 293 seconds to reach the correct answer, highlighting the speed bottleneck for real-world applications.
Mixture of Agents (MoA) and Real-World Application 12:17
Mixture of Agents (MoA) leverages the collective intelligence of multiple LLMs to arrive at correct answers, taking advantage of Cerebras's fast inference.
MoA works by sending inputs to multiple LLMs (agents) with custom system prompts, each providing a response, which a final model then combines into a single answer.
This approach has been shown to outperform frontier models on certain benchmarks, as demonstrated by Together AI.
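A minimal sketch of that flow, assuming the Cerebras Cloud SDK's OpenAI-style chat-completions client; the model IDs, agent personas, and aggregation prompt are illustrative choices, not the exact setup shown in the workshop.

```python
from cerebras.cloud.sdk import Cerebras

client = Cerebras()  # reads CEREBRAS_API_KEY from the environment

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,  # model IDs are illustrative; check your provider's catalog
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

question = "Prove that the sum of two odd integers is even."

# Step 1: several proposer agents answer independently, each with its own system prompt.
agents = [
    ("llama3.1-8b", "You are a careful, step-by-step mathematician."),
    ("llama3.1-8b", "You are a skeptic who double-checks every claim."),
    ("llama-3.3-70b", "You explain solutions concisely for a general audience."),
]
drafts = [ask(model, persona, question) for model, persona in agents]

# Step 2: a final aggregator model merges the drafts into a single answer.
merged = ask(
    "llama-3.3-70b",
    "Combine the candidate answers below into one correct, concise response.",
    "\n\n---\n\n".join(drafts) + f"\n\nOriginal question: {question}",
)
print(merged)
```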
NinjaTech AI, a startup using Cerebras, solved the same complex math problem in 7.4 seconds using an MoA system.
NinjaTech AI's application involves a planning agent that generates proposals, a critique agent that evaluates feasibility, and a summarization agent that combines the top candidates into a final answer.
This process, though fast (7 seconds), involved generating over 500,000 tokens and 32 LLM calls (some parallel, some sequential).
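A rough, self-contained sketch of that planner/critique/summarizer shape is below; the ask() helper is a stand-in for a real chat-completions client, and the prompts, candidate count, and selection rule are assumptions for illustration only.

```python
import concurrent.futures

def ask(system: str, user: str) -> str:
    """Stand-in for a chat-completions call to a fast inference provider (e.g. Cerebras)."""
    return f"[{system.split('.')[0]}] draft answer for: {user[:60]}..."

def solve(problem: str, n_candidates: int = 4) -> str:
    # Planning agents: generate several candidate solutions in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        proposals = list(pool.map(
            lambda _: ask("You are a planning agent. Propose a complete solution.", problem),
            range(n_candidates),
        ))

    # Critique agent: evaluate each proposal's feasibility (a sequential pass).
    reviews = [ask("You are a critique agent. Rate feasibility and point out flaws.", p)
               for p in proposals]

    # Summarization agent: combine the top candidates into one final answer.
    # (A real system would rank by the critic's scores; we keep the first two for brevity.)
    top = "\n\n".join(proposals[:2])
    return ask("You are a summarization agent. Merge these into one final answer.",
               f"Problem: {problem}\n\nCandidates:\n{top}\n\nReviews:\n" + "\n".join(reviews))

print(solve("Find all integer solutions of x^2 - y^2 = 2024."))
```

The fan-out/fan-in structure is what drives the large token and call counts: every cycle multiplies the number of proposer responses that the later critique and summarization stages must read.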
MoA systems enable non-frontier or open-source models to perform better than frontier models by covering more surface area and providing comprehensive answers.
Cerebras solves the speed bottleneck that prevents reasoning models like o1 from being used in production for synchronous tasks due to their long inference times.
GPUs (e.g., the Nvidia H100) keep compute on-chip but store model weights largely in off-chip memory, so memory bandwidth becomes the bottleneck as models grow.
Cerebras tackles this with a radically different memory architecture: roughly 900,000 cores on a single wafer-scale chip, each paired one-to-one with its own local memory.
Keeping memory on-chip eliminates transfer time, allowing real-time computation.
Cerebras scales linearly to larger models: only activations (a small amount of data) are transferred between chips, often over a single Ethernet cable, unlike DGX clusters that require many interconnect links for inter-GPU communication.
Benefits of Mixture of Agents for Complex Problems 18:43
Monolithic LLMs often require continuous prompting and refinement for complex problems, hitting token limits and requiring chat restarts.
Mixture of Agents specializes each agent to solve a particular portion of a complex problem (analogous to different specialists collaborating on a surgery).
This approach allows for a "zero-shot" solution where one question yields a final answer, as specialized agents combine their results without continuous prompting.
The competitive part of the workshop involves configuring an MoA system to generate Python code for a calculate_user_matrix function, aiming for a maximum score of 120 points in an automated grader.
The challenge requires participants to act as AI prompt engineers and system architects, optimizing the main model, number of cycles (layers), temperature, and prompts for both summarization and individual agents.
Pre-set agents are provided for bug-fixing and performance optimization; the goal is to use the LLMs to fix the bugs and optimize the function.
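The knobs being tuned map naturally onto a small configuration object; the field names and values below are hypothetical, meant only to show what "main model, cycles, temperature, and prompts" look like in practice, not the grader's actual schema.

```python
# Hypothetical MoA configuration for the workshop challenge; field names are
# illustrative, not the grader's actual schema.
moa_config = {
    "main_model": "llama-3.3-70b",   # aggregator model that writes the final code
    "cycles": 2,                     # number of MoA layers (rounds of refinement)
    "temperature": 0.2,              # low temperature keeps generated code deterministic
    "summarization_prompt": (
        "Merge the agents' suggestions into one correct, efficient "
        "calculate_user_matrix implementation. Return only Python code."
    ),
    "agents": [
        {"model": "llama3.1-8b",
         "prompt": "Find and fix bugs in calculate_user_matrix; explain each fix."},
        {"model": "llama3.1-8b",
         "prompt": "Optimize calculate_user_matrix for speed without changing its output."},
    ],
}
```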
AutoML for MoA: Applying AI to solve multi-hour problems is already happening (e.g., codegen startups like Devin); Cerebras aims to reduce hours to minutes.
Global Distribution: Cerebras has six data centers in the US, with plans to open sites in France and Canada, and expects continued global expansion.
Model Onboarding Time: Varies with model architecture and kernel availability; models similar to ones already supported (e.g., Qwen 32B, which is close to Llama) onboard quickly, while new architectures require custom kernel development.
Power Consumption: Cerebras claims roughly one-third the power consumption of Nvidia GPUs for equivalent workloads, though its chips are physically much larger and deliver higher throughput.
MoA Benchmarks/Tradeoffs: MoA's performance depends on prompt tuning and system optimization. It's inspired by ensemble learning, where multiple models provide a more robust solution. Too many agents can lead to redundancy and unused agents, increasing time without improving the final solution.
Fine-tuned Models: Cerebras supports custom fine-tuned models for enterprise clients and is working on supporting LoRA fine-tuned models on its roadmap.
Diffusion Models: Currently in the research phase for scaling, Cerebras has seen promising internal demos of diffusion models running on their hardware, but they are not yet in the public inference API.
New Architectures: For new architectures not yet supported, Cerebras collaborates with customer engineers to ensure all necessary kernels are in place, leveraging their hardware's flexible design.
Real-time APIs & Multimodal Models: Cerebras has released its first multimodal API through Mistral's Le Chat app (some image-based queries run on Cerebras hardware) and plans to bring it to the public cloud soon. Real-time use cases are an interesting direction given its inference speed.
Model Sizes on Cerebras: There is no hard limit on model size; Cerebras can keep adding chips and scale linearly from 8-billion-parameter models to much larger ones.
Instance Partitioning: Cerebras currently offers its service via API, provisioning systems and handling load in the backend based on user workload and rate limits, rather than letting users partition individual systems.