POV: Chinese AI Lab Teaching Everyone How To Save Millions of Dollars
Introduction to ByteDance's AI Lab and Model Merging 00:00
ByteDance's AI lab, Seed, is rapidly outpacing other Chinese AI labs and has the resources to compete globally.
Their video model Seedance 1.0 outperforms Google's new Veo 3 in video and audio generation benchmarks.
The lab has released numerous influential research papers in recent months.
One key concept explored is model merging, a technique that has been common in image generation but is far less discussed for large language models (LLMs).
Fundamentals and Challenges of Model Merging in Pre-training 00:48
Model merging during the pre-training stage of LLMs is rare due to the high cost of experiments.
Pre-training runs for large models (e.g., 70B parameters) can cost around $2 million each.
Because of expense and competitive secrecy, labs rarely publish detailed methodologies for pre-training model merging.
Seed published a paper detailing its techniques for model merging during pre-training, sharing methods that can save millions of dollars and help recover from failed training runs.
RunPod Sponsorship and Infrastructure for Model Training 02:37
Brief segment featuring RunPod, a service that simplifies AI training deployment with serverless infrastructure.
RunPod’s Hub allows one-click deployment of popular AI repositories and offers a revenue share program for creators.
The service aims to support open-source AI projects and make infrastructure more accessible.
Details of Pre-trained Model Averaging (PMA) 03:53
Typical pre-training uses a warmup-stable-decay learning-rate schedule: a brief warm-up, a long stable phase at a constant learning rate, and a final annealing phase in which the learning rate decays.
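As an illustration only, here is a minimal sketch of such a warmup-stable-decay schedule; the phase lengths, peak learning rate, and linear decay shape are placeholder assumptions, not the paper's settings.

```python
def wsd_learning_rate(step, peak_lr=3e-4, warmup_steps=2_000,
                      stable_steps=90_000, decay_steps=8_000,
                      min_lr=3e-5):
    """Illustrative warmup-stable-decay (WSD) schedule.

    Phase 1: linear warm-up to the peak learning rate.
    Phase 2: constant ("stable") learning rate for most of training.
    Phase 3: decay ("annealing") down to a small floor.
    All hyperparameters here are placeholders, not the paper's values.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    decay_progress = (step - warmup_steps - stable_steps) / decay_steps
    decay_progress = min(decay_progress, 1.0)
    return peak_lr + (min_lr - peak_lr) * decay_progress
```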
Seed's model merging method saves checkpoints at regular token intervals along a single training run and averages them to produce a merged model.
The merged (PMA) model predicts final performance ahead of time, potentially saving 3–6 training days (~15% of compute budget).
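A minimal sketch of this checkpoint-averaging step, assuming standard PyTorch state dicts saved along one run; the file names and the number of checkpoints are purely illustrative.

```python
import torch

def merge_checkpoints(checkpoint_paths):
    """Average the weights of several checkpoints saved along one
    pre-training run (a simple moving average over the window).
    Assumes every file is a state dict with identical keys/shapes."""
    merged = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    n = len(checkpoint_paths)
    return {k: v / n for k, v in merged.items()}

# Hypothetical usage: checkpoints saved at regular token intervals
# during the stable (constant learning rate) phase.
# pma_weights = merge_checkpoints([
#     "ckpt_4000B_tokens.pt", "ckpt_4200B_tokens.pt", "ckpt_4400B_tokens.pt",
# ])
```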
PMA was tested on both dense and mixture-of-experts (MoE) models, ranging from hundreds of millions to hundreds of billions of parameters.
Total experiment cost could reach up to $15 million in GPU time.
Merging checkpoints from the constant-learning-rate phase nearly matches or even outperforms the annealed models, with only minimal gains from prolonged annealing.
Compute savings: for some models, skipping the additional annealing phase saves tens of thousands of dollars with a negligible performance difference.
Checkpoint interval should scale with model size for best results.
The paper compared three checkpoint-averaging methods: simple moving average (SMA), exponential moving average (EMA), and weighted moving average (WMA); the three rules are sketched below. SMA proved most effective, owing to its simplicity and its equal weighting across the full averaging window.
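For reference, the three averaging rules as they are commonly defined; the smoothing factor and the linear weighting convention here are assumptions for illustration and may differ from the paper's exact choices. Each function operates on a list of tensors for one parameter; in practice you would apply it key by key over the checkpoints' state dicts.

```python
import torch

def sma(weights):
    """Simple moving average: equal weight on every checkpoint in the window."""
    return sum(weights) / len(weights)

def ema(weights, alpha=0.1):
    """Exponential moving average: newer checkpoints receive exponentially
    more weight. alpha is an illustrative smoothing factor."""
    avg = weights[0].clone()
    for w in weights[1:]:
        avg = alpha * w + (1 - alpha) * avg
    return avg

def wma(weights):
    """Weighted moving average: linearly increasing weight toward the most
    recent checkpoint (one common convention; the paper's may differ)."""
    coeffs = torch.arange(1, len(weights) + 1, dtype=weights[0].dtype)
    total = sum(c * w for c, w in zip(coeffs, weights))
    return total / coeffs.sum()
```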
Model merging with SMA effectively reduces high-frequency noise in the weights, acting as a "low-pass filter" that smooths out oscillations from the constant-learning-rate phase.
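One way to see the low-pass-filter view (a standard signal-processing identity, not taken from the paper): an N-point simple moving average has frequency response

$$
H(\omega) \;=\; \frac{1}{N}\sum_{k=0}^{N-1} e^{-i\omega k}
\;=\; \frac{e^{-i\omega (N-1)/2}}{N}\,\frac{\sin(N\omega/2)}{\sin(\omega/2)},
$$

whose magnitude is largest at $\omega = 0$ and attenuated at higher frequencies, so averaging checkpoints preserves the slow drift of the weights while damping fast oscillations.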
In the paper's performance contour plots, the merged checkpoint sits at the "peak", visualizing the effectiveness of merging.
Applying both SMA and learning rate decay (annealing) is largely redundant, but annealing can push performance marginally higher.