POV: Chinese AI Lab Teaching Everyone How To Save Millions of Dollars
Introduction to ByteDance's AI Lab and Model Merging 00:00
ByteDance's AI lab, Seed, is rapidly outpacing other Chinese AI labs and has the resources to compete globally.
Their video model Seedance 1.0 outperforms Google's new Veo 3 in video and audio generation benchmarks.
The lab has released numerous influential research papers in recent months.
One key concept explored is model merging, a technique that has been common in image generation but is far less discussed for large language models (LLMs).
Fundamentals and Challenges of Model Merging in Pre-training 00:48
Model merging during the pre-training stage of LLMs is rare due to the high cost of experiments.
Pre-training runs for large models (e.g., 70B parameters) can cost around $2 million each.
Because of expense and competitive secrecy, labs rarely publish detailed methodologies for pre-training model merging.
Seed published a paper detailing its techniques for model merging during pre-training, sharing methods that can save millions of dollars and help recover from failed training runs.
RunPod Sponsorship and Infrastructure for Model Training 02:37
Brief segment featuring RunPod, a service that simplifies AI training deployment with serverless infrastructure.
RunPod’s Hub allows one-click deployment of popular AI repositories and offers a revenue share program for creators.
The service aims to support open-source AI projects and make infrastructure more accessible.
Details of Pre-trained Model Averaging (PMA) 03:53
Typical pre-training uses a warmup-stable-decay learning-rate schedule: a brief warm-up, a long stable phase at a constant learning rate, and a final annealing phase in which the learning rate decays.
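As an illustration only, here is a minimal sketch of such a warmup-stable-decay schedule; the phase lengths, peak learning rate, and linear decay shape are placeholder assumptions, not the paper's settings.

```python
def wsd_learning_rate(step, peak_lr=3e-4, warmup_steps=2_000,
                      stable_steps=90_000, decay_steps=8_000,
                      min_lr=3e-5):
    """Illustrative warmup-stable-decay (WSD) schedule.

    Phase 1: linear warm-up to the peak learning rate.
    Phase 2: constant ("stable") learning rate for most of training.
    Phase 3: decay ("annealing") down to a small floor.
    All hyperparameters here are placeholders, not the paper's values.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    decay_progress = (step - warmup_steps - stable_steps) / decay_steps
    decay_progress = min(decay_progress, 1.0)
    return peak_lr + (min_lr - peak_lr) * decay_progress
```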
Seed's model merging method saves checkpoints at regular token intervals along a single training run and averages them to produce a merged model.
The merged (PMA) model predicts final performance ahead of time, potentially saving 3–6 training days (~15% of compute budget).
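A minimal sketch of this checkpoint-averaging step, assuming standard PyTorch state dicts saved along one run; the file names and the number of checkpoints are purely illustrative.

```python
import torch

def merge_checkpoints(checkpoint_paths):
    """Average the weights of several checkpoints saved along one
    pre-training run (a simple moving average over the window).
    Assumes every file is a state dict with identical keys/shapes."""
    merged = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    n = len(checkpoint_paths)
    return {k: v / n for k, v in merged.items()}

# Hypothetical usage: checkpoints saved at regular token intervals
# during the stable (constant learning rate) phase.
# pma_weights = merge_checkpoints([
#     "ckpt_4000B_tokens.pt", "ckpt_4200B_tokens.pt", "ckpt_4400B_tokens.pt",
# ])
```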
PMA was tested on both dense and mixture-of-experts (MoE) models, ranging from hundreds of millions to hundreds of billions of parameters.
Total experiment cost could reach up to $15 million in GPU time.
Merging checkpoints from the constant-learning-rate phase nearly matches or even outperforms the annealed models, with only minimal gains from prolonged annealing.
Compute savings: for some models, skipping the additional annealing phase saves tens of thousands of dollars with a negligible performance difference.
Checkpoint interval should scale with model size for best results.
The paper compared three checkpoint-averaging methods: simple moving average (SMA), exponential moving average (EMA), and weighted moving average (WMA); the three rules are sketched below. SMA proved most effective, owing to its simplicity and its equal weighting across the full averaging window.
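For reference, the three averaging rules as they are commonly defined; the smoothing factor and the linear weighting convention here are assumptions for illustration and may differ from the paper's exact choices. Each function operates on a list of tensors for one parameter; in practice you would apply it key by key over the checkpoints' state dicts.

```python
import torch

def sma(weights):
    """Simple moving average: equal weight on every checkpoint in the window."""
    return sum(weights) / len(weights)

def ema(weights, alpha=0.1):
    """Exponential moving average: newer checkpoints receive exponentially
    more weight. alpha is an illustrative smoothing factor."""
    avg = weights[0].clone()
    for w in weights[1:]:
        avg = alpha * w + (1 - alpha) * avg
    return avg

def wma(weights):
    """Weighted moving average: linearly increasing weight toward the most
    recent checkpoint (one common convention; the paper's may differ)."""
    coeffs = torch.arange(1, len(weights) + 1, dtype=weights[0].dtype)
    total = sum(c * w for c, w in zip(coeffs, weights))
    return total / coeffs.sum()
```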
Model merging with SMA effectively reduces high-frequency noise in the weights, acting as a "low-pass filter" that smooths out oscillations from the constant-learning-rate phase.
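One way to see the low-pass-filter view (a standard signal-processing identity, not taken from the paper): an N-point simple moving average has frequency response

$$
H(\omega) \;=\; \frac{1}{N}\sum_{k=0}^{N-1} e^{-i\omega k}
\;=\; \frac{e^{-i\omega (N-1)/2}}{N}\,\frac{\sin(N\omega/2)}{\sin(\omega/2)},
$$

whose magnitude is largest at $\omega = 0$ and attenuated at higher frequencies, so averaging checkpoints preserves the slow drift of the weights while damping fast oscillations.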
In the paper's performance contour plots, the merged checkpoint sits at the "peak", visualizing the effectiveness of merging.
Applying both SMA and learning rate decay (annealing) is largely redundant, but annealing can push performance marginally higher.