Latent Space Paper Club: AIEWF Special Edition (Test of Time, DeepSeek R1/V3) — Vibhu Sapra

Paper Club Overview and New "Test of Time" Series 00:00

  • The Paper Club has run every week for a year and a half with strong community and author participation, averaging 100 attendees per session, with some events (like DeepSeek V3) exceeding 300.
  • Sessions feature direct interaction with authors from leading organizations (Nvidia, Meta, Amazon, etc.), with presentations and Q&A.
  • The new "Test of Time" Paper Club is launching, focused on foundational AI concepts and core papers necessary for AI engineers.
  • This series will run July–December, aiming to cover 50–100 key papers across 24 weeks, with two to four papers per session.
  • Topics include deep learning foundations (attention, optimization, RL), LLM foundations (RNNs, LSTMs, BERT, GPT-2), generative models (Llama 3, DeepSeek), scaling laws, distillation, and specific areas like speech, image, and video generation.
  • There will be both in-person (San Francisco) and remote sections; community contributions and speaker volunteers are encouraged.
  • The curriculum is flexible, with input sought via forms and Discord.

Introducing and Recapping DeepSeek Progress 07:42

  • DeepSeek papers have generated significant attention, with key live discussions and thousands of views.
  • The May 28 DeepSeek R1 update is a notable improvement: an incremental release rather than a full version bump, but a major upgrade over previous releases.
  • Changes include better post-training, significant advances in reasoning, and improved benchmark scores (AIME score jumped from 70% to 87.5%).

DeepSeek R1 and Model Improvements 09:53

  • DeepSeek V3-level models now match or exceed benchmarks set by proprietary models like OpenAI’s o3 and Google’s Gemini 2.5 in math, coding, and reasoning.
  • The latest iteration roughly doubled the average reasoning length, from about 12,000 to about 25,000 tokens, by using more RL in training.
  • Resulting models show an 18% improvement on benchmarks and enhanced structured output (JSON, function calling).
  • Distillation of the new reasoning models into smaller architectures (e.g., Qwen3-8B) results in small models performing on par with much larger models (235B parameters).
  • Chain-of-thought distillation from strong base/large models produces efficient, high-performing small models, with up to 10% boosts over previous versions (a minimal sketch of this SFT-style distillation follows this list).
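
In practice, this distillation amounts to plain supervised fine-tuning (SFT) on the teacher's reasoning traces. Below is a minimal sketch assuming traces are available as (prompt, reasoning, answer) triples; the student checkpoint name, the `<think>` tag format, and the toy trace are illustrative assumptions, not details from the talk.

```python
# Chain-of-thought distillation as plain SFT on teacher traces (a sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-8B"  # hypothetical student checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def build_example(prompt: str, reasoning: str, answer: str) -> dict:
    """Pack one teacher trace into a training sequence; mask the prompt so
    cross-entropy loss is computed only on the reasoning and answer tokens."""
    text = f"{prompt}\n<think>\n{reasoning}\n</think>\n{answer}{tokenizer.eos_token}"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt + "\n", return_tensors="pt").input_ids.shape[1]
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100  # -100 is ignored by the loss
    return {"input_ids": input_ids, "labels": labels}

# Toy trace standing in for outputs sampled from the large reasoning model.
traces = [("What is 17 * 3?", "17 * 3 = 17 * 2 + 17 = 34 + 17 = 51.", "51")]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, reasoning, answer in traces:
    batch = build_example(prompt, reasoning, answer)
    loss = model(**batch).loss  # standard next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```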

Understanding Test-Time Reasoning, Scaling, and RL Methods 16:04

  • The field previously relied on scaling models with more data and compute (Chinchilla scaling laws, large-token runs), which became expensive and hit practical limits.
  • The new paradigm focuses on maximizing reasoning at inference (test) time rather than endless pre-training, shifting compute to where it matters most.
  • DeepSeek pioneered the use of pure RL (GRPO, Group Relative Policy Optimization) on base models to develop reasoning, resulting in "emergent" behavior, including reflection and "aha" moments without supervised data (see the GRPO sketch after this list).
  • R1-Zero, trained with RL alone, was strong on reasoning tasks but lacked general assistant capability.
  • R1 is trained in a four-stage pipeline: SFT cold start, RL for reasoning, rejection sampling, and further RL.
  • Distillation (SFT-style) of reasoning traces into models in the Qwen and Llama families demonstrates superior performance in small models compared to native RL approaches.
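
To make the GRPO mechanics concrete, here is a minimal sketch of its core idea: advantages are normalized within a group of sampled completions for the same prompt, so no learned value model (critic) is needed. The rewards and log-probabilities below are toy values, and the KL penalty to a reference policy that the full objective includes is omitted for brevity.

```python
# Group-relative advantages and the clipped surrogate at the heart of GRPO.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G sampled completions of one prompt.
    Normalizing within the group replaces a learned critic."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped surrogate over (G,) sequence log-probs.
    The KL-to-reference term of the full GRPO objective is omitted here."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage: 4 completions of one math prompt; 2 passed the verifier (reward 1).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)           # correct answers get positive advantage
logp_old = torch.tensor([-12.0, -15.0, -11.0, -14.0])
logp_new = logp_old + 0.1                # pretend the policy moved slightly
print(grpo_loss(logp_new, logp_old, adv))
```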

Emergence, "Aha" Moments, and Reflection 35:07

  • As models reason for more tokens, emergent behaviors appear—models display reflection (revisiting/re-evaluating their own steps) and genuine “aha” moments when solving complex problems.
  • RL encourages this emergence: rather than being explicitly instructed, models develop advanced strategies autonomously when incentivized correctly.

DeepSeek Training, Distillation, and Performance 38:31

  • R1-Zero is obtained by applying RL (no SFT) to a base model on verifiable tasks like math and code, optimizing outputs for correctness and formatted thinking (chain-of-thought).
  • R1 builds on R1-Zero by adding cold-start SFT, RL for reasoning, rejection sampling, and a final round of RL for stability and assistant usefulness (the rejection-sampling step is sketched after this list).
  • Fine-tuning with human annotation, chain-of-thought SFT, and large-scale RL enables models to combine deep reasoning with general chat/assistant skills.
  • Distillation makes small models (down to 8B parameters) nearly as competitive as much larger models on reasoning tasks.
  • Applying RL directly to small models is less effective than distilling from strong reasoning models; a cold start with SFT is often critical for success.
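
The rejection-sampling step mentioned above reduces to a simple filter: sample several completions per prompt from the reasoning checkpoint, keep only those a verifier accepts, and recycle the survivors as SFT data. A minimal sketch, where `generate_completion` and `verify` are hypothetical stand-ins for the model and the checker:

```python
# Rejection sampling to harvest verified completions as SFT data (a sketch).
import random
from typing import Callable

def rejection_sample(prompts: list[str],
                     generate_completion: Callable[[str], str],
                     verify: Callable[[str, str], bool],
                     samples_per_prompt: int = 8) -> list[tuple[str, str]]:
    """Return only the (prompt, completion) pairs that pass verification."""
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate_completion(prompt)
            if verify(prompt, completion):  # e.g., exact match on a math answer
                kept.append((prompt, completion))
    return kept

# Toy usage with a fake "model" and an exact-match checker.
answers = {"2+2=?": "4"}
fake_model = lambda prompt: random.choice(["4", "5"])
checker = lambda prompt, completion: answers.get(prompt) == completion
sft_data = rejection_sample(list(answers), fake_model, checker, samples_per_prompt=4)
print(sft_data)  # only completions equal to "4" survive
```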

Latest DeepSeek Updates, Benchmarking, and Takeaways 42:49

  • The May 28 DeepSeek R1 update brought native function calling, clean JSON output, doubled reasoning length, better performance in math and coding, and less hallucination, achieving benchmark parity with OpenAI o3 and Gemini 2.5 (see the API sketch after this list).
  • Qwen3-8B, distilled from the new reasoning model, now matches the performance of much larger models (the 235B-parameter Qwen3 reasoning model).
  • The open-source release enables small models that run on-device while reasoning on par with state-of-the-art offerings.
  • There is still unexplored potential in RL-style distillation on small models.
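
As a usage-level illustration of the native function calling called out above, here is a sketch against an OpenAI-compatible client. The base URL and the `deepseek-reasoner` model name follow DeepSeek's published API conventions, and the `get_weather` tool is a hypothetical example; treat all three as assumptions rather than details confirmed in the talk.

```python
# Exercising native function calling via an OpenAI-compatible client (a sketch).
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # the R1 line in DeepSeek's API naming
    messages=[{"role": "user", "content": "What's the weather in SF today?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # a structured call instead of prose
```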

Future Plans and Community Involvement 47:53

  • "Test of Time" Paper Club will cover foundational papers and key AI engineering topics, open to community suggestions and speakers.
  • All sessions will be streamed and archived, ensuring accessibility and continued discussion.
  • The club continues to encourage open research, model recreation, and community-led benchmarking.
  • Regular weekly Paper Club sessions will continue in parallel.

Final Recap and Acknowledgments 49:52

  • The May 28 DeepSeek R1 update reasons for roughly twice as long, reaches performance parity with top closed models, and enables high-quality, small-scale, open-source reasoning models.
  • Community involvement, regular attendees, and volunteer speakers are critical to the Paper Club's ongoing success.
  • Invitation extended to participate, recommend papers, and join the club via Discord or form links.