[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han

Introduction and Background 00:00

  • Daniel Han opens the workshop, referencing his open source work, community involvement, and contributions to major AI model bug fixes.
  • Encourages attendees to use free GPU resources from Google Colab and Kaggle for model experimentation.
  • Introduces the session as a deep dive into reinforcement learning (RL), kernels, agents, and quantization for AI models.

History and Evolution of Open Source AI Models 03:23

  • Reviews how the leak of the original Llama weights boosted the open-source LLM movement.
  • Initial Llama models were trained on fewer tokens compared to recent models (e.g., Llama 1: 1.4T tokens vs. Llama 4: 30T).
  • Open-source and closed-source models performed comparably until September 2024, when closed-source models temporarily pulled ahead.
  • The "open-source drought" refers to periods when open source lagged behind new closed-source breakthroughs.
  • Noteworthy jump in capabilities when models adopted better supervised fine-tuning (SFT) and reinforcement learning (RL).
  • Cites Yann LeCun's "AI Cake" analogy: pre-training as cake, SFT as icing, RL as cherry on top.
  • Large models start from a pre-trained base and improve through multiple training stages: pre-training, fine-tuning, and RL.

Model Terminology and Training Stages 12:00

  • Highlights the community's need to standardize model naming (e.g., “instruct”, “PT”, “base”).
  • Describes a typical LLM lifecycle: pre-training → mid-training (higher quality data/context extension) → supervised fine-tuning → post-training (preference or RL-based).
  • Recent innovation includes direct RL with verifiable rewards (RLVR), potentially bypassing intermediate stages.
  • All training is framed as an optimization problem: weights move from random initialization toward strong performance through the successive "dots" (training phases) of the lifecycle above.

Agents and Reinforcement Learning Basics 16:50

  • Defines agents as entities taking actions within environments to maximize rewards (classic RL loop).
  • Adapts the RL paradigm for LLMs: there is often no persistent environment state between steps, so reward design becomes the critical lever.
  • For tasks like math, reward functions can be as simple as assigning a high reward to correct answers and scaled or penalized rewards otherwise (see the sketch after this list).
  • RL for LLMs seeks to increase the frequency of correct or desired outputs and penalize off-target ones.
  • Contrasts binary and distance-based rewards, and the trade-offs in evaluative granularity.
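
A minimal sketch of the kind of reward function described above, assuming a math task where the target is a single numeric answer; the function name, scaling constants, and penalty values are illustrative choices, not the speaker's exact code.

```python
# Illustrative math reward: full reward for an exact match, partial
# distance-based credit for nearby numeric answers, and a penalty for
# unparseable output. All constants here are arbitrary choices.
def math_reward(predicted: str, target: str) -> float:
    try:
        pred, gold = float(predicted.strip()), float(target.strip())
    except ValueError:
        return -1.0                    # output was not a number: penalize
    if pred == gold:
        return 2.0                     # exact match: full reward
    # Distance-based partial credit: closer answers score higher, floored at 0.
    return max(0.0, 1.0 - abs(pred - gold) / (abs(gold) + 1e-6))

print(math_reward("8", "8"))           # 2.0
print(math_reward("7.5", "8"))         # ~0.94
print(math_reward("cat", "8"))         # -1.0
```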

RL Application to Language Models 22:00

  • Recaps the transition from base to chat models via SFT, preference tuning, and RL.
  • Discusses reward model strategies: LLM-as-judge, regex checks, and code-execution validation (a code-execution sketch follows this list).
  • RL in LLMs doesn't always rely on full turn-by-turn memory (single-turn vs. multi-turn scenarios).
  • Reveals the empirical and sometimes arbitrary nature of reward function design.
  • Emphasizes efficiency and resource considerations when applying RL at scale.
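
As one concrete example of the code-execution strategy mentioned above, here is a hedged sketch: it runs a candidate program plus an assert-based test snippet in a subprocess and maps the outcome to a scalar reward. The function name, timeout, and reward values are assumptions for illustration; real pipelines sandbox execution properly.

```python
import subprocess
import sys
import tempfile
import textwrap

def code_execution_reward(candidate_code: str, test_snippet: str) -> float:
    """Run model-generated code plus assert-based tests in a subprocess and
    map the outcome to a reward. Illustrative only: production setups sandbox
    execution (containers, resource limits) before running anything."""
    program = candidate_code + "\n" + textwrap.dedent(test_snippet)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=5)
    except subprocess.TimeoutExpired:
        return -1.0                                  # runaway/looping code
    return 1.0 if result.returncode == 0 else -0.5   # asserts passed vs. failed

print(code_execution_reward("def add(a, b):\n    return a + b",
                            "assert add(2, 3) == 5"))   # 1.0
```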

Mathematical Formulations and RL Implementation 51:00

  • Details the fundamental RL goal: maximize expected reward via action selection policies.
  • Shows a step-by-step Pac-Man example to illustrate reward assignment and gradient calculation.
  • Explains proximal policy optimization (PPO): likelihood ratios, clipping/trust regions, and KL divergence to combat overfitting and reward hacking.
  • Key point: PPO and its variants add mechanisms (KL penalties, clipped objectives) to keep model updates stable and controlled; the standard clipped objective is reproduced below.
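
For reference, the standard PPO clipped surrogate objective (Schulman et al., 2017) that the clipping and trust-region discussion refers to; in practice a KL penalty against a frozen reference policy is often added on top, which is what the KL-divergence bullet describes.

```latex
% PPO clipped surrogate objective. r_t(\theta) is the likelihood ratio between
% the new and old policies, \hat{A}_t the advantage estimate, and \epsilon the
% clipping range that bounds how far a single update can move the policy.
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```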

Modern RL Algorithms and Practicalities 76:24

  • GRPO (Group Relative Policy Optimization) described: it streamlines RL further by removing the explicit value/reward models and computing group statistics (e.g., z-scores) over sampled completions (see the sketch after this list).
  • Emphasizes the importance of sampling diversity (temperature, top-p settings) to stimulate model exploration and avoid output monotony.
  • Recognizes the empirical/luck-driven aspect—positive rewards may appear sporadically and are amplified over time.
  • Priming via SFT or a small proxy dataset is recommended to avoid the model getting stuck with zero rewards in early RL stages.
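
A small sketch of the group-relative advantage computation GRPO is described as using: rewards for a group of completions sampled from the same prompt are standardized against that group's own mean and standard deviation, so no separate critic/value model is needed. The epsilon and example rewards are illustrative.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each completion's reward against the
    statistics of its own group (all completions from the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)       # population std over the group
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Eight completions for one prompt; only two earned any reward. The rare
# positive outcomes get large positive advantages and are reinforced.
print(group_relative_advantages([0, 0, 0, 2.0, 0, 0, 1.0, 0]))
```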

Reward Function Engineering and Open Source Specialization 91:00

  • Reward function creation is described as the hardest and most critical step for effective RL.
  • For math/coding, verifiable checks (distance-based, code correctness) are feasible; for open-ended tasks (summarization, legal reasoning), human or LLM-based scoring is necessary but problematic.
  • Open-source community progress depends on development and sharing of high-quality, diverse reward functions (plus access to compute).
  • Ultimate impact of RL may hinge on whether capabilities are already latent in base models (accentuating) or if RL is generating genuinely new skills (learning).

Notebook Walkthrough: Hands-on GRPO Implementation 120:13

  • Demonstrates use of the Unsloth library with a vLLM backend for fast, memory-efficient RL on open models using free Colab/Kaggle resources.
  • Details setup: loading models with quantization, applying parameter-efficient fine-tuning (LoRA), and customizing system and chat templates.
  • Shows reward function scripts for format verification (via regex), distance-based math rewards, etc., and how they influence the RL process (a condensed sketch of the setup follows this list).
  • Shares empirical training logs: most rewards may be poor initially, but with enough steps, the frequency of positive, correct outputs increases.
  • Shows the impact of priming/supervised steps to jumpstart RL progress.
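
A condensed sketch of the kind of setup the notebook walks through, assuming the Unsloth FastLanguageModel API plus TRL's GRPOTrainer roughly as they appear in Unsloth's public GRPO notebooks; the model id, hyperparameters, template, and reward values are illustrative, and argument names may differ between library versions.

```python
# Condensed GRPO setup: 4-bit model load with a vLLM generation backend,
# LoRA adapters, a regex format-check reward, and a GRPO trainer.
# Names/values are illustrative; see Unsloth's notebooks for the real thing.
import re
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # illustrative model choice
    max_seq_length=1024,
    load_in_4bit=True,       # quantized load to fit free Colab/Kaggle GPUs
    fast_inference=True,     # enable the vLLM backend for fast generation
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def format_reward(completions, **kwargs):
    """Reward completions that follow a <reasoning>/<answer> template."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c[0]["content"], re.DOTALL) else -1.0
            for c in completions]

# Tiny illustrative prompt dataset in TRL's conversational format.
dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": "What is 13 * 7?"}]},
    {"prompt": [{"role": "user", "content": "What is 2 + 2?"}]},
])

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward],        # correctness rewards added in practice
    args=GRPOConfig(max_steps=250, num_generations=8, learning_rate=5e-6,
                    max_prompt_length=256, max_completion_length=512,
                    output_dir="outputs"),
    train_dataset=dataset,
)
trainer.train()
```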

Quantization: Theory and Practice 152:31

  • Quantization can significantly reduce model size (up to 8x), with minor accuracy loss when applied selectively to the right layers.
  • Dynamic quantization strategies are recommended: some layers (e.g., attention layers or shared experts) must stay in higher precision (a selective-loading sketch follows this list).
  • Presents empirical and community findings: float formats down to FP4 are gaining traction on the latest hardware, marking a possible precision floor for further efficiency gains.
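
A minimal sketch of selective low-bit loading with Hugging Face transformers and bitsandbytes, in the spirit of the dynamic-quantization advice above; the model id and the choice of modules to keep in higher precision are assumptions, and the truly sensitive modules differ per architecture.

```python
# Load a model in 4-bit NF4 while keeping selected modules unquantized.
# Which modules to skip is architecture-dependent; "lm_head" is only an
# illustrative stand-in for "keep sensitive layers in higher precision".
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
    llm_int8_skip_modules=["lm_head"],      # modules left in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",     # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```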

Kernel and Efficiency Optimizations 160:51

  • For optimal training and inference speed, recommends using torch.compile and tuning its numerous configuration options (a minimal example follows this list).
  • Efficiency improvements in kernels, memory use, and distributed computation continue to advance the field, though most gains now come from smarter software, not just faster hardware.
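
A minimal torch.compile example on a toy module, just to show the knobs the talk points at; the toy model stands in for an actual LLM, and mode/fullgraph are the usual starting points for tuning.

```python
import torch
import torch.nn as nn

# Toy module standing in for a real model; torch.compile wraps any nn.Module.
model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

# "max-autotune" spends extra compile time searching for faster kernels;
# fullgraph=False tolerates graph breaks instead of erroring on them.
compiled = torch.compile(model, mode="max-autotune", fullgraph=False)

out = compiled(torch.randn(4, 128))   # first call triggers compilation
print(out.shape)
```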

Closing Q&A and Community Advice 162:02

  • Encourages contributions to open-source reward functions and use of shared resources (Unsloth notebooks, Discord, GitHub).
  • Addresses questions on model initialization, training tricks, and handling resource limits.
  • Suggests not obsessing over the latest academic papers; focus instead on practical, empirical iteration, efficiency, and sharing.
  • Workshop concludes with offer for further questions, reference to available slides, and distribution of stickers for the community.