[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han

Introduction and Background 00:00

  • Daniel Han opens the workshop, referencing his open source work, community involvement, and contributions to major AI model bug fixes.
  • Encourages attendees to use free GPU resources from Google Colab and Kaggle for model experimentation.
  • Introduces the session as a deep dive into reinforcement learning (RL), kernels, agents, and quantization for AI models.

History and Evolution of Open Source AI Models 03:23

  • Reviews how the leak of the original Llama weights boosted the open-source LLM movement.
  • Initial Llama models were trained on fewer tokens compared to recent models (e.g., Llama 1: 1.4T tokens vs. Llama 4: 30T).
  • Open-source and closed-source models performed comparably until September 2024, when closed-source models temporarily pulled ahead.
  • The "open-source drought" refers to periods when open source lagged behind new closed-source breakthroughs.
  • Noteworthy jump in capabilities when models adopted better supervised fine-tuning (SFT) and reinforcement learning (RL).
  • Cites Yann LeCun's "AI Cake" analogy: pre-training as cake, SFT as icing, RL as cherry on top.
  • Large models start from a pre-trained base and improve through multiple training stages: pre-training, fine-tuning, and RL.

Model Terminology and Training Stages 12:00

  • Highlights the community's need to standardize model naming (e.g., “instruct”, “PT”, “base”).
  • Describes a typical LLM lifecycle: pre-training → mid-training (higher quality data/context extension) → supervised fine-tuning → post-training (preference or RL-based).
  • Recent innovation includes direct RL with verifiable rewards (RLVR), potentially bypassing intermediate stages.
  • All training is framed as an optimization problem: weights move from random initialization toward strong performance through the successive "dots" (training phases) of the lifecycle above.

Agents and Reinforcement Learning Basics 16:50

  • Defines agents as entities taking actions within environments to maximize rewards (classic RL loop).
  • Adapts the RL paradigm for LLMs: there is often no persistent environment state between steps, so reward design becomes the critical lever.
  • For tasks like math, reward functions can be as simple as assigning a high reward to correct answers and scaled or penalized rewards otherwise (see the sketch after this list).
  • RL for LLMs seeks to increase the frequency of correct or desired outputs and penalize off-target ones.
  • Contrasts binary and distance-based rewards, and the trade-offs in evaluative granularity.
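
A minimal sketch of the kind of reward function described above, assuming a math task where the target is a single numeric answer; the function name, scaling constants, and penalty values are illustrative choices, not the speaker's exact code.

```python
# Illustrative math reward: full reward for an exact match, partial
# distance-based credit for nearby numeric answers, and a penalty for
# unparseable output. All constants here are arbitrary choices.
def math_reward(predicted: str, target: str) -> float:
    try:
        pred, gold = float(predicted.strip()), float(target.strip())
    except ValueError:
        return -1.0                    # output was not a number: penalize
    if pred == gold:
        return 2.0                     # exact match: full reward
    # Distance-based partial credit: closer answers score higher, floored at 0.
    return max(0.0, 1.0 - abs(pred - gold) / (abs(gold) + 1e-6))

print(math_reward("8", "8"))           # 2.0
print(math_reward("7.5", "8"))         # ~0.94
print(math_reward("cat", "8"))         # -1.0
```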

RL Application to Language Models 22:00

  • Recaps the transition from base to chat models via SFT, preference tuning, and RL.
  • Discusses reward model strategies: LLM-as-judge, regex checks, and code-execution validation (a code-execution sketch follows this list).
  • RL in LLMs doesn't always rely on full turn-by-turn memory (single-turn vs. multi-turn scenarios).
  • Reveals the empirical and sometimes arbitrary nature of reward function design.
  • Emphasizes efficiency and resource considerations when applying RL at scale.
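
As one concrete example of the code-execution strategy mentioned above, here is a hedged sketch: it runs a candidate program plus an assert-based test snippet in a subprocess and maps the outcome to a scalar reward. The function name, timeout, and reward values are assumptions for illustration; real pipelines sandbox execution properly.

```python
import subprocess
import sys
import tempfile
import textwrap

def code_execution_reward(candidate_code: str, test_snippet: str) -> float:
    """Run model-generated code plus assert-based tests in a subprocess and
    map the outcome to a reward. Illustrative only: production setups sandbox
    execution (containers, resource limits) before running anything."""
    program = candidate_code + "\n" + textwrap.dedent(test_snippet)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=5)
    except subprocess.TimeoutExpired:
        return -1.0                                  # runaway/looping code
    return 1.0 if result.returncode == 0 else -0.5   # asserts passed vs. failed

print(code_execution_reward("def add(a, b):\n    return a + b",
                            "assert add(2, 3) == 5"))   # 1.0
```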

Mathematical Formulations and RL Implementation 51:00

  • Details the fundamental RL goal: maximize expected reward via action selection policies.
  • Shows a step-by-step Pac-Man example to illustrate reward assignment and gradient calculation.
  • Explains proximal policy optimization (PPO): likelihood ratios, clipping/trust regions, and KL divergence to combat overfitting and reward hacking.
  • Key point: PPO and its variants add mechanisms (KL penalties, clipped objectives) to keep model updates stable and controlled; the standard clipped objective is reproduced below.
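
For reference, the standard PPO clipped surrogate objective (Schulman et al., 2017) that the clipping and trust-region discussion refers to; in practice a KL penalty against a frozen reference policy is often added on top, which is what the KL-divergence bullet describes.

```latex
% PPO clipped surrogate objective. r_t(\theta) is the likelihood ratio between
% the new and old policies, \hat{A}_t the advantage estimate, and \epsilon the
% clipping range that bounds how far a single update can move the policy.
L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```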

Modern RL Algorithms and Practicalities 76:24

  • GRPO (Group Relative Policy Optimization) described: it streamlines RL further by removing the explicit value/reward models and computing group statistics (e.g., z-scores) over sampled completions (see the sketch after this list).
  • Emphasizes the importance of sampling diversity (temperature, top-p settings) to stimulate model exploration and avoid output monotony.
  • Recognizes the empirical/luck-driven aspect—positive rewards may appear sporadically and are amplified over time.
  • Priming via SFT or a small proxy dataset is recommended to avoid the model getting stuck with zero rewards in early RL stages.
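
A small sketch of the group-relative advantage computation GRPO is described as using: rewards for a group of completions sampled from the same prompt are standardized against that group's own mean and standard deviation, so no separate critic/value model is needed. The epsilon and example rewards are illustrative.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each completion's reward against the
    statistics of its own group (all completions from the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)       # population std over the group
    return [(r - mean) / (std + 1e-6) for r in rewards]

# Eight completions for one prompt; only two earned any reward. The rare
# positive outcomes get large positive advantages and are reinforced.
print(group_relative_advantages([0, 0, 0, 2.0, 0, 0, 1.0, 0]))
```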

Reward Function Engineering and Open Source Specialization 91:00

  • Reward function creation is described as the hardest and most critical step for effective RL.
  • For math/coding, verifiable checks (distance-based, code correctness) are feasible; for open-ended tasks (summarization, legal reasoning), human or LLM-based scoring is necessary but problematic.
  • Open-source community progress depends on development and sharing of high-quality, diverse reward functions (plus access to compute).
  • Ultimate impact of RL may hinge on whether capabilities are already latent in base models (accentuating) or if RL is generating genuinely new skills (learning).

Notebook Walkthrough: Hands-on GRPO Implementation 120:13

  • Demonstrates use of the Unsloth library with a vLLM backend for fast, memory-efficient RL on open models using free Colab/Kaggle resources.
  • Details setup: loading models with quantization, applying parameter-efficient fine-tuning (LoRA), and customizing system and chat templates.
  • Shows reward function scripts for format verification (via regex), distance-based math rewards, etc., and how they influence the RL process (a condensed sketch of the setup follows this list).
  • Shares empirical training logs: most rewards may be poor initially, but with enough steps, the frequency of positive, correct outputs increases.
  • Shows the impact of priming/supervised steps to jumpstart RL progress.
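
A condensed sketch of the kind of setup the notebook walks through, assuming the Unsloth FastLanguageModel API plus TRL's GRPOTrainer roughly as they appear in Unsloth's public GRPO notebooks; the model id, hyperparameters, template, and reward values are illustrative, and argument names may differ between library versions.

```python
# Condensed GRPO setup: 4-bit model load with a vLLM generation backend,
# LoRA adapters, a regex format-check reward, and a GRPO trainer.
# Names/values are illustrative; see Unsloth's notebooks for the real thing.
import re
from datasets import Dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",  # illustrative model choice
    max_seq_length=1024,
    load_in_4bit=True,       # quantized load to fit free Colab/Kaggle GPUs
    fast_inference=True,     # enable the vLLM backend for fast generation
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

def format_reward(completions, **kwargs):
    """Reward completions that follow a <reasoning>/<answer> template."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c[0]["content"], re.DOTALL) else -1.0
            for c in completions]

# Tiny illustrative prompt dataset in TRL's conversational format.
dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": "What is 13 * 7?"}]},
    {"prompt": [{"role": "user", "content": "What is 2 + 2?"}]},
])

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward],        # correctness rewards added in practice
    args=GRPOConfig(max_steps=250, num_generations=8, learning_rate=5e-6,
                    max_prompt_length=256, max_completion_length=512,
                    output_dir="outputs"),
    train_dataset=dataset,
)
trainer.train()
```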

Quantization: Theory and Practice 152:31

  • Quantization can significantly reduce model size (up to 8x), with minor accuracy loss when applied selectively to the right layers.
  • Dynamic quantization strategies are recommended: some layers (e.g., attention layers or shared experts) must stay in higher precision (a selective-loading sketch follows this list).
  • Presents empirical and community findings: float formats down to FP4 are gaining traction on the latest hardware, marking a possible precision floor for further efficiency gains.
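
A minimal sketch of selective low-bit loading with Hugging Face transformers and bitsandbytes, in the spirit of the dynamic-quantization advice above; the model id and the choice of modules to keep in higher precision are assumptions, and the truly sensitive modules differ per architecture.

```python
# Load a model in 4-bit NF4 while keeping selected modules unquantized.
# Which modules to skip is architecture-dependent; "lm_head" is only an
# illustrative stand-in for "keep sensitive layers in higher precision".
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
    llm_int8_skip_modules=["lm_head"],      # modules left in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",     # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
```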

Kernel and Efficiency Optimizations 160:51

  • For optimal training and inference speed, recommends using torch.compile and tuning its numerous configuration options (a minimal example follows this list).
  • Efficiency improvements in kernels, memory use, and distributed computation continue to advance the field, though most gains now come from smarter software, not just faster hardware.
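
A minimal torch.compile example on a toy module, just to show the knobs the talk points at; the toy model stands in for an actual LLM, and mode/fullgraph are the usual starting points for tuning.

```python
import torch
import torch.nn as nn

# Toy module standing in for a real model; torch.compile wraps any nn.Module.
model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

# "max-autotune" spends extra compile time searching for faster kernels;
# fullgraph=False tolerates graph breaks instead of erroring on them.
compiled = torch.compile(model, mode="max-autotune", fullgraph=False)

out = compiled(torch.randn(4, 128))   # first call triggers compilation
print(out.shape)
```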

Closing Q&A and Community Advice 162:02

  • Encourages contributions to open-source reward functions and use of shared resources (Unsloth notebooks, Discord, GitHub).
  • Addresses questions on model initialization, training tricks, and handling resource limits.
  • Suggests not obsessing over the latest academic papers; focus instead on practical, empirical iteration, efficiency, and sharing.
  • Workshop concludes with offer for further questions, reference to available slides, and distribution of stickers for the community.