OpenThoughts: Data Recipes for Reasoning Models — Ryan Marten, Bespoke Labs

Introduction & Motivation 00:00

  • Ryan Marten from Bespoke Labs introduces OpenThoughts, a project to create high-quality open-source reasoning datasets.
  • Focus shifts from reinforcement learning (RL) to reasoning in model development.
  • Recent months have seen major performance improvements on reasoning benchmarks, especially in tasks like competitive math.
  • Noted that the DeepSeek R1 model's impressive performance came largely from supervised fine-tuning (SFT) on 800K examples, 600K of them reasoning-focused.
  • There's a clear "training recipe" for strong reasoning models, but a lack of clarity on the optimal "data recipe".

The Case for Custom Reasoning Models 02:32

  • Training custom reasoning models delivers advantages in performance, privacy, speed, cost, and ownership.
  • RL is effective for reasoning tasks, but SFT is noted as much easier and highly effective for many use cases.

OpenThoughts 3 Dataset & Benchmarking 03:37

  • OpenThoughts 3, the latest version of the reasoning dataset, was released just hours before this talk.
  • The dataset yields improved accuracy across domains: competitive math (AIME), competitive code (LiveCodeBench), and science (GPQA Diamond).
  • Compared head-to-head against other open reasoning datasets such as Nvidia's Nemotron Nano data, the OpenThoughts recipe shifts the scaling curve upward, improving results at similar data and model sizes.
  • OpenThoughts 3's 7B model outperforms both DeepSeek-R1-Distill-Qwen-7B and Nemotron Nano on several benchmarks.

Building the Data Recipe: Process and Key Learnings 06:26

  • The data pipeline consists of: sourcing questions, mixing sources, filtering questions, using teacher models for answer generation (distillation), further filtering, and teacher model selection.
  • Over a thousand experiments were performed to empirically select optimal steps at every stage.
  • Rigorously iterated on each pipeline component, revealing key strategies beneficial for building high-quality reasoning data.
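The staged pipeline described above can be sketched as composable functions. This is an illustrative skeleton only: the function names, data shapes, and stand-in filter/teacher below are hypothetical, not the OpenThoughts project's actual API.

```python
# Hypothetical sketch of an OpenThoughts-style data pipeline:
# source questions -> filter -> distill answers from a teacher model.

def source_questions(sources):
    """Pool candidate questions from multiple sources (forums, synthetic, etc.)."""
    return [q for src in sources for q in src]

def filter_questions(questions, keep):
    """Drop questions that fail a quality/difficulty predicate."""
    return [q for q in questions if keep(q)]

def distill_answers(questions, teacher, samples_per_question=1):
    """Collect one or more reasoning traces per question from a teacher model."""
    return [(q, [teacher(q) for _ in range(samples_per_question)])
            for q in questions]

def build_dataset(sources, keep, teacher, samples_per_question=1):
    questions = source_questions(sources)
    questions = filter_questions(questions, keep)
    return distill_answers(questions, teacher, samples_per_question)

# Toy usage with stand-in components:
sources = [["What is 2+2?", "hi"], ["Prove 1+1=2"]]
keep = lambda q: len(q) > 5                      # stand-in quality filter
teacher = lambda q: f"<think>...</think> answer to: {q}"  # stand-in teacher
data = build_dataset(sources, keep, teacher, samples_per_question=2)
```

Keeping each stage a separate function mirrors the talk's approach of experimenting on pipeline components independently.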

Experimental Insights & Surprising Results 08:02

  • Sampling multiple answers (reasoning traces) per question greatly enhances performance; scaling the number of answers yields significant accuracy gains.
  • A model's quality as a teacher is not strictly linked to its benchmark performance; for example, QwQ-32B outperformed DeepSeek R1 as a teacher despite being the smaller model.
  • Synthetic question sources can outperform human-written or forum-scraped questions, and are highly scalable.
  • Filtering for question difficulty, either by LLM-predicted difficulty or by teacher response length, improves data quality more effectively than embedding-based or classifier-based methods.
  • Focusing on a smaller number of high-quality sources is superior to maximizing source diversity.
  • Contrary to common belief, answer verification (filtering on answer correctness) did not improve SFT/distillation outcomes, possibly because hard questions retain value even when the teacher's answers are imperfect.
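The length-based difficulty filter mentioned above is simple to implement: keep the questions whose teacher responses are longest, on the theory that longer reasoning traces signal harder questions. A minimal sketch, with an illustrative keep-fraction and data shape:

```python
# Length-based difficulty filtering (one strategy the talk found effective).
# The keep_fraction threshold and (question, response) tuple shape are
# illustrative choices, not the project's actual configuration.

def filter_by_response_length(qa_pairs, keep_fraction=0.5):
    """Keep the top `keep_fraction` of questions by teacher-response length."""
    ranked = sorted(qa_pairs, key=lambda qa: len(qa[1]), reverse=True)
    cutoff = max(1, int(len(ranked) * keep_fraction))
    return ranked[:cutoff]

pairs = [
    ("easy", "short answer"),
    ("medium", "a somewhat longer reasoning trace ..."),
    ("hard", "a very long multi-step reasoning trace " * 5),
]
kept = filter_by_response_length(pairs, keep_fraction=2 / 3)
# keeps the two questions with the longest responses
```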

Recommendations for Specialized Reasoning Models 11:44

  • For new domains, start with the OpenThoughts recipe and adapt based on the domain’s specifics.
  • Test and iterate on specific pipeline steps for your data type, since optimal strategies vary: for instance, filtering by length works well in science and math, while filtering by difficulty works well in code.
  • Synthetic data generation is recommended for domain expansion if native data is lacking; the Curator open-source library is offered for this purpose.
  • Rigorous evaluation is essential; use the Evalchemy library for structured, repeatable evaluations.
  • For small eval sets, run multiple passes and average results to minimize noise and more accurately assess data or model changes.
  • Distillation can sometimes surpass the teacher model in specific domains, as demonstrated in legal reasoning benchmarks.
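The multi-pass evaluation advice above can be sketched directly: run a small, noisy benchmark several times and report the mean (and spread) rather than a single pass. The eval function here is a stand-in for a real harness such as Evalchemy, and the toy model is purely illustrative.

```python
# Averaging several passes over a small eval set to reduce sampling noise.
import random
import statistics

def run_eval(model, questions, seed):
    """Stand-in for one noisy benchmark pass; returns accuracy in [0, 1]."""
    rng = random.Random(seed)
    return statistics.mean(model(q, rng) for q in questions)

def averaged_eval(model, questions, passes=8):
    """Run the benchmark `passes` times and report mean and spread."""
    scores = [run_eval(model, questions, seed=i) for i in range(passes)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy model that answers correctly about 70% of the time:
model = lambda q, rng: 1.0 if rng.random() < 0.7 else 0.0
mean, spread = averaged_eval(model, questions=list(range(30)), passes=8)
```

Reporting the spread alongside the mean makes it easier to tell whether a data or model change is a real improvement or just noise.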

Q&A and Closing 16:43

  • SFT enables "long thinking" by fine-tuning on questions whose answers include extended reasoning traces; the model learns to generate similar, lengthy outputs.
  • A model's utility as a teacher stems from factors like reasoning trace length and output formatting, not just raw capability.
  • Current analysis does not identify precisely where, in chain-of-thought reasoning, errors occur, though literature suggests failures often happen at critical reasoning steps.
  • All OpenThoughts resources, including the data, weights, and supporting libraries, are open-source and publicly available for community use.
  • The session concludes with acknowledgment of the OpenThoughts team and open invitations for further questions and collaborative development.
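The "long thinking" point from the Q&A implies a specific shape for each SFT example: the target answer embeds an extended reasoning trace before the final answer. The sketch below shows that shape; the `<think>` tag convention follows DeepSeek-R1-style distillation data, and the field names are illustrative, not any particular dataset's schema.

```python
# Illustrative shape of one SFT example for "long thinking": the assistant
# target contains a reasoning trace (here marked with <think> tags, a
# DeepSeek-R1-style convention) followed by the final answer.

example = {
    "question": "What is 17 * 24?",
    "answer": (
        "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>\n"
        "The answer is 408."
    ),
}

def to_chat_messages(ex):
    """Convert a (question, answer) pair into a chat-format training record."""
    return [
        {"role": "user", "content": ex["question"]},
        {"role": "assistant", "content": ex["answer"]},
    ]

messages = to_chat_messages(example)
```

Fine-tuning on many such records teaches the model to emit similarly lengthy reasoning before answering.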