OpenThoughts 3, the latest version of the reasoning data set, was released just hours before this talk.
The data set shows improved accuracy across domains: competitive math (AIME), competitive coding (LiveCodeBench), and science (GPQA Diamond).
Compared head-to-head against other open reasoning data sets like Nemotron Nano (NVIDIA), OpenThoughts’ data recipe shifts the scaling curve upward, improving results at similar data/model sizes.
OpenThoughts 3’s 7B model outperforms both DeepSeek-R1-Distill-Qwen-7B and Nemotron Nano on several benchmarks.
Building the Data Recipe: Process and Key Learnings 06:26
The data pipeline consists of: sourcing questions, mixing sources, filtering questions, using teacher models for answer generation (distillation), further filtering, and teacher model selection.
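A minimal sketch of those stages in Python. The function names, filter callables, and the teacher-model call are hypothetical placeholders, not the project's actual code; they only illustrate the order of operations described in the talk.

```python
from typing import Callable

# Illustrative sketch of the pipeline stages described above (sourcing, mixing,
# filtering, distillation, post-filtering). All names here are hypothetical.

def build_reasoning_dataset(
    question_sources: list[list[str]],       # one list of questions per source
    generate: Callable[[str], str],          # teacher model call: question -> reasoning trace
    keep_question: Callable[[str], bool],    # question-level filter (difficulty, dedup, ...)
    keep_response: Callable[[str], bool],    # response-level filter (length, formatting, ...)
    answers_per_question: int = 4,
) -> list[dict]:
    # 1. Source and mix questions from all origins.
    questions = [q for source in question_sources for q in source]
    # 2. Deduplicate and filter questions before paying for teacher generations.
    questions = [q for q in dict.fromkeys(questions) if keep_question(q)]
    # 3. Distill: sample several reasoning traces per question from the teacher.
    examples = [
        {"question": q, "response": generate(q)}
        for q in questions
        for _ in range(answers_per_question)
    ]
    # 4. Filter the generated traces.
    return [ex for ex in examples if keep_response(ex["response"])]
```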
Over a thousand experiments were performed to empirically select optimal steps at every stage.
The team rigorously iterated on each pipeline component, revealing key strategies for building high-quality reasoning data.
Sampling multiple answers (reasoning traces) per question greatly enhances performance; scaling the number of answers yields significant accuracy gains.
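A minimal sketch of multi-answer sampling, assuming the teacher is served behind an OpenAI-compatible endpoint; the base URL and model name are placeholders.

```python
from openai import OpenAI

# Sketch: sample several reasoning traces per question from a teacher model
# served behind an OpenAI-compatible API. Endpoint and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def sample_traces(question: str, k: int = 4, model: str = "teacher-model") -> list[str]:
    # temperature > 0 so the k samples differ; n=k requests k completions at once.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
        n=k,
    )
    return [choice.message.content for choice in resp.choices]
```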
A model's benchmark performance does not strictly predict its quality as a teacher; for example, Qwen's QwQ-32B outperformed DeepSeek R1 as a teacher despite being smaller.
Synthetic question sources can outperform human-written or forum-scraped questions, and are highly scalable.
Filtering for question difficulty—by LM-predicted difficulty or by response length—improves data quality more effectively than embeddings-based or classifier-based methods.
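A sketch of the response-length proxy for difficulty: keep the questions whose teacher responses are longest, on the assumption that longer traces signal harder problems. The keep fraction is illustrative, not a value from the talk.

```python
# Sketch of length-based difficulty filtering over distilled examples
# shaped like {"question": ..., "response": ...}.
def filter_by_response_length(examples: list[dict], keep_fraction: float = 0.5) -> list[dict]:
    ranked = sorted(examples, key=lambda ex: len(ex["response"]), reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]
```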
Focusing on a smaller number of high-quality sources is superior to maximizing source diversity.
Contrary to common belief, answer verification (filtering on correctness) did not improve SFT/distillation outcomes, possibly because value remains in hard questions with imperfect answers.
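For concreteness, the kind of verification filter that did not help would look roughly like the toy sketch below: extract a final answer from the trace and drop examples that disagree with a reference. The extraction regex and field names are illustrative only.

```python
import re

# Toy sketch of answer verification (the filter that did not improve SFT results):
# extract a \boxed{...} final answer from the trace and compare it to a reference.
def extract_final_answer(trace: str) -> str | None:
    match = re.search(r"\\boxed\{([^}]*)\}", trace)
    return match.group(1).strip() if match else None

def verified(example: dict) -> bool:
    predicted = extract_final_answer(example["response"])
    return predicted is not None and predicted == example["reference_answer"].strip()
```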
Recommendations for Specialized Reasoning Models 11:44
For new domains, start with the OpenThoughts recipe and adapt based on the domain’s specifics.
Test and iterate on specific pipeline steps for your data type, as optimal strategies vary: for instance, filtering by response length works well for science and math, while filtering by difficulty works better for code.
Synthetic data generation is recommended for domain expansion if native data is lacking; the Curator open-source library is offered for this purpose.
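A generic sketch of synthetic question generation by prompting an LLM with a few seed questions from the target domain. The prompt wording and model name are illustrative, and this does not use Curator's actual API; per the talk, Curator wraps this kind of generation for use at scale.

```python
from openai import OpenAI

# Sketch: generate new domain questions by showing an LLM a few seed examples.
client = OpenAI()

def generate_synthetic_questions(seed_questions: list[str], n: int = 10,
                                 model: str = "gpt-4o-mini") -> list[str]:
    prompt = (
        "Here are example questions from a target domain:\n"
        + "\n".join(f"- {q}" for q in seed_questions)
        + f"\n\nWrite {n} new, harder questions in the same style, one per line."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    # Split the completion into individual questions, dropping blank lines and bullets.
    return [line.strip("- ").strip()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()]
```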
Rigorous evaluation is essential; use the Evalchemy library for structured, repeatable evaluations.
For small eval sets, run multiple passes and average results to minimize noise and more accurately assess data or model changes.
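A minimal sketch of that averaging, assuming an evaluation callable that returns accuracy for a given seed (the callable and its signature are hypothetical).

```python
import statistics

# Sketch: repeat a noisy evaluation several times and report mean accuracy with
# a standard error, so small data/model changes are not mistaken for noise.
def averaged_accuracy(run_eval, n_runs: int = 5) -> tuple[float, float]:
    scores = [run_eval(seed=i) for i in range(n_runs)]  # run_eval returns accuracy in [0, 1]
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / (len(scores) ** 0.5) if len(scores) > 1 else 0.0
    return mean, stderr
```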
Distillation can sometimes surpass the teacher model in specific domains, as demonstrated in legal reasoning benchmarks.
SFT enables "long thinking" by fine-tuning on questions whose answers include extended reasoning traces; the model learns to generate similar, lengthy outputs.
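A sketch of how one such SFT example can be laid out: the assistant turn the model is trained to reproduce contains the full reasoning trace before the final answer. The <think> tags and message format are illustrative conventions, not necessarily the ones OpenThoughts uses.

```python
# Sketch: build one SFT record whose target output carries "long thinking".
def to_sft_example(question: str, reasoning_trace: str, final_answer: str) -> dict:
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant",
             "content": f"<think>\n{reasoning_trace}\n</think>\n{final_answer}"},
        ]
    }
```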
A model's utility as a teacher stems from factors like reasoning trace length and output formatting, not just raw capability.
Current analysis does not identify precisely where, in chain-of-thought reasoning, errors occur, though literature suggests failures often happen at critical reasoning steps.
All OpenThoughts resources, including the data, weights, and supporting libraries, are open-source and publicly available for community use.
The session concludes with acknowledgment of the OpenThoughts team and open invitations for further questions and collaborative development.