OpenThoughts: Data Recipes for Reasoning Models
Jun 4, 2025
Etash Guha
Ryan Marten
Sedrick Keh
Negin Raoof
Georgios Smyrnis
Hritik Bansal
Marianna Nezhurina
Jean Mercat
Trung Vu
Zayne Sprague
Ashima Suvarna
Benjamin Feuer
Liangyu Chen
Zaid Khan
Eric Frankel
Sachin Grover
Caroline Choi
Niklas Muennighoff
Shiye Su
Wanjia Zhao
John Yang
Shreyas Pimpalgaonkar
Kartik Sharma
Charlie Cheng-Jie Ji
Yichuan Deng
Sarah Pratt
Vivek Ramanujan
Jon Saad-Falcon
Jeffrey Li
Achal Dave
Alon Albalak
Kushal Arora
Blake Wulfe
Chinmay Hegde
Greg Durrett
Sewoong Oh
Mohit Bansal
Saadia Gabriel
Aditya Grover
Kai-Wei Chang
Vaishaal Shankar
Aaron Gokaslan
Mike A. Merrill
Tatsunori Hashimoto
Yejin Choi
Jenia Jitsev
Reinhard Heckel
Maheswaran Sathiamoorthy
Alexandros G. Dimakis
Ludwig Schmidt
OpenThoughts: Open-source reasoning datasets and models achieving state-of-the-art results.

Abstract
Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improved our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as the teacher yields our OpenThoughts3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond, improvements of 15.3, 17.2, and 20.5 percentage points over DeepSeek-R1-Distill-Qwen-7B. All of our datasets and models are available at https://openthoughts.ai.
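The abstract describes the recipe only at a high level: source questions, annotate them with a strong teacher model (QwQ-32B), and filter the results before supervised fine-tuning. The sketch below illustrates one way such a distillation loop could look, assuming an OpenAI-compatible endpoint (e.g., a local vLLM server); the endpoint URL, sampling parameters, placeholder questions, and the `keep` filter are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a teacher-distillation data pipeline:
# source questions -> generate reasoning traces with a teacher model ->
# keep traces that pass a simple answer check -> emit SFT examples.
from openai import OpenAI

# Assumed: QwQ-32B served behind an OpenAI-compatible API (e.g., vLLM).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def annotate(question: str) -> str:
    """Ask the teacher for a full reasoning trace plus final answer."""
    resp = client.chat.completions.create(
        model="Qwen/QwQ-32B",
        messages=[{"role": "user", "content": question}],
        temperature=0.7,   # illustrative sampling settings
        max_tokens=16384,
    )
    return resp.choices[0].message.content

def keep(trace: str, reference: str | None) -> bool:
    """Toy filter: if a reference answer exists, require it in the trace."""
    return reference is None or reference in trace

# Placeholder question pool; the real pipeline draws from curated sources.
questions = [
    {"question": "What is 17 * 24?", "answer": "408"},
]

sft_data = []
for q in questions:
    trace = annotate(q["question"])
    if keep(trace, q.get("answer")):
        sft_data.append({"messages": [
            {"role": "user", "content": q["question"]},
            {"role": "assistant", "content": trace},
        ]})
```

In practice each stage (question sourcing, teacher choice, and filtering) is where the paper's 1,000+ controlled experiments come in; the loop above only shows the overall shape of the pipeline.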
Type: Publication
Publication: arXiv preprint