s1: Simple test-time scaling (Outline)
Core Objective
- Develop a simple approach to achieve test-time scaling and strong reasoning performance
- Replicate the test-time scaling behavior of OpenAI's o1 model with a simpler methodology
- Achieve better sample efficiency than existing approaches
Main Contributions
- Development of s1K: A carefully curated dataset of 1,000 questions with reasoning traces
- Introduction of budget forcing: A simple technique to control test-time compute
- Achievement of strong performance with minimal training data (1K samples vs typical 800K+)
Key Components
a) Data Curation (s1K)
- Three main criteria: difficulty, diversity, and quality
- Started with 59K questions, filtered down to 1,000
- Spans 50 different domains
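The three criteria above can be read as a staged filter. The following is a minimal sketch, not the authors' pipeline: the record fields (`well_formatted`, `solved_by_reference_model`, `domain`, `trace_len`) and the function name `curate` are hypothetical, and picking the longest remaining trace within a sampled domain is a simplification chosen for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def curate(pool: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Select up to k questions via quality, difficulty, then diversity (illustrative only)."""
    rng = random.Random(seed)

    # Quality: drop records flagged as malformed (hypothetical field).
    clean = [q for q in pool if q["well_formatted"]]

    # Difficulty: drop questions a reference model already solves (hypothetical field).
    hard = [q for q in clean if not q["solved_by_reference_model"]]

    # Diversity: bucket by domain, sorted so the longest reasoning trace is last.
    by_domain: Dict[str, List[Dict]] = defaultdict(list)
    for q in hard:
        by_domain[q["domain"]].append(q)
    for questions in by_domain.values():
        questions.sort(key=lambda q: q["trace_len"])

    selected: List[Dict] = []
    while len(selected) < k and by_domain:
        # Sample a domain uniformly, then take its longest remaining trace.
        domain = rng.choice(sorted(by_domain))
        selected.append(by_domain[domain].pop())
        if not by_domain[domain]:
            del by_domain[domain]
    return selected


# Toy pool of 12 records across 3 domains (made-up data for illustration).
pool = [
    {
        "id": i,
        "domain": d,
        "trace_len": 100 * i,
        "well_formatted": i % 5 != 0,
        "solved_by_reference_model": i % 3 == 0,
    }
    for i, d in enumerate(["math", "physics", "biology"] * 4)
]
subset = curate(pool, k=4)
```

Sampling the domain first, rather than sampling questions directly, keeps large domains from crowding out small ones in the final subset.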
b) Budget Forcing Technique
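- Controls test-time compute in two ways:
  - Forcefully terminating the model's thinking once the budget is reached
  - Extending thinking by appending "Wait" when the model tries to stop early
- Lets performance extrapolate beyond what the model reaches without intervention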
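The control loop behind budget forcing can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `next_token` is a hypothetical callable standing in for one decoding step, `END_THINK` is a placeholder for the model's end-of-thinking delimiter, and tokens are plain strings.

```python
from typing import Callable, List

END_THINK = "<|end_think|>"  # hypothetical end-of-thinking delimiter
WAIT = "Wait"


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Generate thinking tokens while enforcing a [min_tokens, max_tokens] budget."""
    thinking: List[str] = []
    while len(thinking) < max_tokens:
        token = next_token(prompt + thinking)
        if token == END_THINK:
            if len(thinking) >= min_tokens:
                break  # the model stopped on its own, within budget
            # Stopped too early: suppress the delimiter and append "Wait" instead.
            thinking.append(WAIT)
            continue
        thinking.append(token)
    # The model stopped or the budget ran out: close the thinking phase.
    thinking.append(END_THINK)
    return thinking


def toy_model(context: List[str]) -> str:
    """Hypothetical stand-in for a model: tries to stop after every third token."""
    n = len(context) - 1  # tokens generated after the one-token prompt
    return END_THINK if n % 4 == 3 else f"t{n}"


truncated = budget_forced_thinking(toy_model, ["Q"], min_tokens=0, max_tokens=2)
extended = budget_forced_thinking(toy_model, ["Q"], min_tokens=6, max_tokens=20)
```

Here `truncated` is cut off after two tokens and closed with the delimiter, while `extended` has its first stop attempt replaced by "Wait" so the toy model keeps going until the minimum is met.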
Results & Performance
- Exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24)
- Shows clear test-time scaling behavior
- Achieves state-of-the-art sample efficiency
- Improves s1-32B from 50% to 57% on AIME24 when scaled with budget forcing
Methodology Validation
- Extensive ablation studies on both data selection and test-time scaling
- Demonstrates importance of combining all three data criteria
- Shows effectiveness of budget forcing compared to other approaches
Technical Implementation
- Uses Qwen2.5-32B-Instruct as base model
- Training takes only 26 minutes on 16 H100 GPUs
- Tests on three main benchmarks: AIME24, MATH500, and GPQA Diamond
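As a quick sanity check on the training cost quoted above, the wall-clock time and GPU count convert to total GPU-hours as follows (variable names are illustrative):

```python
# Convert the stated training run (26 minutes on 16 H100s) into GPU-hours.
minutes = 26
gpus = 16
gpu_hours = minutes * gpus / 60
print(f"{gpu_hours:.1f} H100 GPU-hours")  # prints: 6.9 H100 GPU-hours
```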