s1: Simple test-time scaling (Outline)

Core Objective

Develop a simple approach to achieve test-time scaling and strong reasoning performance
Replicate the capabilities of OpenAI's o1 model with a simpler methodology
Achieve better sample efficiency than existing approaches

Main Contributions

Development of s1K: A carefully curated dataset of 1,000 questions with reasoning traces
Introduction of budget forcing: A simple technique to control test-time compute
Strong performance with minimal training data (1,000 samples versus the 800K+ used by comparable approaches)

Key Components

a) Data Curation (s1K)

Three main criteria: difficulty, diversity, and quality
Started with 59K questions, filtered down to 1,000
Spans 50 different domains
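
The three criteria can be pictured as successive filters over the question pool. The sketch below is only an illustration under stated assumptions: the `Question` fields `well_formatted` and `solved_by_baseline`, the `curate` function, and the round-robin sampling are hypothetical stand-ins for quality, difficulty, and diversity, not the procedure used to build s1K.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Question:
    text: str
    domain: str
    well_formatted: bool      # hypothetical quality flag
    solved_by_baseline: bool  # hypothetical difficulty proxy: an easy question


def curate(pool: List[Question], target: int, seed: int = 0) -> List[Question]:
    """Filter a pool by quality and difficulty, then sample across domains."""
    rng = random.Random(seed)

    # Quality: drop malformed entries.
    kept = [q for q in pool if q.well_formatted]
    # Difficulty: drop questions a baseline already solves.
    kept = [q for q in kept if not q.solved_by_baseline]

    # Diversity: group by domain, then draw round-robin so that no single
    # domain dominates the final selection.
    by_domain: Dict[str, List[Question]] = defaultdict(list)
    for q in kept:
        by_domain[q.domain].append(q)
    for questions in by_domain.values():
        rng.shuffle(questions)

    selected: List[Question] = []
    while len(selected) < target and by_domain:
        for domain in list(by_domain):
            if len(selected) == target:
                break
            selected.append(by_domain[domain].pop())
            if not by_domain[domain]:
                del by_domain[domain]
    return selected
```

Calling `curate(pool, target=1000)` on a pool of roughly 59K entries would mirror the reduction described above; if fewer than `target` questions survive the filters, the function returns all survivors.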

b) Budget Forcing Technique

Controls test-time compute by:
- Forcefully terminating thinking once the compute budget is reached
- Extending thinking by appending "Wait" when the model tries to stop too early
Allows extrapolation beyond the model's performance without test-time intervention
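
The two controls above can be sketched as a small decoding loop. This is a minimal illustration, not the authors' implementation: `step` is a hypothetical stand-in for the model (it returns the next token given the tokens so far), `END_THINK` is an assumed end-of-thinking marker, and the budget is counted in toy tokens.

```python
from typing import Callable, List

# Hypothetical end-of-thinking marker; the real delimiter depends on the
# model's chat template.
END_THINK = "<end_think>"


def budget_forced_thinking(
    step: Callable[[List[str]], str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Generate thinking tokens while enforcing a minimum and maximum budget."""
    tokens: List[str] = []
    while len(tokens) < max_tokens:
        nxt = step(tokens)
        if nxt == END_THINK:
            if len(tokens) >= min_tokens:
                break  # the model stopped on its own and the minimum is met
            # Too early: suppress the stop and append "Wait" so it continues.
            nxt = "Wait"
        tokens.append(nxt)
    # Either the model stopped or the cap was hit: close the thinking phase.
    tokens.append(END_THINK)
    return tokens


def toy_step(tokens: List[str]) -> str:
    """Toy 'model' that tries to stop after every third position."""
    return END_THINK if len(tokens) % 3 == 2 else "t"


if __name__ == "__main__":
    print(budget_forced_thinking(toy_step, min_tokens=5, max_tokens=10))
```

Raising `min_tokens` forces more "Wait" continuations, and lowering `max_tokens` cuts thinking short, so the same loop covers both directions of control.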

Results & Performance

Outperforms o1-preview on competition math questions by up to 27% (MATH and AIME24)
Shows clear test-time scaling behavior
Achieves state-of-the-art sample efficiency
Budget forcing raises AIME24 accuracy from 50% to 57%

Methodology Validation

Extensive ablation studies on both data selection and test-time scaling
Demonstrates the importance of combining all three data criteria (difficulty, diversity, quality)
Shows that budget forcing is more effective than the alternative test-time scaling methods compared

Technical Implementation

Uses Qwen2.5-32B-Instruct as the base model
Training takes only 26 minutes on 16 H100 GPUs
Evaluated on three main benchmarks: AIME24, MATH500, and GPQA Diamond