1K samples vs typical 800K+
Let's analyze how they achieved strong performance with minimal data (1K samples vs. the typical 800K+). Here are the key components that made this possible:
Data Quality Over Quantity:
Their three-stage filtering process was crucial:
```python
class DataSelection:
    def quality_filter(self, samples):
        # Stage 1: Quality -- keep well-formatted samples with
        # complete, error-free reasoning
        return [s for s in samples if (
            self.is_well_formatted(s) and
            self.has_complete_reasoning(s) and
            not self.has_errors(s)
        )]

    def difficulty_filter(self, samples):
        # Stage 2: Difficulty -- keep problems the base models fail on
        # and whose reasoning traces are sufficiently long
        return [s for s in samples if (
            not self.is_solvable_by_base_models(s) and
            self.has_sufficient_reasoning_length(s)
        )]

    def diversity_filter(self, samples, target=1000):
        # Stage 3: Domain-based diversity -- sample across domains
        # until the target size is reached
        domains = self.classify_domains(samples)
        return self.weighted_domain_sampling(domains, target)
```
Critical Analysis:
- Positive: Their rigorous filtering ensures high-quality training data
- Negative: Might miss valuable but imperfectly formatted samples
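The third stage is the least self-explanatory, so here is a minimal sketch of what domain-weighted sampling could look like, assuming a `classify_domain` function and uniform round-robin weighting over domains (both are illustrative choices, not necessarily the paper's exact procedure):

```python
import random
from collections import defaultdict

def weighted_domain_sampling(samples, classify_domain, target=1000, seed=0):
    """Sketch: draw samples round-robin across domains so that no
    single domain dominates the final 1K set."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[classify_domain(s)].append(s)
    for pool in by_domain.values():
        rng.shuffle(pool)

    selected, active = [], list(by_domain)
    while len(selected) < target and active:
        for domain in list(active):
            if not by_domain[domain]:
                active.remove(domain)
                continue
            selected.append(by_domain[domain].pop())
            if len(selected) == target:
                break
    return selected
```

Round-robin is only one defensible choice here; weighting domains by size or difficulty would be equally plausible.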
High-Quality Reasoning Traces:
They used Google's Gemini Flash Thinking API to generate the reasoning traces:
```python
def generate_reasoning_trace(question):
    """Each trace follows a structured format:
    1. Problem breakdown
    2. Step-by-step solution
    3. Verification
    4. Final answer
    """
    response = gemini.flash_thinking(question)  # illustrative client call
    return {
        'question': question,
        'reasoning': response.thinking,
        'solution': response.answer,
    }
```
Critical Analysis:
- Positive: Consistent, high-quality reasoning patterns
- Negative: Potential bias from single source (Gemini)
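One way to address the single-source concern (anticipating the "diversify reasoning trace sources" suggestion below) is to rotate over several trace generators. This is a hypothetical extension, not something the authors did; `trace_generators` stands in for any set of clients with the same call signature as `generate_reasoning_trace`:

```python
import random

def generate_mixed_traces(questions, trace_generators, seed=0):
    """Hypothetical mitigation: alternate between trace generators so
    no single model's reasoning style dominates the dataset."""
    rng = random.Random(seed)
    traces = []
    for question in questions:
        generate = rng.choice(trace_generators)
        trace = generate(question)
        trace["source"] = getattr(generate, "__name__", "unknown")
        traces.append(trace)
    return traces
```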
Training Methodology:
```python
class TrainingConfig:
    def __init__(self):
        self.base_model = "Qwen2.5-32B-Instruct"
        self.training_time = "26 minutes"
        self.gpu_count = 16        # H100 GPUs
        self.batch_size = 16
        self.learning_rate = 1e-5
        self.warmup_ratio = 0.05   # 5% of total steps
        self.total_steps = 315
```
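As a quick sanity check, the step count is consistent with about five passes over the 1K set at batch size 16 (the epoch count is inferred from these numbers, not stated here):

```python
import math

samples, batch_size, total_steps = 1000, 16, 315
steps_per_epoch = math.ceil(samples / batch_size)  # 63
print(total_steps / steps_per_epoch)               # 5.0 epochs
```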
Mathematical formulation for their training (standard supervised fine-tuning, i.e. next-token negative log-likelihood):

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p_\theta(r_i, a_i \mid q_i)$$

where:
- $N = 1000$ (sample size)
- $r_i, a_i$ = reasoning trace and answer
- $q_i$ = question
- $\theta$ = model parameters
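A minimal sketch of this objective in PyTorch/Hugging Face terms: the question tokens are masked with -100 so cross-entropy is computed only over the reasoning trace and answer. The model name is a small stand-in so the snippet runs on modest hardware, not the 32B model they actually fine-tuned:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in for Qwen2.5-32B-Instruct
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sft_loss(question, reasoning, answer):
    """-log p_theta(r, a | q): loss over trace+answer tokens only."""
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    ra_ids = tokenizer(reasoning + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([q_ids, ra_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : q_ids.shape[1]] = -100  # ignore question positions
    return model(input_ids=input_ids, labels=labels).loss
```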
Critical Analysis:
- Positive: Simple, straightforward fine-tuning approach
- Negative: Limited exploration of training strategies
Performance Comparison:
```python
models_comparison = {
    "s1-32B (1K samples)": {
        "AIME24": 56.7,
        "MATH500": 93.0,
        "GPQA": 59.6,
    },
    "r1-distill (800K samples)": {
        "AIME24": 72.6,
        "MATH500": 94.3,
        "GPQA": 62.1,
    },
}
```
Key Factors for Efficiency:
Data Selection Impact:
```python
ablation_results = {
    "random_1K": {"AIME24": 36.7},
    "diverse_only_1K": {"AIME24": 26.7},
    "difficult_only_1K": {"AIME24": 33.3},
    "s1K (all criteria)": {"AIME24": 50.0},
}
```
Combining all three criteria beats the best single criterion (random selection, 36.7) by 13.3 points on AIME24, so no individual filter is sufficient on its own.
Sample Efficiency Analysis:
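Using the comparison numbers above, one rough way to quantify sample efficiency (this framing is mine, not the paper's) is score retention per fraction of training data:

```python
s1_samples, r1_samples = 1_000, 800_000
s1 = {"AIME24": 56.7, "MATH500": 93.0, "GPQA": 59.6}
r1 = {"AIME24": 72.6, "MATH500": 94.3, "GPQA": 62.1}

data_fraction = s1_samples / r1_samples  # 0.125% of the data
for bench in s1:
    print(f"{bench}: {s1[bench] / r1[bench]:.1%} of r1-distill's score "
          f"with {data_fraction:.3%} of its data")
```

On MATH500 (98.6%) and GPQA (96.0%) the gap nearly closes; AIME24 (78.1%) is where the 800x data advantage still shows.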
Critical Components That Led to Success:
Data Quality:
- Rigorous three-stage filtering
- Focus on difficulty and diversity
- High-quality reasoning traces
Model Architecture:
- Strong base model (Qwen2.5-32B-Instruct) with broad existing capabilities
- Full model capacity maintained through fine-tuning
- Simple, effective fine-tuning strategy
Training Strategy:
- Short but focused training (26 minutes)
- Appropriate learning rate and batch size
- No complex training schemes
Critical Analysis of Their Approach:
Strengths:
- Sample Efficiency: Achieved competitive results with 0.125% of typical data
- Training Efficiency: Only 26 minutes on 16 GPUs
- Reproducibility: Simple, straightforward methodology
Weaknesses:
- Dependence on Gemini API for reasoning traces
- Limited exploration of alternative training strategies
- Potential brittleness due to small dataset
- May not generalize well to other domains
Areas for Improvement:
- Diversify reasoning trace sources
- Explore data augmentation techniques
- Investigate more sophisticated training methods
- Test robustness across more domains
Key Insight:
The key insight from their work is that carefully curated, high-quality data can be more valuable than large quantities of noisy data. Their success appears to come from:
- Quality of reasoning traces (Gemini API)
- Careful data selection process
- Simple but effective training approach
- Strong base model selection
This suggests that the field might be overemphasizing data quantity over quality in some cases. However, their approach also raises questions about scalability and generalizability that would need to be addressed in future work.