Development of s1K
Let me break down the development of the s1K dataset step by step:
Initial Data Collection (59K samples):
- Started with 59,029 questions from 16 diverse sources
- Main data sources included:
  - NuminaMATH (30,660 problems)
  - MATH (competition problems)
  - OlympicArena (4,250 questions spanning multiple sciences)
  - OmniMath (4,238 competition math problems)
  - AGIEval (2,385 problems)
- The authors also created two new datasets of their own:
  - s1-prob: 182 questions from Stanford Statistics PhD qualifying exams
  - s1-teasers: 23 challenging brain-teasers
Three-Stage Filtering Process:
Stage 1: Quality Filtering
- Removed questions with API errors (reduced to 54,116 samples)
- Filtered out formatting issues, ASCII art, etc. (reduced to 51,581)
- Identified 384 high-quality samples from trusted datasets
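A minimal sketch of this quality-filtering stage is below; the helper predicates (has_api_error, has_formatting_issues, has_ascii_art) are hypothetical names used for illustration, not the paper's actual code:
def filter_quality(samples):
    # Drop questions with API errors (59,029 -> 54,116 samples)
    samples = [s for s in samples if not has_api_error(s)]
    # Drop questions with formatting issues, ASCII art, etc. (-> 51,581 samples)
    samples = [s for s in samples
               if not has_formatting_issues(s) and not has_ascii_art(s)]
    # Note: 384 samples from trusted, high-quality sources are also tracked separately
    return samples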
Stage 2: Difficulty Filtering
Used two methods to assess difficulty:
# Model-based difficulty filtering
def assess_difficulty(question):
    # Have both Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct attempt the question
    result_7b = qwen2_5_7b.evaluate(question)
    result_32b = qwen2_5_32b.evaluate(question)
    # Remove the question if either model can solve it (too easy)
    if result_7b.correct or result_32b.correct:
        return "too_easy"
    # Otherwise, use the reasoning-trace length (in Qwen2.5 tokens) as a difficulty proxy
    token_length = qwen2_5_tokenizer.count_tokens(question.reasoning_trace)
    return token_length
Stage 3: Diversity Selection
import random

def select_diverse_samples(questions, target_size=1000):
    # Group questions by domain, using the MSC system (see note below)
    domains = classify_into_domains(questions)
    selected = []
    while len(selected) < target_size:
        # Randomly choose a non-empty domain
        domain = random.choice([d for d, qs in domains.items() if qs])
        questions_in_domain = domains[domain]
        # Sample from the domain, favoring longer reasoning traces
        weights = calculate_length_based_weights(questions_in_domain)
        selected_question = weighted_sample(questions_in_domain, weights)
        # Remove the chosen question so it cannot be selected twice
        questions_in_domain.remove(selected_question)
        selected.append(selected_question)
    return selected
Note: MSC refers to the Mathematics Subject Classification, the standard taxonomy used by the American Mathematical Society to categorize mathematical topics; Claude 3.5 Sonnet was used to assign each question to an MSC domain.
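The helpers classify_into_domains, calculate_length_based_weights, and weighted_sample are left undefined above. A minimal sketch follows; llm_classify is a hypothetical stand-in for the Claude 3.5 Sonnet classification prompt, and weighting each question by its trace length is only one illustrative way to favor longer reasoning traces:
from collections import defaultdict
import random

def classify_into_domains(questions):
    # Group questions by MSC domain; llm_classify stands in for a prompt to
    # Claude 3.5 Sonnet that returns an MSC category name for a question.
    domains = defaultdict(list)
    for q in questions:
        domains[llm_classify(q.question)].append(q)
    return domains

def calculate_length_based_weights(questions_in_domain):
    # Illustrative choice: weight each question by its reasoning-trace length,
    # so longer (presumably harder) traces are sampled more often.
    return [q.token_length for q in questions_in_domain]

def weighted_sample(questions_in_domain, weights):
    # Draw a single question according to the given weights.
    return random.choices(questions_in_domain, weights=weights, k=1)[0]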
Data Format:
Each sample contains:
- Question
- Reasoning trace (generated by Gemini)
- Solution
Example data structure:
from dataclasses import dataclass

@dataclass
class DataSample:
    question: str            # The problem statement
    reasoning_trace: str     # Step-by-step reasoning (generated by Gemini)
    solution: str            # Final answer
    domain: str              # Mathematical (MSC) domain
    difficulty_score: float  # Based on model performance
    token_length: int        # Length of the reasoning trace in Qwen2.5 tokens
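For illustration only, a sample could be constructed like this (all field values below are dummies, not drawn from s1K):
example = DataSample(
    question="<problem statement>",
    reasoning_trace="<step-by-step reasoning generated by Gemini>",
    solution="<final answer>",
    domain="Number theory",
    difficulty_score=0.9,
    token_length=4096,
)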
Domain Distribution:
- Geometry: 109 questions
- Number theory: 98 questions
- Combinatorics: 75 questions
- Real functions: 43 questions
- Biology: 41 questions
And so on, across 50 domains in total
Quality Control Measures:
def decontaminate_dataset(samples, test_sets):
    """Remove samples that overlap with the evaluation sets."""
    clean_samples = []
    for sample in samples:
        if not has_ngram_overlap(sample, test_sets, n=8):
            clean_samples.append(sample)
    return clean_samples

def validate_quality(sample):
    """Check various quality criteria."""
    checks = [
        not has_formatting_issues(sample),
        not has_ascii_art(sample),
        has_consistent_numbering(sample),
        reasoning_trace_is_complete(sample),
    ]
    return all(checks)
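The has_ngram_overlap helper used in decontaminate_dataset is not defined above. A minimal sketch, assuming each test set is an iterable of question strings and simple whitespace tokenization (the actual implementation may tokenize differently):
def has_ngram_overlap(sample, test_sets, n=8):
    # Collect all n-grams of whitespace tokens from a piece of text.
    def ngrams(text):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    sample_ngrams = ngrams(sample.question)
    for test_set in test_sets:
        for test_question in test_set:
            # Any shared 8-gram counts as contamination.
            if sample_ngrams & ngrams(test_question):
                return True
    return False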
Technical Implementation Details:
- Used Claude 3.5 Sonnet for domain classification
- Generated reasoning traces using Google Gemini Flash Thinking API
- Implemented 8-gram overlap checking for decontamination
- Used Qwen2.5 tokenizer for length measurements
The final selection algorithm:
def create_s1k_dataset():
    # Stage 1: Quality filtering (59,029 -> 51,581 samples)
    initial_samples = load_59k_samples()
    quality_filtered = filter_quality(initial_samples)

    # Stage 2: Difficulty filtering (drop anything either Qwen2.5 model solves)
    difficulty_filtered = []
    for sample in quality_filtered:
        if is_sufficiently_difficult(sample):
            difficulty_filtered.append(sample)

    # Stage 3: Diversity selection across MSC domains
    final_samples = select_diverse_samples(
        difficulty_filtered,
        target_size=1000,
    )

    # Final decontamination against the evaluation benchmarks
    final_samples = decontaminate_dataset(
        final_samples,
        test_sets=[MATH500, GPQA, AIME24],
    )
    return final_samples
This careful curation process ensures that the final 1,000 samples are:
- High quality and well-formatted
- Sufficiently challenging
- Diverse across mathematical domains
- Not contaminated with test data
- Paired with complete and coherent reasoning traces