Development of s1K

Let me break down the development of the s1K dataset step by step:

Initial Data Collection (59K samples):
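
The note does not spell out how the initial pool is assembled, so the following is only a minimal sketch: it assumes the ~59K samples come from pairing questions in existing collections with generated reasoning traces, using hypothetical helpers load_source_datasets and generate_reasoning_trace.

def load_59k_samples():
    """Illustrative only: assemble the initial ~59K question pool."""
    samples = []
    for source in load_source_datasets():  # existing question collections (hypothetical helper)
        for question in source.questions:
            # Pair each question with a generated step-by-step reasoning trace and answer
            trace, solution = generate_reasoning_trace(question)  # hypothetical helper
            samples.append({
                "question": question,
                "reasoning_trace": trace,
                "solution": solution,
            })
    return samples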

Three-Stage Filtering Process:

Stage 1: Quality Filtering
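
A minimal sketch of this stage, assuming hypothetical predicates has_api_error and has_formatting_issues that flag failed generations and broken formatting (these two passes correspond to the 54,116 and 51,581 counts in the filtering funnel summary further down):

def filter_quality(samples):
    # Drop samples whose reasoning trace failed to generate (API errors)
    no_api_errors = [s for s in samples if not has_api_error(s)]
    # Then drop samples with broken or inconsistent formatting
    return [s for s in no_api_errors if not has_formatting_issues(s)]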

Stage 2: Difficulty Filtering

Two methods were used to assess difficulty, model-based filtering and reasoning-trace length:

# Model-based difficulty filtering
def assess_difficulty(question, qwen_7b, qwen_32b, tokenizer):
    # Evaluate the question with Qwen2.5-7B and Qwen2.5-32B
    result_7b = qwen_7b.evaluate(question)
    result_32b = qwen_32b.evaluate(question)

    # Discard the question if either model solves it (too easy)
    if result_7b.correct or result_32b.correct:
        return "too_easy"

    # Otherwise, use the reasoning-trace token length as a difficulty proxy
    token_length = tokenizer.count_tokens(question.reasoning_trace)
    return token_length
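
In other words, a question only survives this stage if both Qwen2.5-7B and Qwen2.5-32B fail to solve it, and the token length of its reasoning trace is kept as a rough proxy for how much reasoning the problem demands.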

Stage 3: Diversity Selection

import random

def select_diverse_samples(questions, target_size=1000):
    # Group questions by domain using the MSC classification
    domains = classify_into_domains(questions)
    selected = []

    while len(selected) < target_size:
        # Pick a domain uniformly at random
        domain = random.choice(list(domains.keys()))
        questions_in_domain = domains[domain]

        # Sample one question from the domain, favoring longer reasoning traces
        weights = calculate_length_based_weights(questions_in_domain)
        selected_question = weighted_sample(questions_in_domain, weights)

        # Remove the chosen question so it cannot be selected twice
        questions_in_domain.remove(selected_question)
        if not questions_in_domain:
            del domains[domain]

        selected.append(selected_question)

    return selected
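
The two helpers above are not defined in this note. A minimal sketch of one possible implementation, assuming each question carries the token_length field described below and that weights simply grow in proportion to reasoning-trace length (the exact weighting used for s1K may differ):

def calculate_length_based_weights(questions_in_domain):
    """Illustrative only: weight each question by its reasoning-trace length."""
    lengths = [q.token_length for q in questions_in_domain]
    total = sum(lengths)
    return [length / total for length in lengths]

def weighted_sample(questions_in_domain, weights):
    """Draw a single question according to the given weights."""
    return random.choices(questions_in_domain, weights=weights, k=1)[0]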

Note: MSC refers to the Mathematics Subject Classification, the hierarchical taxonomy of mathematical topics maintained by Mathematical Reviews and zbMATH; it is used here to assign each question to a domain for the diversity-selection stage.
Filtering funnel summary:

Initial Dataset (59,029)

Stage 1: Quality Filtering
• Remove API errors (→ 54,116)
• Filter formatting issues (→ 51,581)
• Select high-quality samples (→ 384 initial samples)

Stage 2: Difficulty Filtering
• Model-based filtering (Qwen2.5-7B and 32B)
• Measure reasoning trace length
• Remove easy questions (→ 24,496)

Stage 3: Diversity Selection
• MSC domain classification
• Random domain selection with length-based weights
• Final selection (→ 1,000 samples)

Data Format:

Each sample contains the problem statement, its step-by-step reasoning trace, the final solution, and metadata about the domain and difficulty.

Example data structure:

from dataclasses import dataclass

@dataclass
class DataSample:
    question: str            # The problem statement
    reasoning_trace: str     # Step-by-step reasoning
    solution: str            # Final answer
    domain: str              # Mathematical domain (MSC category)
    difficulty_score: float  # Based on model performance
    token_length: int        # Length of the reasoning trace in tokens
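
For illustration only, a sample could be instantiated like this (all values below are made up, not drawn from s1K):

example = DataSample(
    question="Prove that the sum of two odd integers is even.",
    reasoning_trace="Write the integers as 2a+1 and 2b+1; their sum is 2(a+b+1), which is even.",
    solution="The sum is 2(a+b+1), hence even.",
    domain="Number theory",
    difficulty_score=0.8,
    token_length=5210,
)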

Domain Distribution:

Quality Control Measures:

def decontaminate_dataset(samples, test_sets):
    """Remove samples that overlap with evaluation sets"""
    clean_samples = []
    for sample in samples:
        if not has_ngram_overlap(sample, test_sets, n=8):
            clean_samples.append(sample)
    return clean_samples
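
The overlap test itself is left undefined above. A minimal sketch of an 8-gram check, assuming plain whitespace tokenization and that each evaluation example exposes a question string (the actual decontamination may tokenize differently):

def ngrams(text, n=8):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_ngram_overlap(sample, test_sets, n=8):
    """Return True if any n-gram of the sample's question appears in a test set."""
    sample_ngrams = ngrams(sample.question, n)
    for test_set in test_sets:
        for test_example in test_set:
            if sample_ngrams & ngrams(test_example.question, n):
                return True
    return False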

def validate_quality(sample):
    """Check various quality criteria"""
    checks = [
        not has_formatting_issues(sample),
        not has_ascii_art(sample),
        has_consistent_numbering(sample),
        reasoning_trace_is_complete(sample)
    ]
    return all(checks)

Technical Implementation Details:

The end-to-end pipeline, combining the three filtering stages with decontamination against the evaluation sets:

def create_s1k_dataset():
    # Stage 1: Quality
    initial_samples = load_59k_samples()
    quality_filtered = filter_quality(initial_samples)

    # Decontaminate against the evaluation benchmarks before the final
    # selection, so the selected pool keeps exactly 1,000 samples
    quality_filtered = decontaminate_dataset(
        quality_filtered,
        test_sets=[MATH500, GPQA, AIME24]
    )

    # Stage 2: Difficulty
    difficulty_filtered = [
        sample for sample in quality_filtered
        if is_sufficiently_difficult(sample)
    ]

    # Stage 3: Diversity
    final_samples = select_diverse_samples(
        difficulty_filtered,
        target_size=1000
    )

    return final_samples

This careful curation process ensures that the final 1,000 samples are:

  1. High quality and well-formatted
  2. Sufficiently challenging
  3. Diverse across mathematical domains
  4. Not contaminated with test data
  5. Accompanied by complete and coherent reasoning traces