Introduction to budget forcing

Let me break down the budget forcing technique in detail:

Core Concept of Budget Forcing:

The technique has two main operations:

a) Maximum Token Control:

if tokens > limit: append(END_TOKEN)

b) Minimum Token Extension:

if tokens < target: append("Wait")

Mathematical Formulation:

For evaluation metrics, they define three key components:

a) Control Metric:

$\text{Control} = \frac{1}{|A|} \sum_{a \in A} \mathbb{1}(a_{\min} \le a \le a_{\max})$

Where:

- A is the set of evaluated test-time compute budgets (e.g. thinking-token counts)
- a_min and a_max are the pre-specified minimum and maximum amounts of compute
- 1(·) is the indicator function, so Control is the fraction of runs whose compute stays within the bounds

b) Scaling Metric:

$\text{Scaling} = \frac{1}{\binom{|A|}{2}} \sum_{\substack{a, b \in A \\ b > a}} \frac{f(b) - f(a)}{b - a}$

Where:

- f(a) is the benchmark accuracy obtained when running with compute budget a
- Scaling is therefore the average slope of performance over all budget pairs with b > a; positive values mean more thinking helps

c) Performance Metric:

$\text{Performance} = \max_{a \in A} f(a)$
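
To make these definitions concrete, here is a minimal sketch (my own code, not the paper's evaluation harness) that computes the three metrics from a list of (compute budget, accuracy) measurements. The function name evaluate_metrics and the sample numbers are illustrative assumptions.

from itertools import combinations

def evaluate_metrics(runs, a_min, a_max):
    """Compute the Control, Scaling and Performance metrics from evaluation runs.

    runs:          list of (a, f_a) pairs, where a is the measured test-time
                   compute (e.g. thinking tokens) and f_a the accuracy there
    a_min, a_max:  pre-specified compute bounds used by the Control metric
    """
    scores = dict(runs)                     # budget -> accuracy
    budgets = sorted(scores)

    # Control: fraction of runs whose compute stays inside [a_min, a_max]
    control = sum(a_min <= a <= a_max for a in budgets) / len(budgets)

    # Scaling: average slope of accuracy over all budget pairs with b > a
    pairs = list(combinations(budgets, 2))  # (a, b) with a < b
    scaling = sum((scores[b] - scores[a]) / (b - a) for a, b in pairs) / len(pairs)

    # Performance: best accuracy reached across all budgets
    performance = max(scores.values())
    return control, scaling, performance

# Illustrative numbers only, not results from the paper
print(evaluate_metrics([(500, 0.30), (2000, 0.43), (4000, 0.50), (8000, 0.57)],
                       a_min=0, a_max=32000))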

Implementation Details:

class BudgetForcing:
    def __init__(self, min_tokens=None, max_tokens=None):
        self.min_tokens = min_tokens
        self.max_tokens = max_tokens
        
    def control_generation(self, current_tokens, current_text):
        # Check maximum limit
        if self.max_tokens and current_tokens >= self.max_tokens:
            return self._force_end()
            
        # Check minimum requirement
        if self.min_tokens and current_tokens < self.min_tokens:
            if self._is_trying_to_end(current_text):
                return self._append_wait()
                
        return None
        
    def _force_end(self):
        return "\nFinal Answer:"
        
    def _append_wait(self):
        return "\nWait"
        
    def _is_trying_to_end(self, text):
        return text.endswith("Final Answer:") or text.endswith("Therefore,")
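
As a hedged usage sketch (not the paper's inference code), the class can drive an incremental decoding loop. generate_next_chunk below is a mock standing in for a real decoder call, and the budgets are made-up values:

# Mock decoder: emits a reasoning chunk first, then an answer ending with <END>.
def generate_next_chunk(context):
    return " step...\nTherefore," if "Wait" not in context else " the answer is 42.\n<END>"

forcer = BudgetForcing(min_tokens=50, max_tokens=4000)
text, n_tokens = "", 0

while "<END>" not in text:
    chunk = generate_next_chunk(text)
    text += chunk
    n_tokens += len(chunk.split())              # crude token count, fine for a sketch

    intervention = forcer.control_generation(n_tokens, text)
    if intervention is not None:
        text += intervention                     # inject "\nWait" or "\nFinal Answer:"
        if intervention == "\nFinal Answer:":    # budget exhausted: stop thinking
            break

print(text)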

Test-time Scaling Behavior:

The relationship between compute and performance can be modeled as:

$P(c) = P_0 + \alpha \log(c / c_0) + \epsilon$

Where:

- P(c) is performance at inference compute c
- P_0 is the baseline performance at a reference compute level c_0
- α is the scaling coefficient (performance gained per log-unit of extra compute)
- ε is a noise term
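
As a hedged illustration of fitting this model (my own sketch; the data points, the reference compute c_0, and the use of ordinary least squares are assumptions, not taken from the paper):

import numpy as np

# Illustrative (compute, accuracy) points; fit P(c) = P0 + alpha * log(c / c0)
c0 = 500.0                                   # assumed reference compute
compute = np.array([500, 1000, 2000, 4000, 8000], dtype=float)
perf    = np.array([0.30, 0.38, 0.44, 0.50, 0.57])

X = np.column_stack([np.ones_like(compute), np.log(compute / c0)])
(P0, alpha), *_ = np.linalg.lstsq(X, perf, rcond=None)
print(f"P0 = {P0:.3f}, alpha = {alpha:.3f} per log-unit of compute")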

What is Test-time Scaling

Test-time Scaling is broader than just controlling length - it's about how model performance improves with additional compute at inference time.

Instead of improving model performance by training longer or using more parameters, we improve it by allowing more "thinking time" during inference. This extra thinking is measured in tokens generated during reasoning.

It's not just about controlling length but about scaling compute resources. The paper shows that more "thinking time" often leads to better answers, and demonstrates this with scaling curves of performance against compute. For example, AIME24 performance scales from 50% → 57% with more test-time compute.

Example:

| Compute Level | Decision Flow |
| --- | --- |
| Low Compute | Direct answer: "The answer is 42" |
| Medium Compute | Step by step: 1) First we... 2) Then... "Answer is 45" |
| High Compute | Step by step + verification: 1) First we... 2) Then... "Wait, let me verify..." "Made a mistake..." "Answer is 47" |

Characteristics:
• Low: Quick response, no explanations
• Medium: Shows work, single-pass computation
• High: Shows work, validates, self-corrects
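
One way to realize these three regimes with the BudgetForcing class from the implementation above; the specific token budgets are my own illustrative choices, not values from the paper:

# Hypothetical presets; the token numbers are illustrative, not from the paper.
BUDGET_PRESETS = {
    "low":    BudgetForcing(min_tokens=None, max_tokens=256),    # direct answer
    "medium": BudgetForcing(min_tokens=512,  max_tokens=2048),   # show work, single pass
    "high":   BudgetForcing(min_tokens=2048, max_tokens=8192),   # work + self-verification
}

forcer = BUDGET_PRESETS["high"]   # pick the regime that matches the compute level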

Comparison

Traditional scaling: Train bigger models or train on more data
Test-time scaling: Same model, but allow more compute during inference

Comparison with Other Methods:

They compared budget forcing with:

a) Token-conditional control:

$\text{Performance}_{\text{tc}} = f(\min(T_{\text{actual}}, T_{\text{target}}))$

b) Step-conditional control:

$\text{Performance}_{\text{sc}} = f\left(\sum_{i=1}^{N} \min(S_i, S_{\text{target}})\right)$

c) Class-conditional control:

$\text{Performance}_{\text{cc}} = \mathbb{E}_{c \in C}[f(T_c)]$
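
Here T_actual and T_target are the generated and requested token counts, S_i is the length of reasoning step i with S_target the per-step cap, and T_c is the thinking length induced by prompt class c from a set of classes C. These baselines are prompt-based controls (the budget is stated in the prompt rather than enforced during decoding); the sketch below shows what such prompts might look like, with wording that is my assumption rather than the paper's exact templates:

def make_control_prompt(question, method, budget):
    """Build a prompt for the baseline control methods (illustrative wording)."""
    if method == "token_conditional":        # budget = target number of thinking tokens
        control = f"Think for at most {budget} tokens before answering."
    elif method == "step_conditional":       # budget = target number of reasoning steps
        control = f"Use at most {budget} reasoning steps, then give your answer."
    elif method == "class_conditional":      # budget = "short" or "long"
        control = "Answer quickly." if budget == "short" else "Think carefully and at length."
    else:
        raise ValueError(f"unknown control method: {method}")
    return f"{control}\n\nQuestion: {question}"

print(make_control_prompt("What is 12 * 13?", "step_conditional", 3))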

Budget forcing outperformed these alternatives on their metrics:

| Method | Control | Scaling | Performance |
| --- | --- | --- | --- |
| Budget Forcing | 100% | 15 | 56.7 |
| Token-conditional | 40% | -24 | 40.0 |
| Step-conditional | 60% | 3 | 36.7 |
| Class-conditional | 50% | 25 | 36.7 |

Practical Implementation:

# Minimal stand-ins for the helpers, mirroring the BudgetForcing class above
def force_end_generation(model_output):
    # Cut thinking short and steer the model into its final answer
    return model_output + "\nFinal Answer:"

def is_attempting_end(model_output):
    # Heuristic: the model looks like it is wrapping up its reasoning
    return model_output.endswith(("Final Answer:", "Therefore,"))

def append_wait_token(model_output):
    # Encourage the model to keep reasoning
    return model_output + "\nWait"

def apply_budget_forcing(model_output, current_tokens, max_tokens):
    """
    Apply budget forcing to the current model output.

    Parameters:
        model_output: text generated so far
        current_tokens: number of tokens generated so far
        max_tokens: maximum allowed thinking tokens
    """
    # Approaching the token limit: force the model to answer now
    if current_tokens >= max_tokens * 0.95:
        return force_end_generation(model_output)

    # Model tries to stop well before the budget is used: keep it thinking
    if is_attempting_end(model_output) and current_tokens < max_tokens * 0.8:
        return append_wait_token(model_output)

    return model_output
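
A quick illustrative call (the numbers are made up):

draft = "Let me compute the sum... Therefore,"
print(apply_budget_forcing(draft, current_tokens=120, max_tokens=1000))
# -> the draft with "\nWait" appended, since it tried to stop well under budget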

The success of budget forcing comes from its ability to:

  1. Maintain strict control over compute usage
  2. Allow natural scaling of reasoning depth
  3. Prevent premature termination
  4. Enable graceful degradation at compute limits

A few questions about budget forcing

  1. How much compute do we allow during inference?
    If we use budget forcing, do we have to set a compute budget manually? How large is a good budget?
  2. How does this additional compute translate to better performance?
    This looks like a new scaling law: the more compute you put in, the better the result you get. But does that also apply here?
  3. How can we control and optimize this trade-off?
    That is, how did they find the best control method, and how do they evaluate it?

Check the answers here: Questions of budget forcing