Abstract: Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only for the achievable performance of LLMs, but also for the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
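To make the idea of allocating test-time compute per prompt more concrete, here is a minimal Python sketch of one way such a selection could work. It is not the authors' implementation. It assumes prompts are grouped into uniform difficulty bins by an estimated pass rate, and that for each bin we already know how well each strategy performs at a fixed budget. The function names, the strategy labels (`"revisions"`, `"beam_search"`), the binning rule, and all accuracy numbers are hypothetical placeholders chosen for illustration.

```python
from typing import Dict


def difficulty_bin(pass_rate: float, n_bins: int = 5) -> int:
    """Map an estimated pass rate in [0, 1] to a bin index; 0 is the easiest bin.

    Assumption: uniform-width bins. Other binning rules are equally plausible.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return min(int((1.0 - pass_rate) * n_bins), n_bins - 1)


def pick_strategy_per_bin(accuracy: Dict[int, Dict[str, float]]) -> Dict[int, str]:
    """For each difficulty bin, choose the strategy with the highest accuracy."""
    return {
        bin_id: max(by_strategy, key=by_strategy.get)
        for bin_id, by_strategy in accuracy.items()
    }


# Placeholder numbers for illustration only; they are NOT results from the paper.
validation_accuracy = {
    0: {"revisions": 0.90, "beam_search": 0.85},
    2: {"revisions": 0.50, "beam_search": 0.60},
    4: {"revisions": 0.10, "beam_search": 0.20},
}

policy = pick_strategy_per_bin(validation_accuracy)
chosen = policy[difficulty_bin(0.5)]  # a prompt with a 50% estimated pass rate
```

With this lookup in place, a new prompt is first assigned a difficulty bin and then served with whichever strategy did best for that bin, which is the sense in which the allocation is adaptive per prompt.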

What problem does this paper attempt to address?

The paper explores how to improve the performance of large language models (LLMs) by spending more computation at inference time, particularly on open-ended natural language tasks. The core question is: given a fixed but non-trivial amount of inference-time compute, how much can an LLM improve its response to a challenging prompt? The main contributions can be summarized as follows:

1. **Exploring Effective Utilization of Test-Time Computation**:
   - Analyzes two main mechanisms for scaling test-time computation: searching against dense, process-based verifier reward models, and adaptively updating the model's distribution over a response given the prompt at test time.
   - Finds that the effectiveness of different strategies depends on the difficulty of the prompt and on the base model used.
2. **Proposing a Compute-Optimal Scaling Strategy**:
   - Proposes a "compute-optimal" scaling strategy that allocates test-time compute adaptively for each prompt.
   - With this strategy, the authors improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline.
3. **Comparing Test-Time and Pretraining Computation**:
   - In a FLOPs-matched evaluation, on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.
   - Results indicate that for simpler problems, test-time computation generally outperforms additional pretraining, whereas for more difficult problems, more pretraining is needed to achieve better results.
4. **Unifying Test-Time Computation Perspectives**:
   - Unifies different approaches to test-time computation and analyzes several representative methods, including modifying the proposal distribution and optimizing against verifiers.
5. **Extending Test-Time Computation via Verifiers**:
   - Investigates different test-time search methods, particularly those combined with process reward models (PRMs) used as verifiers.
   - Covers best-of-N weighted, beam search, and lookahead search; the last of these improves value estimation at each step by using lookahead rollouts during the search.

In summary, the paper demonstrates through a series of experiments that allocating test-time compute judiciously can significantly improve LLM performance on challenging problems. It also characterizes the trade-off between test-time compute and pretraining compute. These findings matter both for building more general, self-improving agents and for shaping future LLM pretraining strategies.
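As an illustration of the verifier-based selection mentioned above, the following Python sketch contrasts plain best-of-N with one common reading of "best-of-N weighted": verifier scores are summed across all samples that share the same final answer, and the answer with the largest total is returned. The summary does not spell out this procedure, so treat the aggregation rule as an assumption. The function names and the sample scores are hypothetical, and no real model or reward model is called.

```python
from collections import defaultdict
from typing import List, Tuple

Sample = Tuple[str, float]  # (final answer, verifier score for that sample)


def best_of_n(samples: List[Sample]) -> str:
    """Plain best-of-N: return the answer of the single highest-scoring sample."""
    if not samples:
        raise ValueError("need at least one sample")
    return max(samples, key=lambda pair: pair[1])[0]


def best_of_n_weighted(samples: List[Sample]) -> str:
    """Weighted best-of-N: sum scores per distinct answer, return the top answer."""
    if not samples:
        raise ValueError("need at least one sample")
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


# Made-up scores for illustration: "41" never has the top single score,
# but two moderately scored samples agree on it.
samples = [("42", 0.9), ("41", 0.5), ("41", 0.5), ("7", 0.1)]

plain_choice = best_of_n(samples)              # picks "42" (single best score 0.9)
weighted_choice = best_of_n_weighted(samples)  # picks "41" (total 1.0 vs 0.9)
```

The two rules can disagree, as they do here: weighting rewards answers that several samples agree on, while plain best-of-N trusts the single highest verifier score.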

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

On Speeding Up Language Model Evaluation

Inference Scaling for Long-Context Retrieval Augmented Generation

More Compute Is What You Need

Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers

Scaling Laws for Predicting Downstream Performance in LLMs

A Simple Model of Inference Scaling Laws

The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

Language models scale reliably with over-training and on downstream tasks

Unlock Predictable Scaling from Emergent Abilities

Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency