Abstract: Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models, such as by repeatedly sampling solutions to a coding problem until it passes unit tests. The central thesis of this paper is that there is no free lunch for inference scaling: indefinite accuracy improvement through resampling can only be realized if the "verifier" (in this case, a set of unit tests) is perfect. When the verifier is imperfect, as it almost always is in domains such as reasoning or coding (for example, unit tests have imperfect coverage), there is a nonzero probability of false positives: incorrect solutions that pass the verifier. Resampling cannot decrease this probability, so it imposes an upper bound to the accuracy of resampling-based inference scaling even with an infinite compute budget. We find that there is a very strong correlation between the model's single-sample accuracy (i.e. accuracy without unit tests) and its false positive rate on coding benchmarks HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model (Fig. 1a). When we consider that false positives have a negative utility compared to abstaining from producing a solution, it bends the inference scaling curve further downward. Empirically, we find that the optimal number of samples can be less than 10 under realistic assumptions (Fig. 1b). Finally, we show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
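
The accuracy ceiling described in the abstract can be illustrated for a single task with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: samples are independent, each sample is correct with probability `p_correct`, each sample is an incorrect solution that still passes the tests with probability `p_false_positive`, and the verifier never rejects a correct solution. The function names and all numbers are hypothetical.

```python
def resampling_accuracy(p_correct: float, p_false_positive: float, k: int) -> float:
    """Probability that resampling up to k times returns a correct solution.

    Assumed procedure: draw samples one at a time and return the first one
    that passes the verifier; if none of the k samples passes, return nothing.
    """
    p_pass = p_correct + p_false_positive  # chance a single sample passes the tests
    if p_pass == 0:
        return 0.0
    p_any_pass = 1 - (1 - p_pass) ** k     # chance that at least one of k samples passes
    # Given that a sample passed, it is correct with probability p_correct / p_pass.
    return p_any_pass * p_correct / p_pass


def accuracy_ceiling(p_correct: float, p_false_positive: float) -> float:
    """Limit of resampling_accuracy as k grows without bound."""
    p_pass = p_correct + p_false_positive
    return p_correct / p_pass if p_pass > 0 else 0.0
```

Under these assumptions, a weak model with `p_correct = 0.2` and `p_false_positive = 0.1` tops out at an accuracy of 2/3 however many samples it draws, so it cannot match a stronger model whose single-sample accuracy is, say, 0.8. With `p_false_positive = 0` (a perfect verifier) the ceiling is 1.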


### What problem does this paper attempt to solve?

This paper examines the limits of scaling inference through resampling, particularly when the verifier is imperfect. The main research questions are as follows:

1. **Limitations of Inference Scaling**:
   - The paper questions whether weaker large language models (LLMs) can match or exceed the performance of stronger models on certain tasks by spending more inference compute, for example through resampling.
   - The paper points out that resampling can improve accuracy indefinitely only when the verifier is perfect. In practice, verifiers are usually imperfect (for example, unit tests with insufficient coverage), so incorrect solutions can pass verification.

2. **The Impact of Verifier Imperfection on Model Performance**:
   - An imperfect verifier admits "false positives", that is, incorrect solutions that pass verification. Resampling cannot reduce this probability, so accuracy has an upper bound even with unlimited compute.
   - The paper finds that weaker models are more likely to produce false positives. This widens their generalization gap and leaves them unable to match the single-sample accuracy of a sufficiently strong model.

3. **The Finite Optimal Number of Samples**:
   - Once the negative utility of false positives is taken into account, the returns curve for resampling bends further downward. Experimental results show that, even at zero computational cost, the optimal number of samples is finite and can be very low (less than 10 under realistic assumptions).
   - If the cost of a false positive outweighs the benefit of a correct solution, the optimal number of samples can even be zero.

4. **The Impact on Code Quality**:
   - Beyond functional correctness, the paper finds that false-positive solutions also tend to have lower code quality (for example, in readability and adherence to naming conventions). Relying on an imperfect verifier therefore affects not only functional correctness but also overall code quality.

### Main Conclusions

- **No Free Lunch**: Improving a weak model's performance indefinitely through resampling requires a perfect verifier. In practice, verifiers are often imperfect, which limits what resampling can achieve.
- **Generalization Gap**: Weaker models produce more false positives when paired with an imperfect verifier, resulting in a larger generalization gap.
- **Optimal Number of Samples**: In practical applications, once the cost of false positives is considered, the optimal number of samples is usually finite and low.
- **Code Quality Degradation**: False-positive solutions are not only functionally unreliable but also score poorly on code quality.

### Significance

These findings emphasize the importance of building high-precision verifiers and highlight the limitations of current inference-scaling methods that rely on imperfect verifiers. Future research needs to explore how to improve verifier quality so that inference-scaling techniques can be applied more reliably.
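
The point about a finite optimal number of samples can be made concrete with a toy calculation. The sketch below is an illustration under stated assumptions, not the paper's experimental setup: samples are independent, the first passing sample is returned, returning nothing has utility 0, a correct solution earns `benefit`, a false positive costs `fp_cost`, and compute is free. The `Task` class, the function names, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    p_correct: float         # chance one sample is correct (and passes the tests)
    p_false_positive: float  # chance one sample is wrong but still passes the tests


def expected_utility(task: Task, k: int, benefit: float = 1.0, fp_cost: float = 1.0) -> float:
    """Expected utility of returning the first passing sample among up to k draws."""
    p_pass = task.p_correct + task.p_false_positive
    if p_pass == 0:
        return 0.0  # nothing ever passes, so the system always abstains
    p_any_pass = 1 - (1 - p_pass) ** k
    # Average payoff of a returned solution: correct with prob. p_correct / p_pass.
    payoff = (task.p_correct * benefit - task.p_false_positive * fp_cost) / p_pass
    return p_any_pass * payoff


def best_budget(tasks: list[Task], max_k: int, benefit: float = 1.0, fp_cost: float = 1.0) -> int:
    """Sample budget in 0..max_k that maximizes mean expected utility over tasks."""
    def mean_utility(k: int) -> float:
        return sum(expected_utility(t, k, benefit, fp_cost) for t in tasks) / len(tasks)
    return max(range(max_k + 1), key=mean_utility)


# Hypothetical mix: one task the model often solves, and one it never solves
# but for which it occasionally produces a solution that slips past the tests.
toy_tasks = [Task(p_correct=0.5, p_false_positive=0.0),
             Task(p_correct=0.0, p_false_positive=0.02)]
```

With these made-up numbers, `best_budget(toy_tasks, max_k=100)` selects 5 samples: the easy task saturates after a few draws, while every further draw on the hard task only adds false-positive risk. Raising `fp_cost` to 100 pushes the best budget to 0, which corresponds to abstaining altogether.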

Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers
