Abstract: Despite significant advancements in the general capability of large language models (LLMs), they continue to struggle with consistent and accurate reasoning, especially in complex tasks such as mathematical and code reasoning. One key limitation is that LLMs are trained primarily on correct solutions, which reduces their ability to detect and learn from errors and, in turn, to reliably verify and rank outputs. To address this, we scale up inference-time computation by generating multiple reasoning paths and employing verifiers to assess and rank the generated outputs by correctness. To facilitate this, we introduce a comprehensive dataset of correct and incorrect solutions for math and code tasks, generated by multiple LLMs. This diverse set of solutions enables verifiers to more effectively distinguish correct answers from erroneous outputs and rank them accordingly. The training methods for building the verifiers were selected based on an extensive comparison of existing approaches. Moreover, to leverage the unique strengths of different reasoning strategies, we propose a novel collaborative method that integrates Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification. CoT provides a clear, step-by-step reasoning process that enhances interpretability, while PoT, being executable, offers a precise and error-sensitive validation mechanism. By combining these strengths, our approach significantly improves the accuracy and reliability of reasoning verification. Our verifiers, Math-Rev and Code-Rev, demonstrate substantial performance gains over existing LLMs, achieving state-of-the-art results on benchmarks such as GSM8k and MATH and even outperforming GPT-4o with Qwen-72B-Instruct as the reasoner.
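
The sample-then-rank idea described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the `Candidate` fields, the `best_of_n` helper, the `toy_score` stand-in for a trained verifier, and in particular the `agreement_bonus` heuristic, which is only one simple way CoT and PoT signals might be combined and is not claimed to be the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    cot_text: str                     # step-by-step reasoning (CoT)
    cot_answer: str                   # final answer stated by the CoT solution
    pot_answer: Optional[str] = None  # result of executing a paired program (PoT), if any


def best_of_n(candidates: List[Candidate],
              verifier_score: Callable[[Candidate], float],
              agreement_bonus: float = 0.0) -> Candidate:
    """Return the candidate with the highest verifier score.

    Assumption: a candidate whose executed PoT result matches its CoT
    answer receives `agreement_bonus` on top of the verifier score.
    """
    if not candidates:
        raise ValueError("need at least one candidate")

    def total(c: Candidate) -> float:
        agrees = c.pot_answer is not None and c.pot_answer == c.cot_answer
        return verifier_score(c) + (agreement_bonus if agrees else 0.0)

    return max(candidates, key=total)


def toy_score(c: Candidate) -> float:
    # Stand-in for a trained verifier; the scores are made up for illustration.
    return {"41": 0.7, "42": 0.6}.get(c.cot_answer, 0.0)


cands = [
    Candidate("reasoning path A", "41", pot_answer="42"),  # CoT and PoT disagree
    Candidate("reasoning path B", "42", pot_answer="42"),  # CoT and PoT agree
]

picked_plain = best_of_n(cands, toy_score)
picked_collab = best_of_n(cands, toy_score, agreement_bonus=0.5)
print(picked_plain.cot_answer, picked_collab.cot_answer)
```

In this toy setup the verifier score alone prefers the first candidate, while the agreement bonus shifts the choice to the candidate whose program output confirms its stated answer.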

CodeMind: A Framework to Challenge Large Language Models for Code Reasoning

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Reasoning Runtime Behavior of a Program with LLM: How Far Are We?

Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification

When Do Program-of-Thoughts Work for Reasoning?

Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning

Can LLMs Reason in the Wild with Programs?

Enhancing Code Generation Performance of Smaller Models by Distilling the Reasoning Ability of LLMs

StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models

VISUALCODER: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning

The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code

What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models