Abstract: Despite significant advancements in the general capability of large language models (LLMs), they continue to struggle with consistent and accurate reasoning, especially in complex tasks such as mathematical and code reasoning. One key limitation is that LLMs are trained primarily on correct solutions, which reduces their ability to detect and learn from errors and, in turn, to reliably verify and rank outputs. To address this, we scale up inference-time computation by generating multiple reasoning paths and employing verifiers to assess and rank the generated outputs by correctness. To facilitate this, we introduce a comprehensive dataset consisting of correct and incorrect solutions for math and code tasks, generated by multiple LLMs. This diverse set of solutions enables verifiers to distinguish correct answers from erroneous outputs more effectively and to rank them accordingly. The training methods for building verifiers were selected based on an extensive comparison of existing approaches. Moreover, to leverage the unique strengths of different reasoning strategies, we propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification. CoT provides a clear, step-by-step reasoning process that enhances interpretability, while PoT, being executable, offers a precise and error-sensitive validation mechanism. By combining the strengths of both, our approach significantly improves the accuracy and reliability of reasoning verification. Our verifiers, Math-Rev and Code-Rev, demonstrate substantial performance gains over existing LLMs, achieving state-of-the-art results on benchmarks such as GSM8k and MATH and even outperforming GPT-4o with Qwen-72B-Instruct as the reasoner.
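
The idea of sampling several solutions and letting a verifier rank them, with an executable PoT program serving as a cross-check on the CoT answer, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Candidate` fields, the `run_pot` helper, the convention that a PoT program stores its result in a variable named `answer`, and the fixed 0.5 penalty for disagreement are all assumptions made for illustration, and the verifier is a stand-in callable rather than a trained model such as Math-Rev.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    cot: str         # step-by-step reasoning text (hypothetical field)
    cot_answer: str  # final answer extracted from the CoT (hypothetical field)
    pot: str         # program assumed to store its result in `answer`


def run_pot(program: str) -> Optional[str]:
    """Execute a PoT program; return str(answer), or None on any failure."""
    scope: dict = {}
    try:
        # Illustration only: real systems should run this in a sandbox.
        exec(program, {}, scope)
    except Exception:
        return None
    return str(scope["answer"]) if "answer" in scope else None


def rank(cands: List[Candidate],
         verifier: Callable[[str], float]) -> List[Candidate]:
    """Sort candidates by verifier score, penalizing CoT/PoT disagreement."""
    def combined(c: Candidate) -> float:
        score = verifier(c.cot)
        if run_pot(c.pot) != c.cot_answer:
            score *= 0.5  # arbitrary penalty, assumed for this sketch
        return score
    return sorted(cands, key=combined, reverse=True)


# Toy usage: a lookup table stands in for a trained verifier.
toy_scores = {"3 + 4 = 7": 0.8, "3 + 4 = 8": 0.9}
candidates = [
    Candidate("3 + 4 = 8", "8", "answer = 3 + 4"),
    Candidate("3 + 4 = 7", "7", "answer = 3 + 4"),
]
ranked = rank(candidates, lambda cot: toy_scores[cot])
best = ranked[0]
```

In this toy run the second candidate is ranked first even though the stand-in verifier scores its CoT lower, because the executed program disagrees with the first candidate's stated answer. How the two signals are actually combined is a design choice that this sketch does not settle.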

Enhancing Mathematical Reasoning in LLMs with Background Operators

LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning

Arithmetic Reasoning with LLM: Prolog Generation & Permutation

Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems

Exploring an LM to generate Prolog Predicates from Mathematics Questions

LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning

Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

MathPrompter: Mathematical Reasoning using Large Language Models

INC-Math: Integrating Natural Language and Code for Enhanced Mathematical Reasoning in Large Language Models

Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English

Large Language Models for Mathematical Reasoning: Progresses and Challenges