Abstract: Large Language Models (LLMs) employing Chain-of-Thought (CoT) prompting have broadened the scope for improving multi-step reasoning capabilities. We generally divide multi-step reasoning into two phases: path generation to generate the reasoning path(s); and answer calibration post-processing the reasoning path(s) to obtain a final answer. However, the existing literature lacks systematic analysis on different answer calibration approaches. In this paper, we summarize the taxonomy of recent answer calibration techniques and break them down into step-level and path-level strategies. We then conduct a thorough evaluation on these strategies from a unified view, systematically scrutinizing step-level and path-level answer calibration across multiple paths. Experimental results reveal that integrating the dominance of both strategies tends to derive optimal outcomes. Our study holds the potential to illuminate key insights for optimizing multi-step reasoning with answer calibration.

What problem does this paper attempt to address?

This paper addresses the lack of systematic analysis of answer calibration methods in multi-step reasoning. Specifically, it focuses on the following aspects:

1. **Two main stages of multi-step reasoning**:
   - **Path generation**: Generate one or more reasoning paths.
   - **Answer calibration**: Post-process the generated reasoning paths to obtain the final answer.
2. **Deficiencies in existing literature**:
   - Existing literature lacks a systematic analysis of different answer calibration methods.
   - There is no comprehensive comparison and evaluation of step-level and path-level calibration strategies.
3. **Research objectives**:
   - **Summarize and classify**: Summarize recent answer calibration techniques and divide them into step-level and path-level strategies.
   - **Unified view**: Evaluate these strategies from a unified perspective, systematically examining step-level and path-level answer calibration across multiple paths.
   - **Optimize multi-step reasoning**: Draw out key insights on answer calibration that help make multi-step reasoning more accurate, consistent, and reliable.
4. **Specific research questions**:
   - **Condition analysis**: Under which conditions does answer calibration significantly improve multi-step reasoning performance?
   - **Advantages and disadvantages of strategies**: What are the strengths and weaknesses of step-level and path-level answer calibration, and how can optimal performance be achieved?
   - **Robustness and generalization ability**: How robust are answer calibration strategies, and how well do they generalize?

### Main contributions of the paper

- **Systematic analysis**: The paper carries out a systematic analysis of different answer calibration methods, which it describes as missing from the existing literature.
- **Unified framework**: A unified framework is proposed that combines step-level and path-level calibration strategies.
- **Experimental verification**: The effects of different calibration strategies are verified on five representative multi-step reasoning tasks (covering arithmetic and commonsense reasoning).
- **Key finding**: Combining step-level and path-level calibration strategies usually achieves the best results, especially in the zero-shot setting.

### Conclusion

Through systematic analysis and experimental verification, this paper provides new perspectives and methods for optimizing answer calibration in multi-step reasoning, which helps improve the performance of large language models on multi-step reasoning tasks.
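To make the distinction between the two calibration levels concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: path-level calibration is approximated by a majority vote over final answers (in the style of self-consistency), step-level calibration by discarding paths that contain a step rejected by a checker, and the combination by voting over the surviving paths. The names `path_level_vote`, `step_level_filter`, `combined_calibration`, and `toy_step_ok`, as well as the toy arithmetic paths, are hypothetical.

```python
import operator
from collections import Counter
from typing import Callable, List, Tuple

# A reasoning path: its list of step strings and the final answer it reaches.
Path = Tuple[List[str], str]

def path_level_vote(paths: List[Path]) -> str:
    """Path-level calibration (assumed form): majority vote over final answers."""
    counts = Counter(answer for _, answer in paths)
    return counts.most_common(1)[0][0]

def step_level_filter(paths: List[Path], step_ok: Callable[[str], bool]) -> List[Path]:
    """Step-level calibration (assumed form): keep paths whose steps all pass."""
    return [p for p in paths if all(step_ok(step) for step in p[0])]

def combined_calibration(paths: List[Path], step_ok: Callable[[str], bool]) -> str:
    """Filter at step level, then vote at path level over what remains."""
    kept = step_level_filter(paths, step_ok)
    # Fall back to all paths if the checker rejected every one of them.
    return path_level_vote(kept or paths)

_OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def toy_step_ok(step: str) -> bool:
    """Hypothetical checker for steps written exactly as 'a op b = c'."""
    try:
        lhs, rhs = step.split("=")
        a, op, b = lhs.split()
        return _OPS[op](int(a), int(b)) == int(rhs)
    except (ValueError, KeyError):
        return False

# Toy paths for the question "(3 + 4) * 2": two share the same faulty first step.
PATHS: List[Path] = [
    (["3 + 4 = 8", "8 * 2 = 16"], "16"),
    (["3 + 4 = 7", "7 * 2 = 14"], "14"),
    (["3 + 4 = 8", "8 * 2 = 16"], "16"),
]

if __name__ == "__main__":
    print(path_level_vote(PATHS))                    # vote over all paths
    print(combined_calibration(PATHS, toy_step_ok))  # vote over verified paths
```

In this toy case the plain vote follows the two paths that repeat the same arithmetic slip, while the combined variant votes only over the path whose steps check out. A real step-level verifier would typically be a learned model or an LLM prompt rather than an arithmetic parser; the parser is used here only so the example runs on its own.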

Towards A Unified View of Answer Calibration for Multi-Step Reasoning

Answering Questions by Meta-Reasoning over Multiple Chains of Thought

DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs

Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models

Let's Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models

Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models

Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning

Calibrating Reasoning in Language Models with Internal Consistency

Optimizing Chain-of-Thought Reasoning: Tackling Arranging Bottleneck via Plan Augmentation

Enhancing the Completeness of Rationales for Multi-Step Question Answering

CoQ: An Empirical Framework for Multi-hop Question Answering Empowered by Large Language Models

Distilling Reasoning Ability from Large Language Models with Adaptive Thinking

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models

The Impact of Reasoning Step Length on Large Language Models

Multimodal Chain-of-Thought Reasoning in Language Models

Chain-of-Probe: Examing the Necessity and Accuracy of CoT Step-by-Step

Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning