Abstract: Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper "Language Models Are Hidden Reasoners: Unlocking Latent Reasoning Capabilities Via Self-Rewarding" addresses the weakness of large language models (LLMs) on complex tasks that require multiple reasoning steps. Existing prompt-based methods such as Chain-of-Thought (CoT) can improve LLM reasoning at inference time, but optimizing reasoning capabilities during training remains a challenge.

### Main Contributions

1. **Theoretical Framework**: The paper proposes LaTent Reasoning Optimization (LaTRO), a framework that treats the reasoning process as sampling from a latent distribution and optimizes it through variational methods.
2. **Self-Rewarding Mechanism**: The model's own probability estimates serve as the reward, which improves both the reasoning process and the ability to evaluate reasoning quality without external feedback or reward models.
3. **Performance Improvement**: Experiments show that LaTRO improves performance across multiple model architectures and reasoning tasks. On GSM8K, zero-shot accuracy improves by an average of 12.5% over the base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B.

### Solution

LaTRO addresses these issues through the following approaches:

- **Variational Optimization**: Models the reasoning process as sampling from a latent distribution and optimizes this distribution through variational methods.
- **Self-Rewarding Mechanism**: Uses the model's own probability estimates for the reasoning paths it generates to update the model parameters, which strengthens its ability to generate high-quality reasoning paths.
- **No External Feedback Required**: The optimization process does not rely on external feedback or reward models, so the model improves its reasoning capabilities through self-improvement.

### Experimental Validation

The paper validates LaTRO through experiments on the GSM8K and ARC-Challenge datasets. The results show that LaTRO improves zero-shot reasoning accuracy over the base models and also outperforms the supervised fine-tuning (SFT) baselines.

### Conclusion

The findings indicate that pre-trained large language models possess latent reasoning capabilities that can be unlocked and enhanced through the optimization proposed in LaTRO. The approach improves reasoning performance and also suggests that pre-trained models can serve as explicit reward models for evaluating the quality of reasoning paths.
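The variational and self-rewarding ideas described under Solution can be illustrated with a small numerical sketch. It assumes the standard evidence lower bound $\log p_\theta(y \mid x) \ge \mathbb{E}_{z \sim q}[\log p_\theta(y \mid x, z)] - \mathrm{KL}(q(z \mid x) \,\|\, p_0(z \mid x))$, where $x$ is the question, $z$ a sampled reasoning path, $y$ the gold answer, and $p_0$ a reference model. This notation, the function names, the toy log-probabilities, and the mean-baseline advantage are all assumptions made for illustration; the summary above does not state the paper's exact objective or gradient estimator.

```python
from typing import List, Sequence


def self_reward(answer_token_logprobs: Sequence[float]) -> float:
    """Hypothetical self-reward for one reasoning path: the log-probability
    the model assigns to the gold answer, i.e. the sum of its token log-probs."""
    return sum(answer_token_logprobs)


def lower_bound_estimate(
    rewards: Sequence[float],
    logq: Sequence[float],
    logprior: Sequence[float],
) -> float:
    """Monte Carlo estimate of E_q[log p(y|x,z)] - KL(q || p0) from K sampled
    reasoning paths z ~ q, using log q(z|x) and log p0(z|x) for each sample."""
    k = len(rewards)
    return sum(r - (lq - lp) for r, lq, lp in zip(rewards, logq, logprior)) / k


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the mean reward as a simple baseline (an assumed
    variance-reduction choice, not necessarily the paper's estimator)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Toy data: answer-token log-probs after three sampled reasoning paths.
answer_logprobs = [[-0.1, -0.2], [-1.0, -1.5], [-0.05, -0.05]]
rewards = [self_reward(lp) for lp in answer_logprobs]

# With the policy equal to the reference model, the KL term is zero.
logq = [-12.0, -15.0, -11.0]
logprior = [-12.0, -15.0, -11.0]
bound = lower_bound_estimate(rewards, logq, logprior)
advantages = centered_advantages(rewards)

print("rewards:", rewards)
print("lower-bound estimate:", bound)
print("advantages:", advantages)
```

Reasoning paths that make the gold answer more likely receive a positive advantage and would be reinforced, while the KL term penalizes drifting away from the reference model. In a real setup the log-probabilities would come from the LLM's own forward passes rather than hand-written numbers.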
