Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

Haolin Chen, Yihao Feng, Zuxin Liu, Weiran Yao, Akshara Prabhakar, Shelby Heinecke, Ricky Ho, Phil Mui, Silvio Savarese, Caiming Xiong, Huan Wang
2024-11-07
Abstract: Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner. The code of LaTRO is available at https://github.com/SalesforceAIResearch/LaTRO.
Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper "Language Models Are Hidden Reasoners: Unlocking Latent Reasoning Capabilities Via Self-Rewarding" addresses the shortcomings of large language models (LLMs) on complex multi-step reasoning tasks. Despite their impressive capabilities, LLMs still struggle with tasks that require multiple reasoning steps. Prompt-based methods such as Chain-of-Thought (CoT) can improve LLM reasoning at inference time, but optimizing reasoning capabilities during training remains a challenge.

### Main Contributions

1. **Theoretical Framework**: The paper proposes LaTent Reasoning Optimization (LaTRO), a framework that treats the reasoning process as sampling from a latent distribution and optimizes that distribution through variational methods.
2. **Self-Rewarding Mechanism**: LaTRO uses the model's own probability estimates as the reward signal, improving both the reasoning process and the ability to evaluate reasoning quality without external feedback or reward models.
3. **Performance Improvement**: On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B.

### Solution

LaTRO addresses these issues through the following approaches:

- **Variational Optimization**: Models the reasoning process as sampling from a latent distribution and optimizes this distribution through variational methods.
- **Self-Rewarding Mechanism**: Uses the model's own probability estimates for the reasoning paths it generates to update the model parameters, making high-quality reasoning paths more likely to be generated.
- **No External Feedback Required**: The optimization process does not rely on external feedback or reward models, so the model improves its reasoning capabilities through self-improvement.

### Experimental Validation

The paper validates LaTRO through experiments on the GSM8K and ARC-Challenge datasets using multiple model architectures. The results show that LaTRO improves reasoning accuracy in zero-shot settings and compares favorably with supervised fine-tuning (SFT) baselines.

### Conclusion

The findings indicate that pre-trained large language models possess latent reasoning capabilities, which can be unlocked and enhanced through the optimization approach proposed in LaTRO. The approach improves reasoning performance and also suggests that pre-trained models can serve as explicit reward models for evaluating the quality of reasoning paths.
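
To make the self-rewarding idea from the Solution section more concrete, here is a minimal sketch of how a reward could be computed from the model's own log-probabilities. It is an illustration under stated assumptions, not the authors' implementation: it assumes a generic variational lower bound of the form E_q[log p(y | x, z)] - beta * KL(q(z | x) || p0(z | x)), where z is a sampled reasoning path, and it swaps in a leave-one-out baseline as a common variance-reduction technique. The function names, the `beta` weight, and the example numbers are all hypothetical.

```python
from typing import List, Sequence


def self_rewards(
    answer_logprobs: Sequence[float],
    rationale_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,
) -> List[float]:
    """Score each sampled reasoning path with the model's own probabilities.

    Assumed form (hypothetical, one term of a variational lower bound):
        reward_k = log p(y | x, z_k) - beta * (log q(z_k | x) - log p0(z_k | x))
    where z_k is the k-th sampled rationale, q is the current model and
    p0 is a fixed reference model.
    """
    if not (len(answer_logprobs) == len(rationale_logprobs) == len(reference_logprobs)):
        raise ValueError("all inputs must have one entry per sampled rationale")
    return [
        a - beta * (q - p0)
        for a, q, p0 in zip(answer_logprobs, rationale_logprobs, reference_logprobs)
    ]


def leave_one_out_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract from each reward the mean reward of the *other* samples.

    This baseline is an assumption of this sketch; it keeps the policy
    gradient estimate unbiased while reducing its variance.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples for a leave-one-out baseline")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical rationales for one question: the first makes the
    # gold answer much more likely than the other two.
    rewards = self_rewards(
        answer_logprobs=[-0.1, -2.0, -1.0],
        rationale_logprobs=[-5.0, -5.0, -5.0],
        reference_logprobs=[-5.0, -5.0, -5.0],
    )
    print(leave_one_out_advantages(rewards))
```

In a training loop, each advantage would weight the log-probability of its rationale in a policy-gradient update, so rationales that make the correct answer more likely under the model itself are reinforced. No external reward model appears anywhere in the computation, which is the point of the self-rewarding setup.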