OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models

Jun Wang, Meng Fang, Ziyu Wan, Muning Wen, Jiachen Zhu, Anjie Liu, Ziqin Gong, Yan Song, Lei Chen, Lionel M. Ni, Linyi Yang, Ying Wen, Weinan Zhang
2024-10-13
Abstract: In this technical report, we introduce OpenR, an open-source framework designed to integrate key components for enhancing the reasoning capabilities of large language models (LLMs). OpenR unifies data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding into a cohesive software platform. Our goal is to establish an open-source platform and community to accelerate the development of LLM reasoning. Inspired by the success of OpenAI's o1 model, which demonstrated improved reasoning abilities through step-by-step reasoning and reinforcement learning, OpenR integrates test-time compute, reinforcement learning, and process supervision to improve reasoning in LLMs. Our work is the first to provide an open-source framework that explores the core techniques of OpenAI's o1 model with reinforcement learning, achieving advanced reasoning capabilities beyond traditional autoregressive methods. We demonstrate the efficacy of OpenR by evaluating it on the MATH dataset, utilising publicly available data and search methods. Our initial experiments confirm substantial gains, with relative improvements in reasoning and performance driven by test-time computation and reinforcement learning through process reward models. The OpenR framework, including code, models, and datasets, is accessible at https://openreasoner.github.io.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to address the performance issues of large language models (LLMs) in complex reasoning tasks. Specifically, it introduces an open-source framework called OpenR, which aims to enhance the reasoning capabilities of large language models by integrating key components such as data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding.

### Main Contributions

1. **Integration of Key Components**: OpenR integrates data acquisition, reinforcement learning training, and non-autoregressive decoding into a unified software platform.
2. **Open Source Platform**: Establishing an open-source platform and community to accelerate the development of LLM reasoning.
3. **Test-Time Computation**: Improving LLM reasoning capabilities through test-time computation and process supervision.
4. **Experimental Validation**: Conducting experiments on the MATH dataset to validate the effectiveness of OpenR, demonstrating significant performance improvements.

### Background and Motivation

- **Limitations of Existing Methods**: Existing LLMs can generate quick responses but lack complex reasoning capabilities. Most methods rely on external prompt systems and cannot truly embed Chain-of-Thought (CoT) capabilities.
- **OpenAI's o1 Model**: OpenAI's o1 model achieved significant performance improvements in fields like mathematics and programming by explicitly embedding the chain-of-thought process, inspiring the design of OpenR.
- **Human Cognitive Models**: Drawing from human cognition's System 1 (fast, automatic) and System 2 (slow, deliberative) modes, OpenR aims to simulate the human deliberative process.

### Methodology

- **Markov Decision Process (MDP)**: Modeling reasoning tasks as MDPs allows the model to generate reasoning steps incrementally and explore multiple reasoning paths through a tree structure.
- **Process Reward Model (PRM)**: Providing feedback on the quality of reasoning steps and final answers through PRM guides the model to generate accurate and meaningful reasoning processes.
- **Data Augmentation**: Using automated methods to generate synthetic samples reduces reliance on expensive human-labeled data, enabling more scalable data collection.
- **Supervised Training**: Fine-tuning PRM through supervised training as a binary classification task to judge the correctness of each reasoning step.
- **Policy Learning**: Training LLMs through reinforcement learning algorithms (such as PPO and GRPO) to continuously optimize and improve during the reasoning process.
- **Decoding Strategies**: Using PRM to evaluate the accuracy of each solution step during testing and selecting the best answer through various strategies (such as majority voting, maximum reward, etc.).

### Experimental Results

- **MATH Dataset**: Experiments on the MATH dataset show that combining process reward models and guided search methods can significantly improve test-time reasoning performance, with a relative improvement of approximately 10%.

### Conclusion

OpenR is an open-source framework that significantly enhances the reasoning capabilities of large language models by integrating test-time computation and process supervision. The framework provides researchers with an open platform, promoting further development in the field of LLM reasoning.
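
To make the decoding strategies mentioned under Methodology more concrete, the following is a minimal Python sketch of how several sampled solutions might be combined into one final answer. It is illustrative only and is not taken from the OpenR codebase: the function names (`majority_vote`, `max_reward`, `reward_weighted_vote`), the idea of reducing per-step PRM scores to one solution score with `min`, and the reward-weighted vote are all assumptions made for this example. The per-step scores are assumed to be probabilities in [0, 1] that a step is correct, as a binary-classification PRM would produce.

```python
from collections import Counter, defaultdict
from typing import Callable, Sequence

# Hypothetical helper names; none of these come from the OpenR codebase.
StepScores = Sequence[float]  # assumed: one PRM probability per reasoning step


def majority_vote(answers: Sequence[str]) -> str:
    """Pick the most frequent final answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def solution_score(step_scores: StepScores,
                   reduce: Callable[[StepScores], float] = min) -> float:
    """Collapse per-step scores into one number.

    Using `min` (a solution is as good as its weakest step) is an
    assumption; taking the last step's score is another common choice.
    """
    return reduce(step_scores)


def max_reward(answers: Sequence[str],
               all_step_scores: Sequence[StepScores],
               reduce: Callable[[StepScores], float] = min) -> str:
    """Return the answer of the single highest-scoring solution."""
    scores = [solution_score(s, reduce) for s in all_step_scores]
    best = max(range(len(answers)), key=scores.__getitem__)
    return answers[best]


def reward_weighted_vote(answers: Sequence[str],
                         all_step_scores: Sequence[StepScores],
                         reduce: Callable[[StepScores], float] = min) -> str:
    """Sum solution scores per distinct answer and return the largest total."""
    totals: dict[str, float] = defaultdict(float)
    for answer, step_scores in zip(answers, all_step_scores):
        totals[answer] += solution_score(step_scores, reduce)
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Four sampled solutions with made-up final answers and step scores.
    answers = ["42", "41", "42", "7"]
    step_scores = [[0.9, 0.6], [0.8, 0.7], [0.6, 0.5], [0.95, 0.9]]
    print(majority_vote(answers))
    print(max_reward(answers, step_scores))
    print(reward_weighted_vote(answers, step_scores))
```

In this toy input the strategies can disagree: voting favors the answer that appears most often, while maximum reward trusts the single solution the PRM rates highest. Which behaves better in practice depends on how well calibrated the reward model is.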