Abstract: In this technical report, we introduce OpenR, an open-source framework designed to integrate key components for enhancing the reasoning capabilities of large language models (LLMs). OpenR unifies data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding into a cohesive software platform. Our goal is to establish an open-source platform and community to accelerate the development of LLM reasoning. Inspired by the success of OpenAI's o1 model, which demonstrated improved reasoning abilities through step-by-step reasoning and reinforcement learning, OpenR integrates test-time compute, reinforcement learning, and process supervision to improve reasoning in LLMs. Our work is the first to provide an open-source framework that explores the core techniques of OpenAI's o1 model with reinforcement learning, achieving advanced reasoning capabilities beyond traditional autoregressive methods. We demonstrate the efficacy of OpenR by evaluating it on the MATH dataset, utilising publicly available data and search methods. Our initial experiments confirm substantial gains, with relative improvements in reasoning and performance driven by test-time computation and reinforcement learning through process reward models. The OpenR framework, including code, models, and datasets, is accessible at https://openreasoner.github.io.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address the performance issues of large language models (LLMs) in complex reasoning tasks. Specifically, it introduces an open-source framework called OpenR, which aims to enhance the reasoning capabilities of large language models by integrating key components such as data acquisition, reinforcement learning training (both online and offline), and non-autoregressive decoding.

### Main Contributions

1. **Integration of Key Components**: OpenR integrates data acquisition, reinforcement learning training, and non-autoregressive decoding into a unified software platform.
2. **Open Source Platform**: Establishing an open-source platform and community to accelerate the development of LLM reasoning.
3. **Test-Time Computation**: Improving LLM reasoning capabilities through test-time computation and process supervision.
4. **Experimental Validation**: Conducting experiments on the MATH dataset to validate the effectiveness of OpenR, demonstrating significant performance improvements.

### Background and Motivation

- **Limitations of Existing Methods**: Existing LLMs can generate quick responses but lack complex reasoning capabilities. Most methods rely on external prompt systems and cannot truly embed Chain-of-Thought (CoT) capabilities.
- **OpenAI's o1 Model**: OpenAI's o1 model achieved significant performance improvements in fields like mathematics and programming by explicitly embedding the chain-of-thought process, inspiring the design of OpenR.
- **Human Cognitive Models**: Drawing from human cognition's System 1 (fast, automatic) and System 2 (slow, deliberative) modes, OpenR aims to simulate the human deliberative process.

### Methodology

- **Markov Decision Process (MDP)**: Modeling reasoning tasks as MDPs allows the model to generate reasoning steps incrementally and explore multiple reasoning paths through a tree structure.
- **Process Reward Model (PRM)**: Providing feedback on the quality of reasoning steps and final answers through PRM guides the model to generate accurate and meaningful reasoning processes.
- **Data Augmentation**: Using automated methods to generate synthetic samples reduces reliance on expensive human-labeled data, enabling more scalable data collection.
- **Supervised Training**: Fine-tuning PRM through supervised training as a binary classification task to judge the correctness of each reasoning step.
- **Policy Learning**: Training LLMs through reinforcement learning algorithms (such as PPO and GRPO) to continuously optimize and improve during the reasoning process.
- **Decoding Strategies**: Using PRM to evaluate the accuracy of each solution step during testing and selecting the best answer through various strategies (such as majority voting, maximum reward, etc.).

### Experimental Results

- **MATH Dataset**: Experiments on the MATH dataset show that combining process reward models and guided search methods can significantly improve test-time reasoning performance, with a relative improvement of approximately 10%.

### Conclusion

OpenR is an open-source framework that significantly enhances the reasoning capabilities of large language models by integrating test-time computation and process supervision. The framework provides researchers with an open platform, promoting further development in the field of LLM reasoning.
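
To make the decoding strategies named under Methodology more concrete, here is a minimal sketch of how several sampled solutions might be reduced to one final answer using per-step PRM scores. It is an illustration under stated assumptions, not OpenR's actual code: the `Candidate` container, the function names, the choice of the minimum step score as the solution-level score, and the demo numbers are all hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Candidate:
    """One sampled solution (hypothetical container, not an OpenR type)."""
    answer: str                # final answer extracted from the solution text
    step_scores: List[float]   # assumed PRM scores in [0, 1], one per reasoning step


def solution_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step scores into one solution-level score.

    Assumption: use the minimum, so a single poorly rated step
    lowers the score of the whole solution.
    """
    return min(step_scores)


def majority_vote(candidates: Sequence[Candidate]) -> str:
    """Return the most frequent final answer, ignoring PRM scores."""
    counts = Counter(c.answer for c in candidates)
    return counts.most_common(1)[0][0]


def prm_max(candidates: Sequence[Candidate]) -> str:
    """Return the answer of the single highest-scoring solution."""
    best = max(candidates, key=lambda c: solution_score(c.step_scores))
    return best.answer


def prm_weighted_vote(candidates: Sequence[Candidate]) -> str:
    """Sum solution scores per distinct answer and return the largest total."""
    totals: Dict[str, float] = defaultdict(float)
    for c in candidates:
        totals[c.answer] += solution_score(c.step_scores)
    return max(totals, key=lambda answer: totals[answer])


# Made-up scores for six sampled solutions to one problem.
DEMO = [
    Candidate("41", [0.5, 0.3]),
    Candidate("41", [0.6, 0.4]),
    Candidate("41", [0.7, 0.2]),
    Candidate("42", [0.9, 0.8]),
    Candidate("42", [0.7, 0.6]),
    Candidate("40", [0.95, 0.97]),
]
```

On `DEMO`, the three strategies disagree: `majority_vote` picks `"41"` (three votes), `prm_max` picks `"40"` (the single best-scored solution), and `prm_weighted_vote` picks `"42"` (largest summed score). The example is constructed to show that the choice of strategy can change the final answer, which is why such strategies are usually compared empirically.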

OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search

Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

REL: Working out is all you need

Mastering the Task of Open Information Extraction with Large Language Models and Consistent Reasoning Environment

OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Democratizing Reasoning Ability: Tailored Learning from Large Language Model

InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning

Reasoning with Language Model is Planning with World Model

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework

K-Level Reasoning: Establishing Higher Order Beliefs in Large Language Models for Strategic Reasoning

Reasoning in Conversation: Solving Subjective Tasks through Dialogue Simulation for Large Language Models

Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning