Abstract: While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much of the past work has depended on datasets extensively annotated by human experts. However, this reliance on fully supervised annotations poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model using a small collection of annotated questions. It then iteratively improves LLMs by learning from the differences between the responses of the SFT and unfinetuned models on unlabeled questions. Our approach is efficient and does not rely heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically only include golden-reference answers or rationales. Therefore, we present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore using less supervised data to boost LLMs' inference capabilities. Our experiments underscore the significance of PuzzleBen, as well as the effectiveness of our methodology as a promising direction for future endeavors. Our dataset and code will be published soon on Anonymity Link.

What problem does this paper attempt to address?

The main aim of this paper is to address the dependency of large language models (LLMs) on a substantial amount of manually annotated data when handling complex reasoning tasks. This dependency not only presents scalability challenges but also increases time and labor costs. To tackle these issues, the authors propose a method called "self-reinforcement."

### Research Objectives

1. **Reduce the need for human supervision**: Improve the reasoning ability of LLMs through a weakly supervised learning approach, thereby reducing reliance on large-scale manually annotated datasets.
2. **Iteratively improve the model**: Propose an iterative weak-to-strong learning framework that starts with a small amount of manually annotated data and gradually uses unannotated data to enhance the model's reasoning ability.
3. **Create a new dataset**: To validate the effectiveness of this approach, the authors constructed a dataset called PuzzleBen, which includes complex problems such as puzzles and brainteasers, and contains a portion of unannotated questions to support weakly supervised learning.

### Method Overview

- **Initial Modeling**: First, the base model is trained on a small seed dataset through supervised fine-tuning (SFT).
- **Self-Filtering**: Next, both the fine-tuned model and the original model generate answers to unannotated questions, and high-quality answers are selected by comparing the two models' responses.
- **Self-Reinforcement**: Finally, Direct Preference Optimization (DPO) is used to learn from the selected answers, and the process is repeated to improve the model iteratively.

### Dataset Introduction

- **PuzzleBen**: According to the abstract, the dataset comprises 25,147 questions with answers and human-written rationales, and includes 10,000 unannotated questions. The questions cover task types such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning. Annotated questions are accompanied by a manually written explanation or rationale.
- **Diversity and Complexity**: The questions are relatively long on average and the explanations are detailed, which distinguishes PuzzleBen from existing reasoning benchmarks.

### Experimental Results

- **Baseline Model Performance**: The paper reports the performance of different models on PuzzleBen under standard prompting and zero-shot chain-of-thought (CoT) prompting.
- **Role of Human Explanations**: Experimental results show that fine-tuning on explanations from PuzzleBen can significantly improve model performance.
- **Effect of Self-Reinforcement**: The self-reinforcement method leads to a notable performance improvement, especially after the second iteration.

In summary, the goal of this paper is to explore how to enhance the reasoning ability of LLMs with minimal human intervention and to experimentally validate the effectiveness of the proposed method.
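
The self-filtering and self-reinforcement steps described in the Method Overview can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state the exact filtering criterion, so the rule used here (treat the fine-tuned model's response as preferred and skip questions where both models answer identically) is an assumption, and the names `PreferencePair`, `build_preference_pairs`, and `dpo_loss` are hypothetical. The loss function is the standard per-pair DPO objective, computed from log-probabilities that a real training loop would obtain from the policy and reference models.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    question: str
    chosen: str    # response from the fine-tuned (SFT) model
    rejected: str  # response from the original (unfinetuned) model


def build_preference_pairs(
    questions: List[str],
    sft_answer: Callable[[str], str],
    base_answer: Callable[[str], str],
) -> List[PreferencePair]:
    """Pair SFT and base responses on unlabeled questions.

    Assumption: identical responses carry no preference signal and are skipped.
    The paper's actual filter may differ.
    """
    pairs = []
    for q in questions:
        strong, weak = sft_answer(q), base_answer(q)
        if strong.strip() != weak.strip():
            pairs.append(PreferencePair(q, strong, weak))
    return pairs


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = (policy - reference) log-ratio on the chosen response
             minus the same log-ratio on the rejected response.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) = softplus(-x), written to avoid overflow in exp().
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

In an iterative setup, the model updated with this loss would take the place of the SFT model in the next round of `build_preference_pairs`, which is one way to read the "iteratively improving the model" step; how the paper chooses the reference model in each round is not stated in this summary.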

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Weak-to-Strong Reasoning

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards

Learning to Reason via Self-Iterative Process Feedback for Small Language Models

ReFT: Reasoning with Reinforced Fine-Tuning

Large Language Models Can Self-Improve in Long-context Reasoning

Improving Language Model Reasoning with Self-motivated Learning

Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

An Empirical Study of Data Ability Boundary in LLMs' Math Reasoning

MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

On Memorization of Large Language Models in Logical Reasoning

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

PuzzleBench: Can LLMs Solve Challenging First-Order Combinatorial Reasoning Problems?

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Go Beyond The Obvious: Probing the gap of INFORMAL reasoning ability between Humanity and LLMs by Detective Reasoning Puzzle Benchmark

Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data

Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs