GFlowNet Fine-tuning for Diverse Correct Solutions in Mathematical Reasoning Tasks

Ryoichi Takase, Masaya Tsunokake, Yuta Tsuchiya, Shota Inuzuka
2024-10-26
Abstract: Mathematical reasoning problems are among the most challenging, as they typically require an understanding of fundamental laws to solve. The laws are universal, but the derivation of the final answer changes depending on how a problem is approached. When training large language models (LLMs), learning the capability of generating such multiple solutions is essential to accelerate their use in mathematical education. To this end, we train LLMs using generative flow network (GFlowNet). Different from reward-maximizing reinforcement learning (RL), GFlowNet fine-tuning seeks to find diverse solutions by training the LLM whose distribution is proportional to a reward function. In numerical experiments, we evaluate GFlowNet fine-tuning and reward-maximizing RL in terms of accuracy and diversity. The results show that GFlowNet fine-tuning derives correct final answers from diverse intermediate reasoning steps, indicating the improvement of the capability of alternative solution generation.
Machine Learning
What problem does this paper attempt to address?
This paper addresses how to train large language models (LLMs) to generate diverse correct solutions in mathematical reasoning tasks. Specifically, the authors focus on two goals:

1. **Increasing the diversity of solutions**: In mathematics education, encouraging students to find multiple solutions from different perspectives helps improve their problem-solving ability and creativity. The authors therefore aim to train LLMs that generate diverse solutions rather than a single correct answer.
2. **Maintaining the correctness of the final answer**: Although the solutions can be diverse, the final answer must be correct. The model must still reach the correct final result while generating different solution paths.

To achieve this, the authors fine-tune LLMs with a generative flow network (GFlowNet). Unlike traditional reinforcement learning (RL) methods based on reward maximization, GFlowNet fine-tuning trains the model so that its output distribution is proportional to the reward function, which lets it generate diverse high-reward sequences. This gives GFlowNet a potential advantage in mathematical reasoning tasks: it can produce diverse solutions while keeping the final answer correct.

### Research questions

The paper explores two research questions:

- **Can GFlowNet generate diverse solutions in mathematical reasoning tasks?**
- **What are the differences in accuracy between GFlowNet and the RL method based on reward maximization?**

In experiments, the authors evaluate GFlowNet fine-tuning in terms of accuracy and diversity and compare it with several reward-maximizing RL methods. The results show that GFlowNet fine-tuning generates more diverse solutions while keeping the final answer correct.
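
The contrast drawn above between reward-proportional sampling and reward maximization can be illustrated with a toy example. The sketch below is not the paper's training code. The candidate solution names, the reward values, and all helper functions are hypothetical, and the reward-maximizing policy is idealized as a point mass on a single best candidate (real RL fine-tuning with regularization usually collapses less sharply).

```python
import math
import random
from collections import Counter

# Hypothetical rewards for four candidate solution paths to one problem.
# Three reach the correct final answer (reward 1.0); one does not (0.01).
REWARDS = {
    "algebraic": 1.0,
    "geometric": 1.0,
    "guess_and_check": 1.0,
    "wrong_answer": 0.01,
}


def reward_proportional(rewards):
    """Target distribution for GFlowNet-style training: p(x) = R(x) / sum R."""
    total = sum(rewards.values())
    return {x: r / total for x, r in rewards.items()}


def reward_maximizing(rewards):
    """Idealized greedy policy: all probability on one highest-reward item."""
    best = max(rewards, key=rewards.get)  # ties resolve to the first key
    return {x: (1.0 if x == best else 0.0) for x in rewards}


def entropy(dist):
    """Shannon entropy in nats, used here as a simple diversity proxy."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def sample(dist, n, seed=0):
    """Draw n candidates from dist and count how often each appears."""
    rng = random.Random(seed)
    names = list(dist)
    weights = [dist[x] for x in names]
    return Counter(rng.choices(names, weights=weights, k=n))


if __name__ == "__main__":
    for label, policy in [
        ("reward-proportional", reward_proportional(REWARDS)),
        ("reward-maximizing", reward_maximizing(REWARDS)),
    ]:
        print(label, "entropy:", round(entropy(policy), 3))
        print(label, "samples:", dict(sample(policy, 1000)))
```

Under these assumptions, the reward-proportional policy spreads its samples across all three correct paths and rarely picks the wrong one, while the idealized reward-maximizing policy returns the same path every time. Entropy is only one possible diversity proxy and is not claimed to be the metric used in the paper.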