Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
2024-10-31
Abstract: The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree-searching to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs leveraging the search tree constructed by ToT allows CoT to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of the CoT reasoning paths with those of ToT using the inherent preference information in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at https://github.com/sail-sg/CPO.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the chain-of-thought (CoT) reasoning ability of large language models (LLMs) without giving up inference efficiency. The tree-of-thought (ToT) method finds better reasoning paths through tree search and so improves reasoning quality, but it significantly increases computational complexity at inference time, which limits its practical use. The authors therefore propose Chain of Preference Optimization (CPO), which improves CoT by using the preference information generated during the ToT search, so that LLMs achieve performance similar to or better than ToT on a variety of complex problems while avoiding ToT's heavy inference burden. The core of CPO is to construct preference pairs from the ToT search tree and to train the LLM on them with the direct preference optimization (DPO) algorithm, which moves the model's step-by-step reasoning closer to the paths that ToT discovers. The experimental results show that CPO significantly improves LLM performance on question answering, fact verification, and arithmetic reasoning while keeping inference latency low: it is on average 57.5 times faster than ToT and even outperforms ToT on some tasks. This indicates that CPO can improve the reasoning ability of LLMs without sacrificing efficiency.
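
The pair-construction and training objective described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ThoughtNode` class, the `selected` flag, and the rule "the thought kept on the final path is preferred over its unselected siblings" are hypothetical choices made here for illustration, since the summary only says that pairs are built from the ToT search tree. The loss function is the standard per-pair DPO objective, computed from sequence log-probabilities that a real setup would obtain from the policy and a frozen reference model.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ThoughtNode:
    """One candidate reasoning step in a hypothetical ToT search tree."""
    text: str
    selected: bool = False  # assumption: True if the search kept this thought on the final path
    children: List["ThoughtNode"] = field(default_factory=list)


def build_preference_pairs(root: ThoughtNode) -> List[Tuple[str, str, str]]:
    """Walk the selected path and emit (prefix, preferred, dispreferred) triples.

    Assumption: at each depth, the selected thought is preferred over every
    sibling that the search did not keep.
    """
    pairs: List[Tuple[str, str, str]] = []
    prefix = root.text
    node = root
    while node.children:
        chosen = next((c for c in node.children if c.selected), None)
        if chosen is None:
            break  # the search stopped here; no further preference signal
        for sibling in node.children:
            if sibling is not chosen:
                pairs.append((prefix, chosen.text, sibling.text))
        prefix = prefix + "\n" + chosen.text  # the next step is conditioned on the chosen thought
        node = chosen
    return pairs


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin).

    logp_w / logp_l are policy log-probabilities of the preferred and
    dispreferred thought; ref_* are the same under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Toy tree: two candidate first steps, three candidate second steps.
demo_root = ThoughtNode("Q", children=[
    ThoughtNode("A", selected=True, children=[
        ThoughtNode("C"),
        ThoughtNode("D", selected=True),
        ThoughtNode("E"),
    ]),
    ThoughtNode("B"),
])
demo_pairs = build_preference_pairs(demo_root)
```

In this toy tree, each step on the selected path contributes one pair per rejected sibling, so the supervision is per step rather than per full reasoning path. In practice the log-probabilities would come from the LLM being fine-tuned and a frozen copy of it, and an existing DPO trainer could consume the resulting triples as prompt, chosen, and rejected fields.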