Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin

2024-10-31

Abstract: The recent development of chain-of-thought (CoT) decoding has enabled large language models (LLMs) to generate explicit logical reasoning paths for complex problem-solving. However, research indicates that these paths are not always deliberate and optimal. The tree-of-thought (ToT) method employs tree-searching to extensively explore the reasoning space and find better reasoning paths that CoT decoding might overlook. This deliberation, however, comes at the cost of significantly increased inference complexity. In this work, we demonstrate that fine-tuning LLMs leveraging the search tree constructed by ToT allows CoT to achieve similar or better performance, thereby avoiding the substantial inference burden. This is achieved through Chain of Preference Optimization (CPO), where LLMs are fine-tuned to align each step of the CoT reasoning paths with those of ToT using the inherent preference information in the tree-search process. Extensive experimental results show that CPO significantly improves LLM performance in solving a variety of complex problems, including question answering, fact verification, and arithmetic reasoning, demonstrating its effectiveness. Our code is available at https://github.com/sail-sg/CPO.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper asks how to improve the chain-of-thought (CoT) reasoning ability of large language models (LLMs) without giving up inference efficiency. The tree-of-thought (ToT) method finds better reasoning paths through tree search and so improves reasoning quality, but the search significantly increases inference cost, which limits its practical use. The authors therefore propose Chain of Preference Optimization (CPO), which improves CoT by using the preference information generated during the ToT search. The goal is for LLMs decoding with plain CoT to match or exceed ToT on a variety of complex problems while avoiding ToT's heavy inference burden. At its core, CPO constructs preference pairs from the ToT search tree and trains the LLM on them with the direct preference optimization (DPO) algorithm, so that each step of its reasoning moves closer to the paths that ToT selects. The experiments show that CPO significantly improves LLM performance on question answering, fact verification, and arithmetic reasoning while keeping inference latency low: it is on average 57.5 times faster than ToT and even outperforms ToT on some tasks. This indicates that CPO can improve the reasoning ability of LLMs without sacrificing efficiency.
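The pair construction and training objective described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The `Node` class, the `on_best_path` flag, and the `build_pairs` helper are hypothetical names, and the pairing rule (a thought on the selected path is preferred over its siblings that are not) is an assumed reading of "preference information in the tree-search process". The `dpo_loss` function is the standard DPO loss for a single pair, computed from sequence log-probabilities that are assumed to be supplied by the caller.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One thought in a hypothetical ToT search tree."""
    thought: str
    children: list = field(default_factory=list)
    # Assumed flag: True for thoughts on the path the search finally selected.
    on_best_path: bool = False


def build_pairs(node, prefix=()):
    """Collect (prefix, chosen, rejected) triples for every step of the tree.

    At each step, a thought on the selected path is paired with each
    sibling that is not on it; the recursion follows the selected path only.
    """
    pairs = []
    chosen = [c for c in node.children if c.on_best_path]
    rejected = [c for c in node.children if not c.on_best_path]
    for good in chosen:
        for bad in rejected:
            pairs.append((prefix, good.thought, bad.thought))
        pairs.extend(build_pairs(good, prefix + (good.thought,)))
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The inputs are log-probabilities of the chosen and rejected thoughts
    under the policy being trained and under a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Toy tree: two reasoning steps, with one selected thought at each step.
tree = Node("question", [
    Node("step 1, option A", [
        Node("step 2, option A1", on_best_path=True),
        Node("step 2, option A2"),
    ], on_best_path=True),
    Node("step 1, option B"),
])

for prefix, chosen, rejected in build_pairs(tree):
    print(prefix, "| prefer:", chosen, "| over:", rejected)
```

Because every step of the selected path contributes its own pairs, the supervision is per step rather than per full reasoning path, which matches the description of aligning "each step" of CoT with ToT. At inference time the fine-tuned model decodes with ordinary CoT, so no tree search is needed.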

DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding

ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting

TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Supervised Chain of Thought

Generating Chain-of-Thoughts with a Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought

Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

CSCE: Boosting LLM Reasoning by Simultaneous Enhancing of Casual Significance and Consistency

Constrained Reasoning Chains for Enhancing Theory-of-Mind in Large Language Models

Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs

Chain-of-Thought Reasoning Without Prompting

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

How Likely Do LLMs with CoT Mimic Human Reasoning?

Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Faithful Logical Reasoning via Symbolic Chain-of-Thought

Nash CoT: Multi-Path Inference with Preference Equilibrium