Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu
2024-12-02
Abstract: Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize critical tokens and apply token-level rewards to them during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens. It is achieved by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories; consequently, they are capable of identifying critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithms to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative models as an important weight for token-level DPO learning. Experimental results on the GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach cDPO.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses a key problem that large language models (LLMs) encounter in reasoning tasks: certain tokens generated by the model (called "critical tokens") lead to incorrect reasoning trajectories. The authors find that the model tends to reach correct outcomes when it is forced to decode other tokens in place of these critical tokens. To this end, they propose a new method, cDPO (Contrastive Estimation-enhanced Direct Preference Optimization), which aims to automatically identify critical tokens and apply token-level rewards to them, improving the model's performance on reasoning tasks. Specifically, the main contributions of the paper include:
1. **Identifying critical tokens**: The authors show that reasoning trajectories contain "critical tokens", which are the root causes of errors in reasoning tasks.
2. **Proposing the cDPO algorithm**: This is a token-level preference optimization algorithm that automatically identifies critical tokens and assigns token-level rewards to them.
3. **Experimentally demonstrating effectiveness**: The experimental results show that cDPO outperforms existing baseline strategies on multiple benchmarks with statistical significance (p < 0.005).
Through these contributions, the paper offers a new approach to improving the accuracy and robustness of large language models in complex reasoning tasks.
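The contrastive estimation idea described in the abstract (comparing the generation likelihood of a positive and a negative model, token by token) can be illustrated with a minimal Python sketch. This is an assumption-laden toy, not the authors' implementation: the score used here, the plain difference of per-token log-probabilities, is one simple way to compare the two likelihoods, and the function names `contrastive_scores` and `most_critical_token`, the example tokens, and all probabilities are hypothetical values made up for illustration.

```python
import math


def contrastive_scores(pos_logprobs, neg_logprobs):
    """Per-token score: log p_pos(token) - log p_neg(token).

    Assumption: a plain log-likelihood difference stands in for the
    comparison between the positive and negative models. A strongly
    negative score means the negative model favors the token far more
    than the positive model does.
    """
    if len(pos_logprobs) != len(neg_logprobs):
        raise ValueError("log-prob lists must have the same length")
    return [p - n for p, n in zip(pos_logprobs, neg_logprobs)]


def most_critical_token(tokens, scores):
    """Return (index, token) of the token with the lowest score."""
    idx = min(range(len(scores)), key=scores.__getitem__)
    return idx, tokens[idx]


# Hypothetical tokens of one incorrect trajectory, with made-up
# probabilities assigned by the positive and negative models.
tokens = ["He", "owed", "5", "dollars"]
pos_logprobs = [math.log(0.90), math.log(0.05), math.log(0.80), math.log(0.90)]
neg_logprobs = [math.log(0.90), math.log(0.60), math.log(0.70), math.log(0.90)]

scores = contrastive_scores(pos_logprobs, neg_logprobs)
idx, tok = most_critical_token(tokens, scores)
print(idx, tok, [round(s, 3) for s in scores])
```

In this toy setup the second token receives the lowest score because the negative model assigns it much higher probability than the positive model does, so it is flagged as the candidate critical token. In the paper's setting such per-token differences would additionally serve as weights in token-level DPO; that weighting is not sketched here.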