Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, Zhaopeng Tu

2024-12-02

Abstract: Large Language Models (LLMs) have exhibited remarkable performance on reasoning tasks. They utilize autoregressive token generation to construct reasoning trajectories, enabling the development of a coherent chain of thought. In this work, we explore the impact of individual tokens on the final outcomes of reasoning tasks. We identify the existence of "critical tokens" that lead to incorrect reasoning trajectories in LLMs. Specifically, we find that LLMs tend to produce positive outcomes when forced to decode other tokens instead of critical tokens. Motivated by this observation, we propose a novel approach, cDPO, designed to automatically recognize critical tokens and apply token-level rewards to them during the alignment process. Specifically, we develop a contrastive estimation approach to automatically identify critical tokens. It is achieved by comparing the generation likelihood of positive and negative models. To achieve this, we separately fine-tune the positive and negative models on various reasoning trajectories, so that they are capable of identifying critical tokens within incorrect trajectories that contribute to erroneous outcomes. Moreover, to further align the model with the critical token information during the alignment process, we extend the conventional DPO algorithms to token-level DPO and utilize the differential likelihood from the aforementioned positive and negative models as an importance weight for token-level DPO learning. Experimental results on GSM8K and MATH500 benchmarks with two widely used models, Llama-3 (8B and 70B) and deepseek-math (7B), demonstrate the effectiveness of the proposed approach cDPO.
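
The contrastive estimation step described in the abstract can be illustrated with a minimal sketch. It assumes the per-token score is the difference between the log-likelihoods assigned by the positive and negative models, and that the lowest-scoring token in an incorrect trajectory is the most likely critical token. The abstract only says the two likelihoods are compared, so this scoring rule, the function names, and the toy numbers below are all illustrative assumptions rather than the paper's exact formulation.

```python
def contrastive_token_scores(pos_logprobs, neg_logprobs):
    """Per-token contrastive score (assumed form): log p_pos(y_t) - log p_neg(y_t).

    A strongly negative score means the negative model finds the token far
    more likely than the positive model does.
    """
    if len(pos_logprobs) != len(neg_logprobs):
        raise ValueError("both models must score the same token sequence")
    return [p - n for p, n in zip(pos_logprobs, neg_logprobs)]


def most_critical_index(scores):
    """Index of the token with the lowest contrastive score."""
    return min(range(len(scores)), key=scores.__getitem__)


# Toy example: made-up tokens and made-up per-token log-probabilities.
tokens = ["She", "owed", "5", "dollars"]
pos_lp = [-0.2, -3.0, -0.5, -0.3]  # hypothetical positive-model log-probs
neg_lp = [-0.3, -0.4, -0.6, -0.3]  # hypothetical negative-model log-probs

scores = contrastive_token_scores(pos_lp, neg_lp)
idx = most_critical_index(scores)
print(tokens[idx], scores[idx])
```

In practice the log-probabilities would come from two fine-tuned language models scoring the same incorrect trajectory; here they are hard-coded so the sketch runs without any model.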

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper addresses a key problem that large language models (LLMs) encounter in reasoning tasks: certain tokens generated by the model (called "critical tokens") steer it onto incorrect reasoning paths. The authors found that avoiding these critical tokens significantly improves the model's reasoning accuracy. To this end, they propose a new method, cDPO (Contrastive Estimation-enhanced Direct Preference Optimization), which aims to automatically identify critical tokens and apply token-level rewards to them, improving the model's performance on reasoning tasks. Specifically, the main contributions of the paper include:

1. **Identifying critical tokens**: The authors identify "critical tokens" within reasoning trajectories, which are the root causes of errors in reasoning tasks.
2. **Proposing the cDPO algorithm**: This is a token-level preference optimization algorithm that automatically identifies "critical tokens" and applies token-level rewards to them.
3. **Experimentally demonstrating effectiveness**: The experimental results show that the proposed cDPO method outperforms the existing baseline strategies on multiple benchmarks with statistical significance (p < 0.005).

Through these contributions, the paper provides new ideas and methods for improving the accuracy and robustness of large language models in complex reasoning tasks.
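
To make the idea of a token-level preference objective more concrete, here is a minimal sketch of a DPO-style loss in which each token of the rejected (incorrect) trajectory carries its own weight. The text above does not give the cDPO objective, so the loss form, the function name `token_weighted_dpo_loss`, the `beta` value, and the way weights enter the sum are assumptions for illustration only. How the weights would be derived from the positive and negative models' likelihoods is likewise left unspecified here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def token_weighted_dpo_loss(chosen_policy, chosen_ref,
                            rejected_policy, rejected_ref,
                            rejected_weights, beta=0.1):
    """DPO-style loss with per-token weights on the rejected trajectory.

    All arguments except beta are lists of per-token log-probabilities
    (or weights). This is an assumed form, not the paper's exact objective.
    """
    # Sequence-level log-ratio for the chosen (correct) trajectory.
    chosen_margin = sum(p - r for p, r in zip(chosen_policy, chosen_ref))
    # Token-weighted log-ratio for the rejected (incorrect) trajectory.
    rejected_margin = sum(
        w * (p - r)
        for w, p, r in zip(rejected_weights, rejected_policy, rejected_ref)
    )
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)) equals softplus(-logit).
    return softplus(-logit)


# Toy numbers: the policy currently prefers the rejected tokens more than
# the reference model does, so those tokens should be penalized.
chosen_pol, chosen_ref = [-1.0, -1.0], [-1.0, -1.0]
rej_pol, rej_ref = [-1.0, -1.0], [-2.0, -2.0]

loss_uniform = token_weighted_dpo_loss(chosen_pol, chosen_ref, rej_pol, rej_ref, [1.0, 1.0])
loss_focused = token_weighted_dpo_loss(chosen_pol, chosen_ref, rej_pol, rej_ref, [0.0, 1.0])
print(loss_uniform, loss_focused)
```

The design point this sketch tries to convey is that a weight of zero removes a token from the penalty entirely, while a large weight concentrates the penalty on tokens judged critical.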

A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

Extending Token Computation for LLM Reasoning

Token-level Direct Preference Optimization

Enhancing Mathematical Reasoning in LLMs by Stepwise Correction

Token-Supervised Value Models for Enhancing Mathematical Reasoning Capabilities of Large Language Models

Improving Self Consistency in LLMs through Probabilistic Tokenization

Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model

LLM2: Let Large Language Models Harness System 2 Reasoning

Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding

DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models

Token-Budget-Aware LLM Reasoning

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning

Making Large Language Models Better Reasoners with Alignment

Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-thoughts Critic

Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning

Keypoint-based Progressive Chain-of-Thought Distillation for LLMs