OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models

Junda Wu, Xintong Li, Ruoyu Wang, Yu Xia, Yuxin Xiong, Jianing Wang, Tong Yu, Xiang Chen, Branislav Kveton, Lina Yao, Jingbo Shang, Julian McAuley
2024-10-31
Abstract: Offline evaluation of LLMs is crucial in understanding their capacities, though current methods remain underexplored in existing research. In this work, we focus on the offline evaluation of the chain-of-thought capabilities and show how to optimize LLMs based on the proposed evaluation method. To enable offline feedback with rich knowledge and reasoning paths, we use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. Due to the heterogeneity between LLM reasoning and KG structures, direct interaction and feedback from KGs on LLM behavior are challenging, as they require accurate entity linking and grounding of LLM-generated chains of thought in the KG. To address the above challenge, we propose an offline chain-of-thought evaluation framework, OCEAN, which models chain-of-thought reasoning in LLMs as an MDP and evaluates the policy's alignment with KG preference modeling. To overcome the reasoning heterogeneity and grounding problems, we leverage on-policy KG exploration and RL to model a KG policy that generates token-level likelihood distributions for LLM-generated chain-of-thought reasoning paths, simulating KG reasoning preference. Then we incorporate the knowledge-graph feedback on the validity and alignment of the generated reasoning paths into inverse propensity scores and propose the KG-IPS estimator. Theoretically, we prove the unbiasedness of the proposed KG-IPS estimator and provide a lower bound on its variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance chain-of-thought alignment. Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs' general abilities in downstream tasks or their internal knowledge.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper addresses several key challenges that large language models (LLMs) face when generating multi-step reasoning chains (chain-of-thought, CoT):

1. **Limitations of Offline Evaluation**: Existing offline evaluation methods have mainly been applied in areas such as recommendation systems and healthcare, while offline evaluation of LLMs' multi-step reasoning abilities has been explored far less. Online experiments are costly, risky, and impractical, so effective offline evaluation methods are needed to understand the capabilities of LLMs.
2. **Limitations of Human Feedback**: Human feedback can align the behavior of LLMs with human preferences, but it is costly to collect. For multi-step reasoning, the diversity and complexity of the required background knowledge mean that human feedback may be neither comprehensive nor accurate enough.
3. **Heterogeneity between Knowledge Graphs and LLM Reasoning**: The structure of knowledge graphs differs significantly from the reasoning processes of LLMs, so it is challenging to obtain feedback directly from knowledge graphs and apply it to LLM optimization. Doing so requires accurate entity linking and grounding of LLM-generated reasoning paths in the knowledge graph.

To address these problems, the paper proposes a framework, OCEAN (Offline Chain-of-thought Evaluation and Alignment in Large Language Models), which relies on the following methods:

- **Markov Decision Process Modeling**: Model the multi-step reasoning process of LLMs as a Markov decision process (MDP), and use knowledge graph exploration to learn a knowledge graph policy that generates token-level likelihood distributions, simulating the reasoning preferences of knowledge graphs.
- **Knowledge Graph Feedback**: Use knowledge graphs to provide feedback on the validity and alignment of the generated reasoning paths, and incorporate this feedback into inverse propensity scores (IPS), yielding the KG-IPS estimator.
- **Unbiased Estimation and Variance Analysis**: Theoretically prove the unbiasedness of the proposed KG-IPS estimator, and provide a lower bound on its variance.
- **Direct Policy Optimization**: Directly optimize the policy of LLMs by maximizing the estimated policy value, improving their chain-of-thought alignment without affecting their general abilities in downstream tasks or their internal knowledge.

Through these methods, OCEAN aims to evaluate and optimize the multi-step reasoning abilities of LLMs so that their chain-of-thought reasoning paths are better aligned with the reasoning preferences of knowledge graphs.
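The inverse propensity scoring idea that the estimator builds on can be illustrated with a minimal sketch. This is a generic per-step IPS estimator for logged trajectories, not the paper's KG-IPS estimator: the `Step` layout, the `ips_value` function name, and the use of a scalar per-step reward are assumptions made purely for illustration, and the knowledge graph policy and entity grounding are not modeled here.

```python
import math
from typing import Sequence, Tuple

# Hypothetical logged step (an assumption for this sketch):
# (log-prob under the target policy, log-prob under the behavior policy, reward)
Step = Tuple[float, float, float]


def ips_value(trajectories: Sequence[Sequence[Step]]) -> float:
    """Generic per-step IPS estimate of the target policy's value.

    Each step's reward is weighted by the cumulative importance ratio
    pi_target / pi_behavior over all steps up to and including it,
    and the weighted returns are averaged over the logged trajectories.
    """
    if not trajectories:
        raise ValueError("need at least one logged trajectory")
    total = 0.0
    for trajectory in trajectories:
        log_ratio = 0.0  # running sum of log importance ratios
        for logp_target, logp_behavior, reward in trajectory:
            log_ratio += logp_target - logp_behavior
            total += math.exp(log_ratio) * reward
    return total / len(trajectories)


# When the target and behavior policies agree, every weight is 1 and the
# estimate reduces to the average observed return.
same_policy_logs = [
    [(math.log(0.5), math.log(0.5), 1.0), (math.log(0.25), math.log(0.25), 2.0)],
    [(math.log(0.1), math.log(0.1), 0.0)],
]
same_policy_value = ips_value(same_policy_logs)

# When the target policy is twice as likely as the behavior policy to take
# the logged action, that step's reward is up-weighted by a factor of 2.
reweighted_value = ips_value([[(math.log(0.5), math.log(0.25), 1.0)]])
```

Working in log space keeps the cumulative ratio numerically stable for long reasoning chains. In an off-policy optimization setting, an estimate of this kind would serve as the objective to maximize with respect to the target policy's parameters.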