Abstract: Offline evaluation of LLMs is crucial in understanding their capacities, though current methods remain underexplored in existing research. In this work, we focus on the offline evaluation of the chain-of-thought capabilities and show how to optimize LLMs based on the proposed evaluation method. To enable offline feedback with rich knowledge and reasoning paths, we use knowledge graphs (e.g., Wikidata5m) to provide feedback on the generated chain of thoughts. Due to the heterogeneity between LLM reasoning and KG structures, direct interaction and feedback from KGs on LLM behavior are challenging, as they require accurate entity linking and grounding of LLM-generated chains of thought in the KG. To address the above challenge, we propose an offline chain-of-thought evaluation framework, OCEAN, which models chain-of-thought reasoning in LLMs as an MDP and evaluates the policy's alignment with KG preference modeling. To overcome the reasoning heterogeneity and grounding problems, we leverage on-policy KG exploration and RL to model a KG policy that generates token-level likelihood distributions for LLM-generated chain-of-thought reasoning paths, simulating KG reasoning preference. Then we incorporate the knowledge-graph feedback on the validity and alignment of the generated reasoning paths into inverse propensity scores and propose the KG-IPS estimator. Theoretically, we prove the unbiasedness of the proposed KG-IPS estimator and provide a lower bound on its variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance chain-of-thought alignment. Our empirical study shows that OCEAN can be efficiently optimized for generating chain-of-thought reasoning paths with higher estimated values without affecting LLMs' general abilities in downstream tasks or their internal knowledge.
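
For readers unfamiliar with inverse propensity scoring, the standard (non-KG) IPS identity that underlies unbiasedness claims of this kind can be sketched as follows. The symbols are generic assumptions rather than the paper's notation: $\mu$ is the behavior policy that produced the logged reasoning paths, $\pi_\theta$ is the policy being evaluated, $\tau$ is a reasoning path, and $r(\tau)$ is a feedback score. This is not the paper's KG-IPS estimator, only the textbook form it builds on.

$$
\hat{V}_{\mathrm{IPS}}(\pi_\theta) = \frac{1}{N} \sum_{i=1}^{N} \frac{\pi_\theta(\tau_i)}{\mu(\tau_i)}\, r(\tau_i), \qquad \tau_i \sim \mu
$$

$$
\mathbb{E}_{\tau \sim \mu}\!\left[ \frac{\pi_\theta(\tau)}{\mu(\tau)}\, r(\tau) \right]
= \sum_{\tau} \mu(\tau)\, \frac{\pi_\theta(\tau)}{\mu(\tau)}\, r(\tau)
= \sum_{\tau} \pi_\theta(\tau)\, r(\tau)
= V(\pi_\theta)
$$

The second line holds provided $\mu(\tau) > 0$ wherever $\pi_\theta(\tau)\, r(\tau) \neq 0$, which is the usual support condition for importance weighting.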

What problem does this paper attempt to address?

This paper addresses several key challenges that arise when large language models (LLMs) generate multi-step reasoning chains (chain-of-thought, CoT):

1. **Limitations of Offline Evaluation**: Current offline evaluation methods mainly focus on areas such as recommendation systems and healthcare, while the evaluation of LLMs' multi-step reasoning abilities has been comparatively underexplored. Online experiments are costly, risky, and impractical, so effective offline evaluation methods are needed to understand the capabilities of LLMs.
2. **Limitations of Human Feedback**: Human feedback can align the behavior of LLMs with human preferences, but it is costly to collect. For multi-step reasoning, the diversity and complexity of the required background knowledge mean that human feedback may not be comprehensive or accurate enough.
3. **Heterogeneity between Knowledge Graphs and LLM Reasoning**: The structure of knowledge graphs differs significantly from the reasoning processes of LLMs, so it is challenging to obtain feedback directly from knowledge graphs and apply it to the optimization of LLMs. Doing so requires solving entity linking and the grounding of reasoning paths in the knowledge graph.

To address these problems, the paper proposes a new framework, OCEAN (Offline Chain-of-Thought Evaluation and Alignment in Large Language Models via Knowledge Graph Exploration), which combines the following components:

- **Markov Decision Process Modeling**: Model the multi-step reasoning process of LLMs as a Markov decision process (MDP), and generate token-level likelihood distributions through knowledge graph exploration to simulate the reasoning preferences of knowledge graphs.
- **Knowledge Graph Feedback**: Use knowledge graphs to provide feedback on the validity and alignment of the generated reasoning paths, and integrate this feedback into the model's optimization process through the inverse propensity score (IPS) estimator.
- **Unbiased Estimation and Variance Analysis**: Theoretically prove the unbiasedness of the proposed KG-IPS estimator, and provide a lower bound on its variance to support the reliability of the estimation.
- **Direct Policy Optimization**: Directly optimize the policy of LLMs by maximizing the estimated policy value, improving their multi-step reasoning abilities while maintaining their generalization abilities and generation quality in downstream tasks.

Through these methods, OCEAN aims to evaluate and optimize the multi-step reasoning abilities of LLMs, bringing them more in line with the reasoning logic of knowledge graphs without degrading their performance on downstream tasks.
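
To make the IPS idea in the components above concrete, here is a minimal sketch of a generic trajectory-level inverse propensity score estimate. It is not the paper's KG-IPS estimator: the function name `ips_value`, its arguments, and the optional `clip` parameter are hypothetical, and the per-path `rewards` stand in for whatever feedback signal (for example, a knowledge-graph-derived score) is available.

```python
import math
from typing import Optional, Sequence


def ips_value(
    target_logps: Sequence[Sequence[float]],
    behavior_logps: Sequence[Sequence[float]],
    rewards: Sequence[float],
    clip: Optional[float] = None,
) -> float:
    """Generic trajectory-level IPS estimate of a target policy's value.

    target_logps[i][t]   : log-prob of token t of path i under the policy being evaluated
    behavior_logps[i][t] : log-prob of the same token under the policy that logged the data
    rewards[i]           : scalar feedback for path i (hypothetical stand-in for KG feedback)
    clip                 : optional cap on importance weights (reduces variance, adds bias)
    """
    if not (len(target_logps) == len(behavior_logps) == len(rewards)):
        raise ValueError("all inputs must describe the same number of paths")
    if len(rewards) == 0:
        raise ValueError("at least one logged path is required")

    total = 0.0
    for t_lp, b_lp, reward in zip(target_logps, behavior_logps, rewards):
        if len(t_lp) != len(b_lp):
            raise ValueError("token counts must match within a path")
        # The path-level importance weight is the product of per-token
        # probability ratios, computed in log space for numerical stability.
        weight = math.exp(sum(t_lp) - sum(b_lp))
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(rewards)
```

When the evaluated policy and the logging policy assign identical log-probabilities, every weight is 1 and the estimate reduces to the mean logged reward. Leaving `clip` unset keeps the estimate unbiased under the usual support condition; setting it trades some bias for lower variance.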

OCEAN: Offline Chain-of-thought Evaluation and Alignment in Large Language Models
