Abstract: Large Language Models (LLMs) have excelled in multi-hop question-answering (M-QA) due to their advanced reasoning abilities. However, the impact of the inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, where different reasoning structures are entangled together, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable a fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding facilitates the exploration of textual structures as compared with semantics.

What problem does this paper attempt to address?

This paper addresses the difficulty of evaluating large language models (LLMs) on multi-hop question-answering (M-QA) tasks when the datasets used lack an explicit reasoning structure. Specifically:

1. **Unclear reasoning structure**: Existing multi-hop question-answering datasets (such as HotpotQA, MuSiQue and 2WikiMultiHopQA) provide complex reasoning tasks, but do not provide an explicit reasoning structure for each question-answer pair. As a result, LLMs cannot use predefined reasoning paths and must rely on their internal knowledge for reasoning.
2. **Insufficient classification of question complexity**: Questions of different reasoning complexities are mixed together in existing datasets without classification, making it difficult to study in detail how LLMs perform on different reasoning structures.

To solve these problems, the authors introduce a new dataset, the **Graph Reasoning-Structured Question Answering Dataset (GRS-QA)**. This dataset explicitly captures complex reasoning paths by constructing reasoning graphs that represent the textual context and logical flow in graph form. These reasoning graphs not only show how an LLM should reason step by step to reach the answer, but also allow researchers to evaluate LLM performance under different reasoning structures at a finer granularity.

### Main contributions:

1. **The first question-answering dataset with reasoning graphs**: GRS-QA provides an explicit reasoning graph for each question-answer pair, making the reasoning process transparent and helping researchers pinpoint where a model struggles.
2. **Comprehensive analysis and classification of reasoning graphs**: Each question-answer pair is accompanied by detailed metadata, such as reasoning type and complexity, which supports performance analysis by reasoning structure and provides new insights.
3. **Generation of negative reasoning graphs**: In addition to the real reasoning graphs, negative reasoning graphs with structural perturbations are generated to study specifically how structure influences reasoning and question-answering performance.

Through these contributions, GRS-QA can better evaluate and improve the performance of LLMs in complex reasoning tasks.
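
To make the idea of a reasoning graph and a structurally perturbed negative graph more concrete, here is a minimal Python sketch. It is not the authors' implementation or the dataset's actual schema: the `ReasoningGraph` class, the `complexity` proxy, the `perturb` function, and the toy sentences are all hypothetical names and data introduced for illustration. It assumes only what the summary states, namely that nodes hold textual contexts, edges denote logical flow, and a negative graph differs from the real one in structure.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class ReasoningGraph:
    """Hypothetical container: node id -> supporting text, plus directed edges."""
    nodes: Dict[int, str]
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def complexity(self) -> Tuple[int, int]:
        """Simple structural proxy (an assumption): (node count, edge count)."""
        return len(self.nodes), len(self.edges)


def perturb(graph: ReasoningGraph, rng: random.Random) -> ReasoningGraph:
    """Return a negative graph: same nodes, one edge swapped for a new one."""
    ids = sorted(graph.nodes)
    # Directed edges that are absent from the original graph (no self-loops).
    unused = [(u, v) for u in ids for v in ids
              if u != v and (u, v) not in graph.edges]
    if not graph.edges or not unused:
        raise ValueError("graph has no edge to rewire or no free slot")
    removed = rng.choice(sorted(graph.edges))
    added = rng.choice(unused)
    return ReasoningGraph(dict(graph.nodes), (graph.edges - {removed}) | {added})


# Toy three-node chain; the sentences are placeholders, not dataset entries.
positive = ReasoningGraph(
    nodes={0: "Context A mentions entity X.",
           1: "Context B links X to entity Y.",
           2: "Context C gives the answer about Y."},
    edges={(0, 1), (1, 2)},
)
negative = perturb(positive, random.Random(0))
print(positive.complexity(), sorted(negative.edges))
```

Keeping the node texts fixed and changing only the edges isolates the effect of structure from the effect of semantics, which is the kind of comparison the negative graphs are described as enabling. Other perturbations (dropping an edge, reversing a direction) would fit the same interface.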

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

ReasoningLM: Enabling Structural Subgraph Reasoning in Pre-trained Language Models for Question Answering over Knowledge Graph

QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering

Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering

DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text

CR-LT-KGQA: A Knowledge Graph Question Answering Dataset Requiring Commonsense Reasoning and Long-Tail Knowledge

GraphextQA: A Benchmark for Evaluating Graph-Enhanced Large Language Models

Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models

CRT-QA: A Dataset of Complex Reasoning Question Answering over Tabular Data

SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs

Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models

CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph Prompting

Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models

GNN-RAG: Graph Neural Retrieval for Large Language Model Reasoning

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation

On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering

GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets

Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data

NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset