Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?

Neeladri Bhuiya, Viktor Schlegel, Stefan Winkler
2024-10-31
Abstract: State-of-the-art Large Language Models (LLMs) are credited with an increasing number of capabilities, ranging from reading comprehension and advanced mathematical and reasoning skills to scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given concerns about simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we set out to investigate whether LLMs are prone to exploiting such cues. We find evidence that they indeed circumvent the requirement to perform multi-hop reasoning, but they do so in more subtle ways than what was reported for their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark built by generating seemingly plausible multi-hop reasoning chains that ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs and find that their ability to perform multi-hop reasoning is affected, as indicated by a relative decrease in F1 score of up to 45% when they are presented with such seemingly plausible alternatives. We conduct a deeper analysis and find evidence that while LLMs tend to ignore misleading lexical cues, misleading reasoning paths indeed present a significant challenge.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether large language models (LLMs) exploit simplifying cues in multi-hop reasoning tasks and thereby bypass the actual reasoning requirement. Specifically, the authors focus on simplifying cues present in existing multi-hop reasoning benchmarks, which allow a model to reach the answer without performing multi-hop reasoning. To test this, they design a method for generating paragraphs that contain misleading reasoning paths: chains that look plausible but ultimately lead to a wrong answer. From these paragraphs they build a challenging multi-hop reasoning benchmark, with the aim of evaluating LLMs more accurately when a question admits several seemingly reasonable reasoning paths. The study finds that although LLMs are more capable than their predecessors, fine-tuned pre-trained language models (PLMs), they still circumvent the reasoning requirement, only in more subtle ways, and they remain affected by misleading reasoning paths. For example, when questions are accompanied by such paths, the models' F1 scores show a relative decrease of up to 45%, which indicates that these models still have clear limitations on multi-hop reasoning tasks. Overall, the paper sets out to characterize the real capabilities and limitations of LLMs in multi-hop reasoning, especially on inputs deliberately constructed to lead the model to a wrong conclusion. The results help delineate where LLM capabilities end and suggest where these models can be improved.
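To make the "relative decrease in F1 score" figure concrete, the sketch below shows how such a number can be computed. It assumes a token-overlap F1 of the kind commonly used for extractive question answering; the paper's actual evaluation script is not shown in this text, so `token_f1`, `relative_decrease`, and the two example scores are illustrative assumptions rather than the authors' implementation or results.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokenization, without the extra
    normalization (punctuation, articles) that real QA scripts often add.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_decrease(baseline: float, perturbed: float) -> float:
    """Fraction of the baseline score that is lost after perturbation."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - perturbed) / baseline


# Hypothetical scores, chosen only to illustrate the arithmetic.
baseline_f1 = 0.80    # F1 on the original questions (assumed value)
distractor_f1 = 0.44  # F1 with plausible distractor paths added (assumed value)
drop = relative_decrease(baseline_f1, distractor_f1)
print(f"relative decrease: {drop:.0%}")
```

A relative decrease is measured against the baseline score, so a 45% relative decrease from an F1 of 0.80 is an absolute drop of 0.36 points, not 0.45.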