Abstract: State-of-the-art Large Language Models (LLMs) are credited with a growing number of capabilities, ranging from reading comprehension through advanced mathematical and reasoning skills to scientific knowledge. In this paper we focus on their multi-hop reasoning capability: the ability to identify and integrate information from multiple textual sources. Given concerns about simplifying cues in existing multi-hop reasoning benchmarks, which allow models to circumvent the reasoning requirement, we investigate whether LLMs are prone to exploiting such cues. We find evidence that they do circumvent the requirement to perform multi-hop reasoning, but in more subtle ways than those reported for their fine-tuned pre-trained language model (PLM) predecessors. Motivated by this finding, we propose a challenging multi-hop reasoning benchmark built by generating seemingly plausible multi-hop reasoning chains that ultimately lead to incorrect answers. We evaluate multiple open and proprietary state-of-the-art LLMs and find that their multi-hop reasoning performance is affected, as indicated by a relative decrease in F1 score of up to 45% when they are presented with such seemingly plausible alternatives. A deeper analysis provides evidence that, while LLMs tend to ignore misleading lexical cues, misleading reasoning paths present a significant challenge.

What problem does this paper attempt to address?

This paper asks whether large language models (LLMs) exploit simplifying cues in multi-hop reasoning tasks and thereby bypass the actual reasoning requirement. The authors start from a known weakness of existing multi-hop reasoning benchmarks: some contain cues that let a model reach the answer without integrating information from multiple sources. They find that LLMs, although more capable than their fine-tuned pre-trained language model (PLM) predecessors, still circumvent the multi-hop reasoning requirement, only in more subtle ways. Motivated by this finding, the authors propose a more challenging benchmark. Their method generates text passages containing misleading reasoning chains that appear plausible but lead to incorrect answers, so that a model is confronted with several seemingly valid reasoning paths for the same question. On this benchmark, the evaluated open and proprietary LLMs show a relative decrease in F1 score of up to 45%. A deeper analysis indicates that LLMs tend to ignore misleading lexical cues, whereas misleading reasoning paths remain a significant challenge. Overall, the paper characterizes both the capabilities and the limitations of LLMs in multi-hop reasoning, especially on questions deliberately constructed to lead a model to a wrong conclusion, and its results help delineate where these models still need improvement.
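To make the "relative decrease in F1 score" concrete, the following minimal Python sketch computes a token-overlap F1 between a predicted and a reference answer and the relative drop between two scores. It is an illustration under assumptions, not the paper's evaluation code: the summary does not say how F1 is computed, so a common token-overlap definition is assumed here, the function names `token_f1` and `relative_decrease` are hypothetical, and the scores 0.80 and 0.44 are made-up values chosen only so the arithmetic is easy to follow.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answers (assumed definition, not confirmed by the paper)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either answer is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_decrease(baseline: float, perturbed: float) -> float:
    """Fraction of the baseline score that is lost after adding distractors."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - perturbed) / baseline


if __name__ == "__main__":
    # Hypothetical scores: F1 on original questions vs. questions with distractor paths.
    drop = relative_decrease(baseline=0.80, perturbed=0.44)
    print(f"relative decrease: {drop:.0%}")
```

Note that a relative decrease is measured against the baseline score: in the hypothetical case above, the absolute drop is 0.36 F1 points, while the relative drop is about 45%.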

Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning

Do Large Language Models Latently Perform Multi-Hop Reasoning?

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Large Language Models Still Face Challenges in Multi-Hop Reasoning with External Knowledge

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Distributional reasoning in LLMs: Parallel reasoning processes in multi-hop reasoning

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Concise and Organized Perception Facilitates Reasoning in Large Language Models

Towards Interpreting Language Models: A Case Study in Multi-Hop Reasoning

Preventing Language Models From Hiding Their Reasoning

Reasoning with Large Language Models, a Survey

The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A-->C

Hint Marginalization for Improved Reasoning in Large Language Models

Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?

Can Large Language Models Reason? A Characterization via 3-SAT

Break the Chain: Large Language Models Can be Shortcut Reasoners

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models