Abstract: While LLMs excel at multi-hop questions (e.g. "Who is the spouse of the performer of Imagine?") when using chain-of-thought reasoning (CoT), they struggle when forced to reason internally (without CoT). Previous work on the size and nature of this gap produced mixed evidence with inconclusive results. In this paper, we introduce a controlled setting for investigating two-hop reasoning in LLMs, where the above-chance performance constitutes undeniable evidence for latent reasoning. We fine-tune LLMs (including Llama 3 8B Instruct and GPT-4o) on fictional facts and confirm that they generalize to answering two-hop questions about them using CoT. We find that models can perform latent reasoning when facts appear together during training or in the prompt. However, to our surprise, models completely fail at two-hop reasoning without CoT when learned facts only appear in different documents, achieving chance-level accuracy and chance-level test loss. We call this complete failure to compose separately learned facts the Two-Hop Curse. Moreover, we evaluate 9 frontier LLMs on real-world facts, finding that models completely fail at two-hop no-CoT reasoning for over half of question categories while maintaining partial success with CoT across most categories. These results suggest that LLMs lack a general capability for latent multi-hop reasoning independent of the question type.

### What problem does this paper attempt to solve?

This paper explores the limitations of large language models (LLMs) in two-hop reasoning, especially when they are forced to reason internally without chain-of-thought (CoT). Specifically, it attempts to answer the following questions:

1. **Can LLMs successfully perform two-hop reasoning without CoT?**
   - Using controlled experimental settings, the paper studies two-hop reasoning under different conditions, such as when the two facts appear in the same document or in different documents.
2. **Is the two-hop reasoning ability of LLMs limited by the way facts are learned?**
   - When the facts appear only in different training documents, LLMs completely fail at two-hop reasoning without CoT, with accuracy and test loss at chance level. The authors call this phenomenon the "Two-Hop Curse".
3. **How do LLMs perform two-hop reasoning on real-world knowledge?**
   - The authors evaluate 9 frontier LLMs on real-world facts and find that, without CoT, the models completely fail at two-hop reasoning in more than half of the question categories, while achieving partial success with CoT.
4. **Can the limitations of two-hop reasoning be overcome by controlling the internal structure of the model?**
   - The authors attempt two interventions: forcing facts to be stored in the correct layer order and providing activation-level supervision. Neither significantly improves two-hop reasoning.

### Main contributions

1. **Proposing a clean experimental setup**:
   - Because the models are fine-tuned on fictional facts, above-chance accuracy can only be attributed to successful two-hop reasoning, not to memorization or reasoning shortcuts.
2. **Discovering the "Two-Hop Curse" phenomenon**:
   - When facts are learned from separate training documents, LLMs completely fail at two-hop reasoning without CoT, with accuracy and test loss at chance level.
3. **Demonstrating the two-hop reasoning ability of LLMs in specific settings**:
   - When facts appear in the same document or in the prompt, LLMs can perform two-hop reasoning, indicating that the Two-Hop Curse mainly affects the composition of separately learned facts.
4. **Evaluating the real-world two-hop reasoning ability of frontier LLMs**:
   - Without CoT, these models completely fail at two-hop reasoning in more than half of the question categories, but they achieve partial success with CoT.
5. **Attempting multiple interventions to overcome the limitations of two-hop reasoning**:
   - The interventions include forcing facts to be stored in the correct layer order and providing activation-level supervision. Neither achieves significant improvement.

### Conclusion

The results show that current LLMs have fundamental limitations in two-hop reasoning when CoT is absent, which may stem from the lack of a general latent multi-hop reasoning capability. Understanding these limitations is crucial for developing more capable LLMs.
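
### Illustrative sketch of the controlled setup

The same-document versus separate-document conditions summarized above can be made concrete with a small data-layout sketch. This is not the authors' pipeline: the entity names (`Person_i`, `Spouse_i`, `City_i`), the sentence templates, and every function name below are hypothetical, and no model is fine-tuned here. The sketch only shows how fictional two-hop chains could be laid out into training documents under each condition, and what chance-level accuracy means when a guesser picks uniformly among the candidate answers.

```python
import random


def make_chains(n):
    """Create n fictional two-hop chains e1 -> e2 -> e3 (all names are made up)."""
    return [
        {"e1": f"Person_{i}", "e2": f"Spouse_{i}", "e3": f"City_{i}"}
        for i in range(n)
    ]


def first_hop(chain):
    # Hypothetical template for the first fact (e1 -> e2).
    return f"The spouse of {chain['e1']} is {chain['e2']}."


def second_hop(chain):
    # Hypothetical template for the second fact (e2 -> e3).
    return f"{chain['e2']} was born in {chain['e3']}."


def build_documents(chains, same_document):
    """Lay out training documents for one of the two conditions.

    same_document=True: both facts of a chain share one document.
    same_document=False: each fact is placed in its own document.
    """
    docs = []
    for chain in chains:
        if same_document:
            docs.append(f"{first_hop(chain)} {second_hop(chain)}")
        else:
            docs.extend([first_hop(chain), second_hop(chain)])
    return docs


def two_hop_questions(chains):
    """Questions that require composing both facts; the answer is e3."""
    return [
        (f"Where was the spouse of {c['e1']} born?", c["e3"]) for c in chains
    ]


def accuracy(predict, questions):
    """Fraction of questions for which predict(question) equals the answer."""
    correct = sum(predict(q) == answer for q, answer in questions)
    return correct / len(questions)


def chance_accuracy(chains):
    """Expected accuracy of a uniform guess over the distinct e3 answers."""
    return 1 / len({c["e3"] for c in chains})


if __name__ == "__main__":
    chains = make_chains(50)
    questions = two_hop_questions(chains)
    cities = [c["e3"] for c in chains]

    # A random guesser stands in for a model at chance level.
    rng = random.Random(0)
    guess_acc = accuracy(lambda q: rng.choice(cities), questions)

    print("same-document docs:", len(build_documents(chains, True)))
    print("separate docs:", len(build_documents(chains, False)))
    print("chance accuracy:", chance_accuracy(chains))
    print("random-guess accuracy:", guess_acc)
```

In the separate-document layout, no single document mentions both the first entity and the final answer, so a model that answers the two-hop question without CoT must compose the two facts internally. Because the facts are fictional, accuracy above `chance_accuracy` cannot come from pre-training knowledge.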

The Two-Hop Curse: LLMs trained on A->B, B->C fail to learn A-->C

Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers?

How Likely Do LLMs with CoT Mimic Human Reasoning?

Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?

Do Large Language Models Latently Perform Multi-Hop Reasoning?

Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs

Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries

Self-prompted Chain-of-Thought on Large Language Models for Open-domain Multi-hop Reasoning

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Do LLMs Really Think Step-by-step In Implicit Reasoning?

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Understanding Chain-of-Thought in LLMs through Information Theory

Can LLMs Learn from Mistakes? An Empirical Study on Reasoning Tasks

A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning

Can LLMs perform structured graph reasoning?

Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Reasoning

Chain-of-Thought Reasoning Without Prompting