Abstract: Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence their context-faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs' receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in their responses to different paraphrases of the same question, a factor that previous work has not considered. We also generate evidence in various styles to evaluate how presentation style affects receptiveness. Two datasets are used for evaluation: Natural Questions (NQ), with popular questions, and popQA, featuring long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly larger LLMs such as GPT-4. On the other hand, presenting paraphrased evidence significantly increases LLMs' receptiveness compared to simple repetition or adding details.

What problem does this paper attempt to address?

The problem this paper attempts to address is the context-faithfulness of large language models (LLMs) when they handle external information. Specifically, the researchers focus on:

1. **The impact of memory strength on context-faithfulness**: The researchers explore the extent to which LLMs accept external evidence under different memory strengths. They find that for questions with higher memory strength, LLMs are more inclined to rely on internal memory, especially large models like GPT-4.
2. **The impact of evidence style on context-faithfulness**: The researchers also study how different styles of evidence (direct evidence, indirect evidence, restatements, and rewrites of evidence) affect LLMs' acceptance of external information. The results show that rewriting (paraphrasing) direct evidence can significantly increase LLMs' acceptance of external evidence.

### Main Conclusions:

- **The relationship between memory strength and context-faithfulness**: The stronger an LLM's memory of a question, the more likely it is to rely on internal memory. This trend is particularly evident on the NQ dataset, especially for large models like GPT-4 and ChatGPT.
- **The impact of evidence style**: Simply repeating direct evidence is ineffective for most models, whereas rewriting direct evidence significantly increases LLMs' acceptance of external evidence. Combining direct and indirect evidence can also enhance LLMs' context-faithfulness.

### Research Methods:

- **Datasets**: The study uses two datasets: popQA, which contains long-tail questions, and Natural Questions (NQ), which contains popular questions.
- **Quantifying memory strength**: Memory strength is quantified by measuring the consistency of an LLM's responses to different paraphrases of the same question.
- **Evidence generation**: Different styles of evidence are generated, including direct evidence, indirect evidence, restatements, and rewrites.

### Experimental Setup:

- **Models**: The experiments use four well-known language models: ChatGPT, GPT-4, LLaMA2-7B, and LLaMA2-70B.
- **Evaluation metrics**: Free-form Q&A is converted into multiple-choice questions, and the proportions of memory answers (MA), counter-memory answers (CMA), and uncertain responses (UCT) are calculated.

### Experimental Results:

- **The impact of memory strength on different datasets**: Memory strength is generally higher on the NQ dataset than on the popQA dataset, and larger models show higher memory strength.
- **The relationship between memory strength and context-faithfulness**: The higher the memory strength, the higher the proportion of memory answers and the lower the proportion of counter-memory answers.
- **The impact of evidence style**: Rewriting direct evidence can significantly reduce the proportion of memory answers and increase LLMs' acceptance of external evidence.

Together, these findings clarify what drives LLMs' context-faithfulness and how evidence presentation can improve their acceptance of external information.
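
The memory-strength measure and the MA/CMA/UCT proportions described above can be illustrated with a minimal sketch. The summary does not give the paper's exact formula, so this example assumes one plausible consistency score: the negative Shannon entropy of the answers a model gives to paraphrases of one question (0 means every paraphrase produced the same answer; more negative means more divergence). The function names `memory_strength` and `choice_proportions`, the lowercase string normalization, and the sample inputs are all hypothetical.

```python
from collections import Counter
from math import log


def memory_strength(answers):
    """Assumed consistency score: negative Shannon entropy of the answers.

    `answers` holds one model answer per paraphrase of the same question.
    Returns 0.0 when all answers agree and a more negative value as the
    answers diverge. This is an illustrative stand-in, not the paper's
    confirmed definition.
    """
    if not answers:
        raise ValueError("need at least one answer")
    # Crude normalization (assumption): case- and whitespace-insensitive match.
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    # sum(p * log p) equals the negative of the entropy.
    return sum((c / total) * log(c / total) for c in counts.values())


def choice_proportions(choices):
    """Share of memory (MA), counter-memory (CMA), and uncertain (UCT) picks."""
    if not choices:
        raise ValueError("need at least one choice")
    counts = Counter(choices)
    return {label: counts.get(label, 0) / len(choices)
            for label in ("MA", "CMA", "UCT")}


if __name__ == "__main__":
    # Hypothetical answers to three paraphrases of one question.
    print(memory_strength(["Paris", "paris", " Paris "]))
    print(memory_strength(["Paris", "Lyon"]))
    # Hypothetical multiple-choice outcomes after evidence is shown.
    print(choice_proportions(["MA", "CMA", "CMA", "UCT"]))
```

Grouping questions by such a score and comparing the MA and CMA shares per group is one way to reproduce the kind of analysis the summary describes, under the assumptions stated here.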

Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Enhancing Large Language Models' Situated Faithfulness to External Contexts

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

What External Knowledge is Preferred by LLMs? Characterizing and Exploring Chain of Evidence in Imperfect Context

Context-faithful Prompting for Large Language Models

LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering

Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism

FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

When Context Leads but Parametric Memory Follows in Large Language Models

Retrieving Supporting Evidence for LLMs Generated Answers

Context Matter: Data-Efficient Augmentation of Large Language Models for Scientific Applications

Retrieval meets Long Context Large Language Models

KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering

ALR$^2$: A Retrieve-then-Reason Framework for Long-context Question Answering

Attribute or Abstain: Large Language Models as Long Document Assistants

Are Large Language Models Good at Utility Judgments?

Large Language Models Can Self-Improve in Long-context Reasoning

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach