Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Yuepei Li, Kang Zhou, Qiao Qiao, Bach Nguyen, Qing Wang, Qi Li
2024-09-17
Abstract: Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs' context-faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs' receptiveness to external evidence. We introduce a method to quantify the memory strength of LLMs by measuring the divergence in LLMs' responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to evaluate the effects of evidence in different styles. Two datasets are used for evaluation: Natural Questions (NQ) with popular questions and popQA featuring long-tail questions. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory, particularly for larger LLMs such as GPT-4. On the other hand, presenting paraphrased evidence significantly increases LLMs' receptiveness compared to simple repetition or adding details.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to address is how context-faithful large language models (LLMs) are when handling external information. Specifically, the researchers focus on:

1. **The impact of memory strength on context-faithfulness**: The researchers explore the extent to which LLMs accept external evidence under different memory strengths. They find that for questions with higher memory strength, LLMs are more inclined to rely on internal memory, especially large models like GPT-4.
2. **The impact of evidence style on context-faithfulness**: The researchers also study how different styles of evidence (direct evidence, indirect evidence, restatements, and rewrites of evidence) affect LLMs' acceptance of external information. The results show that rewriting (paraphrasing) direct evidence can significantly increase LLMs' acceptance of external evidence.

### Main Conclusions:

- **The relationship between memory strength and context-faithfulness**: The stronger an LLM's memory of a question, the more likely it is to rely on internal memory. This trend is particularly evident on the NQ dataset, especially for large models like GPT-4 and ChatGPT.
- **The impact of evidence style**: Simply repeating direct evidence is ineffective for most models, but rewriting direct evidence is very effective and can significantly increase LLMs' acceptance of external evidence. Additionally, combining direct and indirect evidence can also enhance LLMs' context-faithfulness.

### Research Methods:

- **Datasets**: The study uses two datasets: popQA, which contains long-tail questions, and Natural Questions (NQ), which contains popular questions.
- **Quantifying memory strength**: Memory strength is quantified by measuring how consistent an LLM's responses are across different paraphrased versions of the same question.
- **Evidence generation**: Different styles of evidence are generated, including direct evidence, indirect evidence, restatements, and rewrites.

### Experimental Setup:

- **Models**: The experiments use four well-known language models: ChatGPT, GPT-4, LLaMA2-7B, and LLaMA2-70B.
- **Evaluation metrics**: Free-form Q&A is converted into multiple-choice questions, and the proportions of responses selecting the memory answer (MA), the counter-memory answer (CMA), and uncertain (UCT) are calculated.

### Experimental Results:

- **The impact of memory strength on different datasets**: Memory strength on the NQ dataset is generally higher than on the popQA dataset, and the larger the model, the higher the memory strength.
- **The relationship between memory strength and context-faithfulness**: The higher the memory strength, the higher the proportion of memory answers LLMs select, and the lower the proportion of counter-memory answers.
- **The impact of evidence style**: Rewriting direct evidence can significantly reduce the proportion of memory answers LLMs select and increase their acceptance of external evidence.

This research offers insight into the context-faithfulness of LLMs and into how their acceptance of external information can be improved.
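
The memory-strength measurement and the MA/CMA/UCT proportions described above can be illustrated with a minimal Python sketch. It rests on assumptions that this summary does not confirm: memory strength is scored here as the negative Shannon entropy of the answers given to paraphrases of one question, and answers are grouped by normalized exact string match. The function names `memory_strength` and `answer_ratios` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def memory_strength(answers):
    """Score how consistently a model answers paraphrases of one question.

    ASSUMPTION: the score is the negative Shannon entropy (in nats) of the
    answer distribution. 0.0 means every paraphrase received the same
    answer; more negative values mean more divergent answers.
    ASSUMPTION: answers are grouped by lowercased, stripped exact match.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    # sum of p * log(p) over distinct answers equals minus the entropy
    return sum((c / total) * math.log(c / total) for c in counts.values())


def answer_ratios(choices):
    """Return the fraction of MA, CMA, and UCT selections.

    `choices` holds one label per multiple-choice response:
    "MA" (memory answer), "CMA" (counter-memory answer), or "UCT" (uncertain).
    """
    labels = ("MA", "CMA", "UCT")
    if not choices:
        raise ValueError("need at least one choice")
    counts = Counter(choices)
    unknown = set(counts) - set(labels)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {label: counts[label] / len(choices) for label in labels}


if __name__ == "__main__":
    # Hypothetical answers to three paraphrases of one question.
    print(memory_strength(["Paris", "paris ", "PARIS"]))
    # Hypothetical selections after counter-memory evidence is presented.
    print(answer_ratios(["MA", "CMA", "CMA", "UCT"]))
```

Under these assumptions, a question whose paraphrases all receive the same answer scores 0.0, the maximum, and any disagreement pushes the score below zero. Comparing the CMA fraction across bins of this score is one way to examine the trend reported above.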