Abstract: Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers.
To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.

What problem does this paper attempt to address?

The main problem this paper addresses is the factual accuracy of current large language models (LLMs) when answering questions that require up-to-date world knowledge. Specifically:

1. **Lack of Dynamic Adaptability**: Most LLMs are trained once and never updated, so they cannot adapt to an ever-changing world and perform poorly on questions that require fast-changing knowledge.
2. **Factual Accuracy Challenges**: The paper introduces FreshQA, a new dynamic question-answering benchmark, to evaluate how well existing LLMs answer questions that test current world knowledge. FreshQA covers diverse question types, including questions that require fast-changing knowledge and questions with false premises that need to be debunked.
3. **Hallucination Problem**: Regardless of model size, all models struggle on questions involving fast-changing knowledge and false premises, often producing plausible-sounding but incorrect information.

To address these challenges, the paper proposes FreshPrompt, a simple few-shot prompting method that incorporates relevant, up-to-date information retrieved from a search engine into the prompt. Experiments show that FreshPrompt substantially improves LLM performance on FreshQA, and that instructing the model to give concise, direct answers reduces hallucination compared with encouraging more verbose answers.
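The idea of placing retrieved evidence into the prompt can be illustrated with a minimal sketch. This is not the authors' implementation: the `Evidence` schema, the `build_prompt` function, the `max_evidences` parameter, the field labels, and the instruction wording are all hypothetical. The sketch also assumes an oldest-to-newest ordering so that the freshest snippet sits closest to the question, and it omits the few-shot demonstrations that a full few-shot prompt would prepend.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    """One retrieved search result (hypothetical schema, not from the paper)."""
    source: str
    published: date
    snippet: str


def build_prompt(question: str, evidences: list[Evidence], max_evidences: int = 5) -> str:
    """Assemble a search-augmented prompt for a single question."""
    # Keep only the `max_evidences` most recent results.
    newest_first = sorted(evidences, key=lambda e: e.published, reverse=True)
    kept = newest_first[:max_evidences]
    # Assumption: list older evidence first so the newest is nearest the question.
    kept.reverse()

    lines = []
    for ev in kept:
        lines.append(f"source: {ev.source}")
        lines.append(f"date: {ev.published.isoformat()}")
        lines.append(f"snippet: {ev.snippet}")
        lines.append("")  # blank line between evidence entries
    lines.append(f"question: {question}")
    # Ask for a short answer and for false premises to be flagged.
    lines.append("Answer concisely and directly. "
                 "If the question rests on a false premise, say so.")
    lines.append("answer:")
    return "\n".join(lines)
```

The two knobs exposed here, how many evidence entries are kept and the order in which they appear, correspond to the factors the abstract reports as influencing answer correctness. The resulting string would be sent to whichever LLM is being evaluated.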

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

FastLearn: A Rapid Learning Agent for Chat Models to Acquire Latest Knowledge

CuriousLLM: Elevating Multi-Document QA with Reasoning-Infused Knowledge Graph Prompting

Enhancing Large Language Models' Situated Faithfulness to External Contexts

Long-form factuality in large language models

Know where to go: Make LLM a relevant, responsible, and trustworthy searchers

ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks

LEAF: Learning and Evaluation Augmented by Fact-Checking to Improve Factualness in Large Language Models

Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems

HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

Towards Mitigating Hallucination in Large Language Models via Self-Reflection

Self-Prompting Large Language Models for Zero-Shot Open-Domain QA

Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

Query Refinement Prompts for Closed-Book Long-Form Question Answering

Investigating Answerability of LLMs for Long-Form Question Answering

Towards Faithful and Robust LLM Specialists for Evidence-Based Question-Answering

QuickLLaMA: Query-aware Inference Acceleration for Large Language Models

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models

Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability