Abstract: Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved pieces of evidence and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers.
To facilitate future work, we release FreshQA at http://github.com/freshllms/freshqa and commit to updating it at regular intervals.
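
The prompting idea described above (retrieved evidence placed in the prompt, with the number and order of evidence items mattering, plus an instruction to answer concisely) might be sketched roughly as follows. This is an illustrative sketch only, not the FreshPrompt implementation: the `Evidence` fields, the `build_prompt` function, the prompt wording, and the choice to order evidence from oldest to newest are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    """One retrieved search result (hypothetical structure for illustration)."""
    source: str
    published: date
    snippet: str


def build_prompt(question, evidences, max_evidences=5, demonstrations=()):
    """Assemble a few-shot prompt from retrieved evidence.

    Assumptions (not taken from the abstract): the most recent
    `max_evidences` items are kept and listed oldest to newest, so the
    newest item sits closest to the question.
    """
    newest_first = sorted(evidences, key=lambda e: e.published, reverse=True)
    selected = list(reversed(newest_first[:max_evidences]))

    parts = list(demonstrations)  # few-shot demonstrations come first
    for e in selected:
        parts.append(
            f"source: {e.source}\n"
            f"date: {e.published.isoformat()}\n"
            f"snippet: {e.snippet}"
        )
    parts.append(f"question: {question}")
    # Ask for a concise, direct answer rather than a verbose one.
    parts.append("Answer concisely and directly.")
    parts.append("answer:")
    return "\n\n".join(parts)
```

Varying `max_evidences` and the sort direction in such a sketch is one simple way to probe how the amount and order of evidence affect the answers an LLM produces.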

Towards Faithful and Robust LLM Specialists for Evidence-Based Question-Answering

Enhancing Answer Attribution for Faithful Text Generation with Large Language Models

KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering

Fine-Tuning LLMs for Reliable Medical Question-Answering Services

Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately

Empirical Insights on Fine-Tuning Large Language Models for Question-Answering

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Benchmarking Large Language Models in Evidence-Based Medicine

Effective Large Language Model Adaptation for Improved Grounding and Citation Generation

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering

Investigating Answerability of LLMs for Long-Form Question Answering

Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style

Know Where to Go: Make LLM a Relevant, Responsible, and Trustworthy Searcher

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Retrieving Supporting Evidence for LLMs Generated Answers

Can LLMs Produce Faithful Explanations For Fact-checking? Towards Faithful Explainable Fact-Checking via Multi-Agent Debate

Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models

Enhancing Large Language Models' Situated Faithfulness to External Contexts

Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course