FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong
2023-11-22
Abstract: Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world. In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as Perplexity.AI. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers.
To facilitate future work, we release FreshQA at github.com/freshllms/freshqa and commit to updating it at regular intervals.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the factual accuracy of current large language models (LLMs) when they answer questions that require up-to-date world knowledge. Specifically:

1. **Lack of Dynamic Adaptability**: Most LLMs are trained once and never updated, so they cannot adapt to a changing world and perform poorly on questions that require fast-changing knowledge.
2. **Factual Accuracy Challenges**: The paper evaluates how well existing LLMs answer questions that test current world knowledge by introducing FreshQA, a new dynamic question-answering benchmark. FreshQA covers a diverse range of question types, including questions that require fast-changing knowledge and questions with false premises that need to be debunked.
3. **Hallucination Problem**: Regardless of model size, all evaluated models struggle on questions involving fast-changing knowledge and false premises, often generating plausible-sounding but incorrect information.

To address these challenges, the paper proposes FreshPrompt, a simple few-shot prompting method that incorporates relevant, up-to-date information retrieved from a search engine into the prompt. The experiments show that FreshPrompt substantially improves LLM performance on FreshQA, and that instructing the model to give concise, direct answers helps reduce hallucination compared with encouraging more verbose answers.
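
The general idea of placing retrieved search results in the prompt can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Evidence` structure, the `build_prompt` function, the field labels, and the `max_evidences` default are all hypothetical. The abstract states only that the number and order of retrieved evidences matter; the choice below to list evidence oldest-to-newest, so that the most recent item sits closest to the question, is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass
class Evidence:
    """One retrieved search result (hypothetical structure, for illustration only)."""
    source: str
    published: date
    snippet: str


def build_prompt(question: str, evidences: list[Evidence], max_evidences: int = 5) -> str:
    """Assemble a prompt that places retrieved evidence before the question."""
    # Keep only the most recent results (the evidence count is a tunable assumption).
    newest = sorted(evidences, key=lambda e: e.published, reverse=True)[:max_evidences]
    # Assumed ordering: oldest first, so the freshest evidence is nearest the question.
    ordered = sorted(newest, key=lambda e: e.published)

    lines: list[str] = []
    for e in ordered:
        lines.append(f"source: {e.source}")
        lines.append(f"date: {e.published.isoformat()}")
        lines.append(f"snippet: {e.snippet}")
        lines.append("")  # blank line between evidence items
    lines.append(f"question: {question}")
    # The abstract reports that asking for concise, direct answers reduces hallucination.
    lines.append("Answer concisely and directly.")
    lines.append("answer:")
    return "\n".join(lines)
```

The returned string would then be sent to whichever LLM is being evaluated. In a few-shot setup such as the one the paper describes, worked demonstrations in the same layout would be prepended to this prompt; they are omitted here to keep the sketch short.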