Corpus-Steered Query Expansion with Large Language Models

Yibin Lei, Yu Cao, Tianyi Zhou, Tao Shen, Andrew Yates
2024-02-28
Abstract: Recent studies demonstrate that query expansions generated by large language models (LLMs) can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.
Information Retrieval, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses the weaknesses of query expansions generated by large language models (LLMs). Although such expansions can significantly improve information retrieval systems, they are often misaligned with the retrieval corpus, which leads to hallucinated and outdated content. These issues stem primarily from the limited intrinsic knowledge of LLMs.

To solve these problems, the authors propose **Corpus-Steered Query Expansion (CSQE)**. CSQE uses the relevance assessment capability of LLMs to identify key sentences in the initially retrieved documents, then combines these sentences with the expansion text generated by the LLM to improve relevance prediction between the query and the target documents. This reduces hallucination and outdated information and also improves retrieval when the LLM lacks knowledge about the query.

### Specific Problems

1. **Hallucination Problem**: The expansion text generated by LLMs may contain inaccurate or misleading content, which hurts retrieval performance.
2. **Outdated Information**: The intrinsic knowledge of LLMs is limited and cannot be updated in real time, so the generated expansion text may contain outdated information.
3. **Insufficient Long-Tail Knowledge**: LLMs handle long-tail knowledge poorly, so the generated expansion text may lack relevance.

### Solution

CSQE addresses the above problems through the following steps:

1. **Initial Retrieval**: Use a traditional retrieval model such as BM25 to retrieve the top k documents for the query from the corpus.
2. **Relevance Assessment**: Use an LLM to assess the relevance of these documents and extract the key sentences that contribute to relevance.
3. **Query Expansion**: Combine these key sentences with the expansion text generated by the LLM to form a new query for the final retrieval.

In this way, CSQE improves the accuracy and relevance of query expansions without relying solely on the intrinsic knowledge of LLMs. Experimental results show that CSQE performs strongly across multiple datasets and outperforms other methods, especially in scenarios where LLMs lack knowledge.
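
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `TinyBM25` is a toy scorer written for this example, and `stub_extract` and `stub_generate` are hypothetical stand-ins for the two LLM calls (key-sentence extraction and hypothetical-document generation). The way the pieces are concatenated into the final query is also an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Toy BM25 index for illustration only (not a production retriever)."""

    def __init__(self, docs: List[str], k1: float = 1.2, b: float = 0.75):
        self.docs = docs
        self.tokens = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for toks in self.tokens for term in set(toks))

    def score(self, query_tokens: List[str], idx: int) -> float:
        tf = Counter(self.tokens[idx])
        dl, n = len(self.tokens[idx]), len(self.docs)
        total = 0.0
        for term in query_tokens:  # repeated query terms count more than once
            if term not in tf:
                continue
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            norm = tf[term] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += idf * tf[term] * (self.k1 + 1) / norm
        return total

    def search(self, query: str, k: int) -> List[int]:
        q = tokenize(query)
        ranked = sorted(range(len(self.docs)), key=lambda i: self.score(q, i), reverse=True)
        return ranked[:k]


def csqe_expand(
    query: str,
    index: TinyBM25,
    extract_key_sentences: Callable[[str, List[str]], List[str]],
    generate_hypothetical: Callable[[str], List[str]],
    k: int = 1,
) -> str:
    # Step 1: initial retrieval of the top-k documents.
    retrieved = [index.docs[i] for i in index.search(query, k)]
    # Step 2: an LLM would judge relevance and extract key sentences here.
    corpus_sentences = extract_key_sentences(query, retrieved)
    # Step 3: combine corpus sentences with LLM-generated expansion text.
    return " ".join([query] + corpus_sentences + generate_hypothetical(query))


def stub_extract(query: str, docs: List[str]) -> List[str]:
    """Hypothetical stand-in for the LLM: keep sentences sharing a query term."""
    q = set(tokenize(query))
    sentences = [s for d in docs for s in re.split(r"(?<=[.!?])\s+", d)]
    return [s for s in sentences if q & set(tokenize(s))]


def stub_generate(query: str) -> List[str]:
    """Hypothetical stand-in for an LLM that knows nothing about the query."""
    return []


docs = [
    "CSQE extracts pivotal sentences from retrieved documents.",
    "Pivotal sentences from retrieved documents steer the expansion toward the corpus.",
    "Bananas are rich in potassium.",
]
index = TinyBM25(docs)
query = "What is CSQE?"
expanded = csqe_expand(query, index, stub_extract, stub_generate, k=1)
final_ranking = index.search(expanded, k=2)
```

In this toy corpus the second document shares no term with the original query, so it scores zero at first. After the sentence extracted from the first document is appended to the query, the second document gains a positive score. The example is set up to mirror the case the summary highlights, where the LLM contributes no knowledge of its own and the expansion comes entirely from the corpus.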