Abstract: Recent studies demonstrate that query expansions generated by large language models (LLMs) can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper addresses problems that arise when large language models (LLMs) generate query expansions. Although LLM-generated expansions can significantly improve the performance of information retrieval systems, they are often misaligned with the retrieval corpus, which leads to hallucinated and outdated content. These issues stem primarily from the limited intrinsic knowledge of LLMs.

To solve these problems, the authors propose **Corpus-Steered Query Expansion (CSQE)**. CSQE uses the relevance assessment capability of LLMs to identify key sentences in the initially retrieved documents, then combines these sentences with the expansion text generated by the LLM to improve relevance prediction between the query and the target documents. This reduces hallucination and outdated information, and it improves retrieval performance on queries for which the LLM lacks knowledge.

### Specific Problems

1. **Hallucination Problem**: The expansion text generated by LLMs may contain inaccurate or misleading content, which degrades retrieval performance.
2. **Outdated Information**: The intrinsic knowledge of LLMs is limited and cannot be updated in real time, so the generated expansion text may contain outdated information.
3. **Insufficient Long-Tail Knowledge**: LLMs handle long-tail knowledge poorly, so the generated expansion text may lack relevance.

### Solution

CSQE addresses the above problems through the following steps:

1. **Initial Retrieval**: Use a traditional retrieval model such as BM25 to retrieve the top k documents related to the query from the corpus.
2. **Relevance Assessment**: Use an LLM to assess the relevance of these documents and extract the key sentences that contribute to relevance.
3. **Query Expansion**: Combine these key sentences with the expansion text generated by the LLM to form a new query for the final retrieval.

In this way, CSQE improves the accuracy and relevance of query expansions without relying solely on the intrinsic knowledge of LLMs. Experimental results show that CSQE performs strongly across multiple datasets without any training, and that it outperforms other methods especially on queries for which LLMs lack knowledge.
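The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the small `BM25` class, the function `csqe_expand`, and the two stand-in callables `toy_extract` and `toy_generate` are hypothetical names introduced here. In a real system the two callables would wrap prompted LLM calls, and the summary does not specify the prompts or how the pieces are weighted when they are concatenated, so plain concatenation is an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


class BM25:
    """Tiny in-memory BM25 scorer, for illustration only."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.docs = [tokenize(d) for d in docs]
        self.k1, self.b = k1, b
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query_tokens: List[str], idx: int) -> float:
        doc = self.docs[idx]
        tf = Counter(doc)
        total = 0.0
        for t in query_tokens:  # repeated query terms count repeatedly
            if t not in tf:
                continue
            # idf with +1 inside the log, which keeps every weight positive.
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            norm = tf[t] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += idf * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[int]:
        q = tokenize(query)
        ranked = sorted(range(self.n), key=lambda i: self.score(q, i), reverse=True)
        return ranked[:k]


def csqe_expand(
    query: str,
    corpus: List[str],
    bm25: BM25,
    extract_key_sentences: Callable[[str, List[str]], List[str]],
    generate_expansion: Callable[[str], str],
    k: int = 2,
) -> str:
    # Step 1: initial retrieval of the top-k documents.
    initial_docs = [corpus[i] for i in bm25.top_k(query, k)]
    # Step 2: relevance assessment (an LLM call in practice; a stub here).
    corpus_sentences = extract_key_sentences(query, initial_docs)
    # Step 3: combine the query, corpus sentences, and LLM-generated expansion.
    llm_text = generate_expansion(query)
    return " ".join([query, *corpus_sentences, llm_text])


def toy_extract(query: str, docs: List[str]) -> List[str]:
    """Stand-in for the LLM: keep sentences sharing >= 3 tokens with the query."""
    q = set(tokenize(query))
    kept = []
    for doc in docs:
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            if len(q & set(tokenize(sent))) >= 3:
                kept.append(sent)
    return kept


def toy_generate(query: str) -> str:
    """Stand-in for the LLM-knowledge expansion (fixed text, not a real model)."""
    return "The Eiffel Tower is a landmark in Paris, France."


corpus = [
    "The Eiffel Tower is in Paris. It opened in 1889.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France. The Louvre is a museum in Paris.",
]
bm25 = BM25(corpus)
query = "Where is the Eiffel Tower"
expanded = csqe_expand(query, corpus, bm25, toy_extract, toy_generate, k=2)
final_ranking = bm25.top_k(expanded, 1)  # final retrieval with the expanded query
print(expanded)
print(final_ranking)
```

Passing the two LLM roles in as callables keeps the retrieval logic testable without a model: the stubs can be swapped for real prompted calls without changing `csqe_expand`.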

Corpus-Steered Query Expansion with Large Language Models
