Generative Retrieval with Few-shot Indexing

Arian Askari, Chuan Meng, Mohammad Aliannejadi, Zhaochun Ren, Evangelos Kanoulas, Suzan Verberne
2024-08-05
Abstract: Existing generative retrieval (GR) approaches rely on training-based indexing, i.e., fine-tuning a model to memorise the associations between a query and the document identifier (docid) of a relevant document. Training-based indexing has three limitations: high training overhead, under-utilization of the pre-trained knowledge of large language models (LLMs), and challenges in adapting to a dynamic document corpus. To address the above issues, we propose a novel few-shot indexing-based GR framework (Few-Shot GR). It has a novel few-shot indexing process, where we prompt an LLM to generate docids for all documents in a corpus, ultimately creating a docid bank for the entire corpus. During retrieval, we feed a query to the same LLM and constrain it to generate a docid within the docid bank created during indexing, and then map the generated docid back to its corresponding document. Few-Shot GR relies solely on prompting an LLM without requiring any training, making it more efficient. Moreover, we devise few-shot indexing with one-to-many mapping to further enhance Few-Shot GR. Experiments show that Few-Shot GR achieves superior performance to state-of-the-art GR methods that require heavy training.
Information Retrieval, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses three limitations of training-based indexing in generative retrieval (GR):

1. **High training cost**: Existing GR methods fine-tune a model to memorize the associations between queries and the document identifiers (docids) of relevant documents, which requires substantial training data, time, and compute.
2. **Under-utilization of pre-trained knowledge**: The pre-training objective of large language models (LLMs), natural language generation, differs from the GR fine-tuning objective of mapping queries to docids, so fine-tuning may cause the model to forget its pre-trained knowledge.
3. **Difficulty adapting to a dynamic corpus**: Adding new documents requires retraining the model, which may cause it to forget previously indexed documents.

To address these issues, the authors propose Few-Shot GR, a few-shot indexing-based GR framework. During indexing, an LLM is prompted to generate docids for every document in the corpus, producing a docid bank. During retrieval, the same LLM receives the query and is constrained to generate a docid from that bank, which is then mapped back to its document. Because the framework relies solely on prompting, it requires no training, makes better use of the LLM's pre-trained knowledge, and adapts more easily to a changing corpus. The authors report that Few-Shot GR outperforms state-of-the-art GR methods that require heavy training.
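
The two stages described in the abstract (building a docid bank, then constraining generation to that bank) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The LLM is replaced by two toy stand-ins, `toy_generate_docids` and `toy_score`, and every function name, the `END` marker, the prefix trie, and the use of greedy decoding are assumptions made for illustration only.

```python
END = "<eos>"  # hypothetical end-of-docid marker, not taken from the paper


def build_docid_bank(corpus, generate_docids):
    """Indexing: map every generated docid (a tuple of tokens) to a document key.

    One-to-many mapping: generate_docids may return several docids per document.
    """
    bank = {}
    for doc_key, text in corpus.items():
        for docid in generate_docids(text):
            # Assumption: on a collision, the first document keeps the docid.
            bank.setdefault(tuple(docid), doc_key)
    return bank


def build_trie(bank):
    """Prefix tree over all docids; END marks where a complete docid stops."""
    root = {}
    for docid in bank:
        node = root
        for tok in docid:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_decode(query, trie, score_token):
    """Greedy decoding that may only pick tokens the trie allows at each step."""
    node, prefix = trie, []
    while True:
        best = max(node, key=lambda tok: score_token(query, prefix, tok))
        if best == END:
            return tuple(prefix)
        prefix.append(best)
        node = node[best]


def retrieve(query, bank, trie, score_token):
    """Retrieval: decode a docid inside the bank, then map it back to a document."""
    return bank[constrained_decode(query, trie, score_token)]


# Toy stand-ins for the LLM (assumptions for illustration only).
def toy_generate_docids(text):
    words = text.lower().split()
    return [words[:3], words[-3:]]  # two docids per document


def toy_score(query, prefix, tok):
    return 1.0 if tok in query.lower().split() else 0.0


corpus = {
    "d1": "Neural ranking models survey for web search",
    "d2": "Sparse retrieval basics with inverted index structures",
}
bank = build_docid_bank(corpus, toy_generate_docids)
trie = build_trie(bank)
print(retrieve("neural ranking", bank, trie, toy_score))
print(retrieve("inverted index", bank, trie, toy_score))
```

Because decoding can only follow paths in the trie, the output is always a docid that exists in the bank, so the final lookup cannot fail on a non-empty bank. Adding a document in this sketch only requires generating its docids and rebuilding the bank and trie; no model weights change. A real system would replace `toy_score` with next-token scores from the LLM, and greedy decoding here is a simplification.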