Auto Search Indexer for End-to-End Document Retrieval

Tianchi Yang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang
2023-10-30
Abstract: Generative retrieval, which is a new advanced paradigm for document retrieval, has recently attracted research interests, since it encodes all documents into the model and directly generates the retrieved documents. However, its power is still underutilized since it heavily relies on the "preprocessed" document identifiers (docids), thus limiting its retrieval performance and ability to retrieve new documents. In this paper, we propose a novel fully end-to-end retrieval paradigm. It can not only end-to-end learn the best docids for existing and new documents automatically via a semantic indexing module, but also perform end-to-end document retrieval via an encoder-decoder-based generative model, namely Auto Search Indexer (ASI). Besides, we design a reparameterization mechanism to combine the above two modules into a joint optimization framework. Extensive experimental results demonstrate the superiority of our model over advanced baselines on both public and industrial datasets and also verify the ability to deal with new documents.
Information Retrieval
What problem does this paper attempt to address?
The paper addresses a limitation of existing generative retrieval methods: their reliance on pre-processed document identifiers (docids), which limits both retrieval performance and the ability to retrieve new documents. It proposes a fully end-to-end retrieval paradigm called Auto Search Indexer (ASI). ASI learns docids for both existing and new documents through a semantic indexing module, performs document retrieval through an encoder-decoder-based generative model, and combines the two in a joint optimization framework. The main contributions are:
1. A fully end-to-end paradigm, ASI, which supports both docid assignment and document retrieval.
2. A semantic indexing module with two new semantic-oriented loss functions that automatically assigns docids to documents, together with a reparameterization mechanism that enables joint training of all modules.
3. Extensive experiments on public and industrial datasets showing that ASI significantly outperforms existing state-of-the-art methods in document retrieval and learns meaningful docids for documents.
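The summary does not say how the reparameterization mechanism works, so the following is only a rough illustration of why such a mechanism is needed: choosing a discrete docid token by argmax is not differentiable, which blocks joint training. The sketch uses a straight-through estimator, a common general-purpose technique swapped in here as a stand-in and not confirmed to be the paper's method. All names (`assign_docid`, `straight_through_grad`) and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def assign_docid(position_logits):
    """Forward pass: pick one token per docid position by argmax (discrete)."""
    return [max(range(len(row)), key=row.__getitem__) for row in position_logits]

def straight_through_grad(logits, upstream):
    """Backward-pass surrogate (assumption, not the paper's mechanism).

    The forward pass used a hard one-hot choice; here the gradient is taken
    as if the output had been softmax(logits), i.e. J_softmax^T @ upstream.
    """
    p = softmax(logits)
    dot = sum(pi * gi for pi, gi in zip(p, upstream))
    return [pi * (gi - dot) for pi, gi in zip(p, upstream)]

# Hypothetical logits for a docid of length 2 over a 3-token vocabulary.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
docid = assign_docid(logits)

# Gradient for position 0, given an upstream signal that rewards token 0.
grad = straight_through_grad(logits[0], [1.0, 0.0, 0.0])
```

With this surrogate, a loss computed downstream of the discrete docid (for example, a retrieval loss) can still push the indexing logits toward better token choices, which is the general idea behind training an indexer and a retriever together.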