An Efficient Corpus Indexer for dynamic corpora retrieval

Ao Zou, Wenning Hao, Dawei Jin, Shichen Zou, Yu Zheng, Feiyan Sun, Li Xiang
DOI: https://doi.org/10.1016/j.eswa.2024.124306
IF: 8.5
2024-05-31
Expert Systems with Applications
Abstract: As a new paradigm for information retrieval, generative retrieval (GR) has achieved solid performance on various retrieval tasks. Despite this promising progress, this line of research cannot generalize to dynamic corpora, where new documents are continually added. Some pioneering works based on continual learning already address this issue, yet the continual learning framework requires retraining after model deployment and may suffer from catastrophic forgetting. Hence, we propose a new retrieval framework, denoted ECI (an Efficient Corpus Indexer for dynamic corpora retrieval). ECI is a hybrid index framework containing generative and deep hashing indexes. We design a complementary training objective, denoted Prefix-Sensitive Similarity Alignment, which further improves the performance of generative retrieval. In addition, ECI enables incremental deep hashing learning and provides a deep hashing index-based retrieval scheme for new documents, thus solving the generalization problem on dynamic corpora. Furthermore, ECI uses techniques such as whitening and query-generated data augmentation to enhance retrieval performance. In a dynamic corpus retrieval task built on the commonly used academic benchmark Natural Questions, ECI outperforms various baselines, including the state-of-the-art GR baseline and its variants.
computer science, artificial intelligence; engineering, electrical & electronic; operations research & management science
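
The abstract mentions whitening and a hashing index that can absorb new documents without retraining. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes document embeddings are already available (random vectors stand in for them here), and it swaps the learned deep hashing layer for a plain sign binarization of whitened embeddings. All function names (`fit_whitening`, `to_hash_codes`, `hamming_search`) are hypothetical.

```python
import numpy as np

def fit_whitening(embeddings):
    """Return (mean, w) such that (x - mean) @ w has roughly identity covariance."""
    mean = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mean).T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + 1e-8))
    return mean, w

def to_hash_codes(embeddings, mean, w):
    """Whiten, then binarize by sign (a stand-in for a learned hashing layer)."""
    return ((embeddings - mean) @ w > 0).astype(np.uint8)

def hamming_search(query_code, index_codes, k=1):
    """Return indices of the k index codes nearest to query_code in Hamming distance."""
    dists = (index_codes != query_code).sum(axis=1)
    return np.argsort(dists, kind="stable")[:k]

# Assumption: random vectors play the role of encoder outputs.
rng = np.random.default_rng(0)
base_docs = rng.normal(size=(200, 16))
mean, w = fit_whitening(base_docs)
index_codes = to_hash_codes(base_docs, mean, w)

# New documents are hashed with the frozen transform and appended to the index.
new_docs = rng.normal(size=(5, 16))
index_codes = np.vstack([index_codes, to_hash_codes(new_docs, mean, w)])

# A slightly perturbed copy of a new document serves as the query.
query = new_docs[2] + 0.01 * rng.normal(size=16)
top = hamming_search(to_hash_codes(query[None, :], mean, w)[0], index_codes, k=3)
print(top)
```

The point of the sketch is that adding documents only appends rows to the code matrix; the whitening transform and the binarization stay fixed, which is the property that avoids retraining after deployment.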