NOVO: Learnable and Interpretable Document Identifiers for Model-Based IR

Zihan Wang, Yujia Zhou, Yiteng Tu, Zhicheng Dou
DOI: https://doi.org/10.1145/3583780.3614993
2023-01-01
Abstract: Model-based Information Retrieval (Model-based IR) has gained attention due to advancements in generative language models. Unlike traditional dense retrieval methods, which rely on dense vector representations of documents, model-based IR uses language models to retrieve documents by generating their unique discrete identifiers (docids). This approach reduces the need to store separate document representations in an index. Most existing model-based IR approaches use pre-defined static docids, i.e., docids that are fixed and cannot be learned by training on retrieval tasks. Because these docids are not specifically optimized for retrieval, it is difficult for the model to learn the semantics of and relationships between documents and to achieve satisfactory retrieval performance. To address these limitations, we propose Neural Optimized VOcabularial (NOVO) docids. NOVO docids are unique n-gram sets, each identifying one document. The n-grams can be generated in any order to retrieve the corresponding document, and they can be optimized through training to better capture the semantics of and relationships between documents. We propose to optimize NOVO docids through query denoising modeling and retrieval tasks, which allows both the semantic and the token representations of such docids to be optimized. Experiments on two datasets under the normal and zero-shot settings show that NOVO achieves strong performance, making model-based IR more effective and more interpretable.
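The idea that a docid is an unordered set of n-grams can be illustrated with a toy lookup. This is only a minimal sketch under stated assumptions, not the paper's method: the corpus, the n-grams, the document labels, and the `lookup` helper are all hypothetical, and a plain dictionary stands in for the language model that would actually generate the n-grams.

```python
from typing import Dict, FrozenSet, Iterable, Optional

# Hypothetical toy corpus: every document is identified by a unique SET of n-grams.
# In NOVO these sets are learned; here they are hard-coded for illustration only.
DOCIDS: Dict[FrozenSet[str], str] = {
    frozenset({"neural ranking", "dense retrieval"}): "doc-1",
    frozenset({"neural ranking", "query expansion"}): "doc-2",
}


def lookup(generated_ngrams: Iterable[str]) -> Optional[str]:
    """Return the document whose n-gram set equals the generated n-grams.

    Converting to a frozenset discards generation order, so any permutation
    of the same n-grams resolves to the same document.
    """
    return DOCIDS.get(frozenset(generated_ngrams))
```

For example, `lookup(["dense retrieval", "neural ranking"])` and `lookup(["neural ranking", "dense retrieval"])` resolve to the same document, while a set shared only in part, such as `["neural ranking"]` alone, matches nothing, because the whole set is what makes the identifier unique.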