Abstract: The ranker and the retriever are two important components in dense passage retrieval. The retriever typically adopts a dual-encoder model, in which queries and documents are fed separately into two pre-trained models and the resulting vectors are used to compute similarity. The ranker often uses a cross-encoder model, in which each concatenated query-document pair is fed into a single pre-trained model to obtain word-level similarities. However, because the dual-encoder encodes queries and documents independently, it lacks interaction between them, while the cross-encoder incurs a substantial computational cost for attention calculation, which makes real-time retrieval difficult. In this paper, we propose MD2PR, a dense retrieval model based on multi-level distillation. In this model, we distill the knowledge learned by the cross-encoder into the dual-encoder at both the sentence level and the word level. Sentence-level distillation improves the dual-encoder's ability to capture the themes and emotions of sentences. Word-level distillation improves the dual-encoder's analysis of word semantics and relationships. As a result, the dual-encoder can be used on its own for subsequent encoding and retrieval, avoiding the significant computational cost of involving the cross-encoder. Furthermore, we propose a simple dynamic filtering method that updates the threshold over multiple training iterations to identify false negatives effectively and thus obtain a more comprehensive semantic representation space. Experimental results on two standard datasets show that MD2PR outperforms 11 baseline models in terms of MRR and Recall.
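To make the idea of sentence-level distillation and threshold-based filtering more concrete, the following is a minimal sketch in plain Python, not the MD2PR implementation. It assumes that the dual-encoder scores a passage by the dot product of query and passage vectors, that the cross-encoder (teacher) supplies one relevance score per passage, and that the distillation loss is a KL divergence between the two softmax distributions; the abstract does not state these details. All function names are hypothetical. The threshold is passed in on each call so that a training loop could change it between iterations; the update rule itself is not given in the abstract and is not sketched here, and word-level distillation is likewise omitted.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dual_encoder_scores(query_vec, passage_vecs):
    """Student scores: query and passages are encoded independently,
    so the score is just a similarity between precomputed vectors
    (dot product is an assumption here)."""
    return [dot(query_vec, p) for p in passage_vecs]

def sentence_level_distill_loss(teacher_scores, query_vec, passage_vecs):
    """Assumed loss: KL divergence from the teacher's score distribution
    to the student's score distribution over the same passages."""
    student = softmax(dual_encoder_scores(query_vec, passage_vecs))
    teacher = softmax(teacher_scores)
    return kl_divergence(teacher, student)

def filter_false_negatives(teacher_scores, positive_idx, threshold):
    """Return indices of negatives to keep. A negative whose teacher score
    reaches the threshold is treated as a suspected false negative and dropped."""
    return [i for i, s in enumerate(teacher_scores)
            if i != positive_idx and s < threshold]

# Toy example with made-up vectors and teacher scores.
query_vec = [1.0, 0.0]
passage_vecs = [[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]]
teacher_scores = [4.0, -1.0, 3.5]

loss = sentence_level_distill_loss(teacher_scores, query_vec, passage_vecs)
kept = filter_false_negatives(teacher_scores, positive_idx=0, threshold=3.0)
print(loss, kept)
```

In this toy example, passage 2 receives a teacher score above the threshold and is excluded from the negatives, while passage 1 is kept. At inference time only `dual_encoder_scores` would be needed, which reflects the abstract's point that the cross-encoder is not involved in retrieval.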

Another Look at DPR: Reproduction of Training and Replication of Retrieval

Dense Passage Retrieval: Is it Retrieving?

Expand, Rerank, and Retrieve: Query Reranking for Open-Domain Question Answering

Dense Hierarchical Retrieval for Open-Domain Question Answering

Synthetic Target Domain Supervision for Open Retrieval QA

ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval

Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently

Contrastive Refinement for Dense Retrieval Inference in the Open-Domain Question Answering Task

A Multi-level Distillation based Dense Passage Retrieval Model

DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index

SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering

DPTDR: Deep Prompt Tuning for Dense Passage Retrieval

Span prompt dense passage retrieval for Chinese open domain question answering

DAPR: A Benchmark on Document-Aware Passage Retrieval

End-to-End Training of Neural Retrievers for Open-Domain Question Answering

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

Mutually improved dense retriever and GNN-based reader for arbitrary-hop open-domain question answering

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

Offline Pseudo Relevance Feedback for Efficient and Effective Single-pass Dense Retrieval

Top K Relevant Passage Retrieval for Biomedical Question Answering