Towards Consistency Filtering-Free Unsupervised Learning for Dense Retrieval

Haoxiang Shi, Sumio Fujita, Tetsuya Sakai
2023-08-06
Abstract: Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to fine-tune a general ranker and produce a domain-specific ranker. However, training such consistency filters is computationally expensive, which significantly reduces model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and to recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods to achieve consistency filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo-relevance feedback outperforms the other methods. Furthermore, we analyzed the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance on particular datasets.
Information Retrieval, Computation and Language, Machine Learning, Networking and Internet Architecture
What problem does this paper attempt to address?
The paper addresses the challenge of domain transfer in Information Retrieval (IR). Existing methods typically rely on manually annotated data from the target domain, or on synthetic data produced through consistency filtering, to fine-tune a general ranker into a domain-specific ranker. These methods have two main drawbacks: training the consistency filter is computationally expensive, and the filter often struggles to identify retrieval intent and to capture the distributions of queries and corpora in the target domain. The paper therefore evaluates an alternative that achieves unsupervised dense retrieval without consistency filtering, replacing the filter with direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation. The study aims to improve model efficiency while maintaining or enhancing retrieval performance.

The main contributions of the paper include:

1. **Proposing a new unsupervised domain adaptation method**: Consistency filtering is replaced with direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation to achieve unsupervised dense retrieval.
2. **Experimental validation of the method's effectiveness**: Extensive experiments on two domain-specific IR datasets demonstrate that the TextRank-based pseudo-relevance feedback method performs best on most metrics.
3. **Analysis of training and inference efficiency**: The study shows that unsupervised learning without consistency filtering can continuously improve training and inference efficiency while maintaining retrieval performance.
4. **Discussion of the method's limitations and future improvement directions**: Through failure case analysis, the paper points out the method's shortcomings on certain queries and proposes potential directions for improvement.

Overall, the paper offers an efficient, low-cost approach to cross-domain information retrieval that removes the consistency-filtering step while keeping retrieval performance comparable and, on some datasets, better.
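
To make the idea of unsupervised keyword generation more concrete, the following is a minimal sketch of how TextRank-style keywords could be turned into pseudo queries for unlabeled documents. It is an illustration under stated assumptions, not the authors' implementation: the function names `textrank_keywords` and `build_pseudo_pairs`, the small stopword list, the co-occurrence window, and the damping factor are all hypothetical choices, and the paper's actual pseudo-relevance feedback pipeline may differ in every detail.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
             "on", "with", "by", "that", "this", "are", "as"}


def textrank_keywords(text, top_k=5, window=3, damping=0.85, iterations=50):
    """Rank words with a simplified, unweighted TextRank and return the top_k."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]

    # Undirected co-occurrence graph: link each word to the words that
    # follow it within the sliding window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []

    # PageRank-style power iteration over the co-occurrence graph.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def build_pseudo_pairs(documents, top_k=5):
    """Pair each document with a pseudo query made of its top keywords."""
    pairs = []
    for doc in documents:
        keywords = textrank_keywords(doc, top_k=top_k)
        if keywords:
            pairs.append((" ".join(keywords), doc))
    return pairs
```

The resulting (pseudo query, document) pairs could then serve as positive examples when fine-tuning a dense retriever on the target domain, with no separate consistency filter trained to screen them. Whether such unfiltered pairs are clean enough is the question the paper's experiments examine.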