Abstract: Dense retrieval overcomes the lexical gap and has shown great success in ad-hoc information retrieval (IR). Despite this success, dense retrievers are expensive to serve in practical use cases. When searching over millions of documents, the dense index becomes bulky and requires a large amount of memory to store. More recently, learning-to-hash (LTH) techniques, e.g., BPR and JPQ, produce binary document vectors, thereby reducing the memory required to store the dense index. LTH techniques are supervised and fine-tune the retriever using a ranking loss. They outperform their counterparts, i.e., traditional out-of-the-box vector compression techniques such as PCA or PQ. A missing piece in prior work is that existing techniques have been evaluated only in-domain, i.e., on a single dataset such as MS MARCO. In our work, we evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever while maintaining efficiency at inference. Our results demonstrate that, contrary to prior work, LTH strategies applied naively can underperform the zero-shot TAS-B dense retriever by up to 14% nDCG@10 on average on the BEIR benchmark. To address this limitation, we propose an easy yet effective solution: injecting domain adaptation into existing supervised LTH techniques. We experiment with two well-known unsupervised domain adaptation techniques: GenQ and GPL. Our domain adaptation injection technique improves the downstream zero-shot retrieval effectiveness of both the BPR and JPQ variants of the TAS-B model by on average 11.5% and 8.2% nDCG@10, respectively, while maintaining 32$\times$ memory efficiency and achieving 14$\times$ and 2$\times$ speedups, respectively, in CPU retrieval latency on BEIR.
All our code, models, and data are publicly available at https://github.com/thakur-nandan/income.
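
The 32$\times$ memory figure in the abstract is consistent with storing one bit per embedding dimension instead of a 32-bit float. The following is a minimal sketch of that arithmetic and of Hamming-distance search over binary codes. It assumes 768-dimensional embeddings and uses plain sign thresholding on random vectors as an untrained stand-in; it is not the learned hashing of BPR or JPQ, and the names `binarize`, `hamming`, and `search` are hypothetical helpers introduced only for illustration.

```python
import random

DIM = 768  # assumed embedding size; not stated in the abstract


def binarize(vec):
    """Pack the sign of each dimension into one Python int (1 bit per dimension)."""
    code = 0
    for x in vec:
        code = (code << 1) | (1 if x > 0 else 0)
    return code


def hamming(a, b):
    """Count the bits that differ between two binary codes."""
    return bin(a ^ b).count("1")


def search(query_code, doc_codes, k=3):
    """Return the indices of the k codes closest to the query in Hamming distance."""
    order = sorted(range(len(doc_codes)),
                   key=lambda i: hamming(query_code, doc_codes[i]))
    return order[:k]


# Memory per document vector under the assumptions above.
float_bytes = DIM * 4     # float32: 4 bytes per dimension
binary_bytes = DIM // 8   # binary code: 1 bit per dimension
compression = float_bytes // binary_bytes

# Toy corpus of random vectors standing in for document embeddings.
rng = random.Random(0)
docs = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(100)]
doc_codes = [binarize(d) for d in docs]

# A query built as a lightly perturbed copy of document 42.
query = [x + rng.gauss(0, 0.1) for x in docs[42]]
top = search(binarize(query), doc_codes)
```

Under these assumptions each document shrinks from 3072 bytes to 96 bytes, and ranking reduces to XOR plus bit counting, which is one reason binary indexes can be fast on CPU. Learned methods such as BPR and JPQ train the codes with a ranking loss rather than thresholding a fixed encoder's output.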

LaPraDoR: Unsupervised Pretrained Dense Retriever for Zero-Shot Text Retrieval

Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval

Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training

Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently

A Thorough Examination on Zero-shot Dense Retrieval

Adversarial Retriever-Ranker for Dense Text Retrieval


ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval

Precise Zero-Shot Dense Retrieval without Relevance Labels

LED: Lexicon-Enlightened Dense Retriever for Large-Scale Retrieval

Zero-Shot Dense Retrieval with Embeddings from Relevance Feedback

Lora for dense passage retrieval of ConTextual masked auto-encoding

Longtriever: a Pre-trained Long Text Encoder for Dense Document Retrieval

Improving zero-shot retrieval using dense external expansion

Another Look at DPR: Reproduction of Training and Replication of Retrieval

A Distributed Collaborative Retrieval Framework Excelling in All Queries and Corpora based on Zero-shot Rank-Oriented Automatic Evaluation

Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval

PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval

Leveraging LLMs for Unsupervised Dense Retriever Ranking

LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval