Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes

Xueguang Ma, Tommaso Teofili, Jimmy Lin

2023-04-24

Abstract: Anserini is a Lucene-based toolkit for reproducible information retrieval research in Java that has been gaining traction in the community. It provides retrieval capabilities for both "traditional" bag-of-words retrieval models such as BM25 as well as retrieval using learned sparse representations such as SPLADE. With Pyserini, which provides a Python interface to Anserini, users gain access to both sparse and dense retrieval models, as Pyserini implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface. Nevertheless, hybrid fusion techniques that integrate sparse and dense retrieval models need to stitch together results from two completely different "software stacks", which creates unnecessary complexities and inefficiencies. However, the introduction of HNSW indexes for dense vector search in Lucene promises the integration of both dense and sparse retrieval within a single software framework. We explore exactly this integration in the context of Anserini. Experiments on the MS MARCO passage and BEIR datasets show that our Anserini HNSW integration supports (reasonably) effective and (reasonably) efficient approximate nearest neighbor search for dense retrieval models, using only Lucene.

Information Retrieval

What problem does this paper attempt to address?

The main problem this paper attempts to address is the integration of dense retrieval and sparse retrieval within the same software framework, to simplify the development and deployment of information retrieval systems. Specifically, the paper focuses on the following aspects:

1. **Existing Problems**:
   - Current dense retrieval models (such as those based on dual-encoder architectures) require a completely different "software stack" for efficient top-𝑘 retrieval, typically implemented using libraries like Faiss.
   - Sparse retrieval (such as BM25) relies on inverted indexes, primarily supported by libraries like Lucene.
   - Combining these two methods requires stitching together two different "software stacks," leading to unnecessary complexity and inefficiency.
2. **Solution**:
   - Use the HNSW indexes introduced in Lucene 9 to bring dense retrieval and sparse retrieval into the same framework.
   - Integrate HNSW indexing and search into the Anserini toolkit, providing a unified interface for both dense and sparse retrieval.
3. **Goals**:
   - Simplify the "research toolchain" for researchers, improving the efficiency of experimental development.
   - Give practitioners a fair comparison between the Lucene and Faiss HNSW implementations, including trade-offs in result quality, time, and space.

Through these efforts, the paper aims to provide a unified solution that allows researchers and practitioners to more conveniently combine the advantages of dense and sparse retrieval, thereby enhancing the overall performance of information retrieval systems.
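To make the "stitching" of sparse and dense results concrete, here is a minimal sketch of hybrid fusion over two ranked lists. The summary does not say which fusion method is used, so reciprocal rank fusion (RRF) is swapped in here purely as an illustration. The function name, the document ids, and the constant `k=60` are assumptions for this example and are not taken from Anserini or Pyserini.

```python
from collections import defaultdict


def reciprocal_rank_fusion(runs, k=60, depth=1000):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    runs  : list of ranked lists, best hit first (hypothetical input format)
    k     : smoothing constant; 60 is a commonly used default (assumption)
    depth : how many hits to read from each list
    """
    fused = defaultdict(float)
    for run in runs:
        for rank, docid in enumerate(run[:depth], start=1):
            # Each list contributes 1 / (k + rank) for every doc it returns.
            fused[docid] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical top-3 results from a sparse (e.g. BM25) and a dense retriever.
sparse_run = ["d3", "d1", "d7"]
dense_run = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([sparse_run, dense_run])
for docid, score in fused:
    print(docid, round(score, 6))
```

Rank-based fusion of this kind needs only the two result lists, which is why it works even when the sparse and dense retrievers live in separate software stacks. Serving both index types from one framework removes the need to run and coordinate those two stacks in the first place.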

Operational Advice for Dense and Sparse Retrievers: HNSW, Flat, or Inverted Indexes?

Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

A Unified Framework for Learned Sparse Retrieval

EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

Efficient and Interpretable Information Retrieval for Product Question Answering with Heterogeneous Data

Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search

Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations

SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval

Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval

PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval

DeeperImpact: Optimizing Sparse Learned Index Structures

Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers

On Single and Multiple Representations in Dense Passage Retrieval

Synergistic Interplay between Search and Large Language Models for Information Retrieval

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval