BM25S: Orders of magnitude faster lexical search via eager sparse scoring

Xing Han Lù
2024-07-04
Abstract: We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
Information Retrieval, Computation and Language
What problem does this paper attempt to address?
This paper focuses on improving the search speed of Python implementations of BM25 (Best Match 25), a widely used lexical retrieval algorithm that requires no training. Existing Python implementations are typically slower than efficient Java libraries such as Lucene-based implementations. The authors propose a new Python library, BM25S, that computes the BM25 score of every token with respect to every document at indexing time and stores the results in a sparse matrix. This makes it much faster than existing Python implementations and, in some cases, faster than the Java libraries. In addition, BM25S simplifies the implementation of different BM25 variants and introduces a fast Python tokenizer that supports stopword lists and optional stemming. The core improvements of BM25S include:

1. **Precomputation and storage of scores**: Compute the BM25 score that each token contributes to each document at indexing time and store it in a sparse matrix, so that query-time retrieval reduces to looking up and summing stored scores.
2. **Efficient sparse storage**: Store the matrix in compressed sparse column (CSC) format, which makes the slicing and summing operations fast and reduces memory usage.
3. **Simple tokenizer**: Combine Scikit-Learn's tokenization method with Elasticsearch's stopword list and an optional C implementation of the Snowball stemmer to improve performance.
4. **Fast selection of top k results**: Select the k most relevant documents with an algorithm of average time complexity O(n), avoiding the higher cost of sorting all scores.

Experiments in the paper show that BM25S achieves much higher throughput (queries per second, QPS) than other Python libraries such as Rank-BM25 across various datasets, with speedups of up to 500x in some cases. The paper also compares the impact of different tokenization strategies, BM25 variants, and parameter settings on performance.
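The precomputation, CSC storage, and top-k selection described above can be illustrated with a minimal sketch. It is not the BM25S library's code or API: the function names `build_index` and `retrieve`, the (documents x vocabulary) matrix orientation, the Lucene-style scoring formula, and the default values of `k1` and `b` are all assumptions chosen for illustration. The sketch relies only on NumPy and SciPy.

```python
from collections import Counter

import numpy as np
from scipy import sparse


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly score every (document, token) pair that occurs in the corpus.

    Hypothetical helper: returns a CSC matrix of shape (n_docs, n_vocab)
    and the token -> column mapping.
    """
    vocab = {}
    for doc in corpus_tokens:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))

    n_docs = len(corpus_tokens)
    doc_lens = np.array([len(doc) for doc in corpus_tokens], dtype=np.float64)
    avg_len = doc_lens.mean()

    # Document frequency: number of documents containing each token.
    df = np.zeros(len(vocab))
    for doc in corpus_tokens:
        for tok in set(doc):
            df[vocab[tok]] += 1

    # Assumed Lucene-style IDF, which is positive for every token.
    idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))

    rows, cols, vals = [], [], []
    for d, doc in enumerate(corpus_tokens):
        for tok, tf in Counter(doc).items():
            t = vocab[tok]
            norm = tf + k1 * (1.0 - b + b * doc_lens[d] / avg_len)
            rows.append(d)
            cols.append(t)
            vals.append(idf[t] * tf / norm)

    # CSC keeps each token's column contiguous, so column slicing is cheap.
    scores = sparse.csc_matrix((vals, (rows, cols)), shape=(n_docs, len(vocab)))
    return scores, vocab


def retrieve(scores, vocab, query_tokens, k=2):
    """Sum the stored columns for the query tokens and select the top k (k >= 1)."""
    cols = [vocab[t] for t in query_tokens if t in vocab]
    if not cols:
        return np.array([], dtype=int), np.array([])
    doc_scores = np.asarray(scores[:, cols].sum(axis=1)).ravel()
    k = min(k, doc_scores.size)
    # argpartition selects the k largest in O(n) on average; only those k are sorted.
    top = np.argpartition(doc_scores, -k)[-k:]
    top = top[np.argsort(doc_scores[top])[::-1]]
    return top, doc_scores[top]


corpus = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend".split(),
    "a bird is a beautiful animal that can fly".split(),
]
scores, vocab = build_index(corpus)
top_ids, top_scores = retrieve(scores, vocab, ["cat", "purr"], k=2)
```

In this sketch all BM25 arithmetic happens once in `build_index`. A query only slices a few columns and adds them, which is why the stored format matters: only (document, token) pairs that actually occur take up space, and a token that never appears in a document contributes zero to that document's score.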
In summary, this paper addresses the efficiency of BM25 in Python: BM25S handles text retrieval tasks much faster than existing Python implementations while maintaining accuracy.