Abstract: We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s

What problem does this paper attempt to address?

This paper focuses on improving the retrieval speed of Python-based implementations of BM25 (Best Match 25), a widely used lexical ranking function for text retrieval that requires no training. Existing Python implementations are typically slower than efficient Java libraries (such as Lucene-based implementations). The authors propose a new Python library called BM25S, which precomputes the BM25 score of every token for every document at indexing time and stores the results in a sparse matrix. This makes it significantly faster than existing Python implementations, and in some cases faster than the Java libraries as well. In addition, BM25S simplifies the implementation of different BM25 variants and introduces a fast Python tokenizer that supports stopword lists and optional stemming.

The core improvements of BM25S include:

1. **Precomputation and storage of scores**: Compute the BM25 score of each possible query term for each document at indexing time and store it in a sparse matrix, so that retrieval at query time reduces to looking up and adding stored values.
2. **Efficient sparse matrix storage**: Store the matrix in compressed sparse column (CSC) format, which makes slicing and summing efficient and reduces memory usage.
3. **Simple tokenizer**: Combine Scikit-Learn's tokenization method with Elasticsearch's stopword list, plus an optional Snowball stemmer implemented in C to improve performance.
4. **Fast selection of top k results**: Select the k most relevant documents with an algorithm whose average time complexity is O(n), avoiding the O(n log n) cost of sorting all scores.

Experiments in the paper show that BM25S achieves much higher throughput (queries per second, QPS) than other Python libraries such as Rank-BM25 across a range of datasets, with speedups of up to 500 times in some cases. The paper also compares the impact of different tokenization strategies, BM25 variants, and parameter settings on performance.
In summary, this paper addresses the efficiency of the Python-based BM25 algorithm, enabling it to handle text retrieval tasks faster while maintaining accuracy.
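
The eager scoring, CSC storage, and top-k selection steps described above can be illustrated with a minimal sketch that uses only NumPy and SciPy. This is not the BM25S source code or its API: the function names `build_index` and `retrieve`, the documents-by-tokens matrix orientation, the default values of `k1` and `b`, and the Lucene-style scoring formula are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse


def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly score every (document, token) pair that occurs in the corpus.

    Hypothetical helper, not part of the BM25S API. Returns a CSC matrix of
    shape (n_docs, n_vocab) and the token -> column id mapping.
    """
    vocab = {}
    rows, cols, tfs = [], [], []
    for doc_id, doc in enumerate(corpus_tokens):
        counts = {}
        for tok in doc:
            tid = vocab.setdefault(tok, len(vocab))
            counts[tid] = counts.get(tid, 0) + 1
        for tid, tf in counts.items():
            rows.append(doc_id)
            cols.append(tid)
            tfs.append(tf)

    rows = np.array(rows)
    cols = np.array(cols)
    tfs = np.array(tfs, dtype=np.float64)
    n_docs = len(corpus_tokens)
    doc_lens = np.array([len(d) for d in corpus_tokens], dtype=np.float64)

    # Document frequency: each (doc, token) pair appears once in `cols`.
    df = np.bincount(cols, minlength=len(vocab))
    # Assumed Lucene-style IDF and term-frequency saturation.
    idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = k1 * (1.0 - b + b * doc_lens[rows] / doc_lens.mean())
    scores = idf[cols] * tfs / (tfs + norm)

    # CSC keeps each token's column contiguous, so column slicing is cheap.
    matrix = sparse.csc_matrix(
        (scores, (rows, cols)), shape=(n_docs, len(vocab))
    )
    return matrix, vocab


def retrieve(matrix, vocab, query_tokens, k=10):
    """Sum the precomputed columns of the query tokens, then pick the top k."""
    ids = [vocab[t] for t in query_tokens if t in vocab]
    if ids:
        scores = np.asarray(matrix[:, ids].sum(axis=1)).ravel()
    else:
        scores = np.zeros(matrix.shape[0])
    k = min(k, scores.size)
    # argpartition selects the k largest in O(n) on average, without sorting
    # all n scores; only the k selected entries are sorted afterwards.
    top = np.argpartition(scores, -k)[-k:]
    top = top[np.argsort(scores[top])[::-1]]
    return top, scores[top]


corpus = [
    ["a", "cat", "sat"],
    ["a", "dog", "ran"],
    ["the", "cat", "and", "the", "cat"],
]
index, vocab = build_index(corpus)
doc_ids, doc_scores = retrieve(index, vocab, ["cat"], k=2)
```

Because all scoring happens in `build_index`, the query path only slices columns and adds them, which is where the speedup of eager scoring comes from. The sketch covers only variants whose score is zero when a token is absent from a document; the score shifting method the paper uses for non-sparse variants is not shown.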

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search

Optimizing Guided Traversal for Fast Learned Sparse Retrieval

Fast Object Retrieval Using Direct Spatial Matching

Enterprise-Scale Search: Accelerating Inference for Sparse Extreme Multi-Label Ranking Trees

skscope: Fast Sparsity-Constrained Optimization in Python

LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

AlphaSparse: Generating High Performance SpMV Codes Directly from Sparse Matrices

EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs

Sparse Bayesian multidimensional scaling(s)

SPLAT: A framework for optimised GPU code-generation for SParse reguLar ATtention

Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADE

Sparsity-Constraint Optimization via Splicing Iteration

PyBench: Evaluating LLM Agent on various real-world coding tasks

Listbm: A Learning-To-Rank Method For Xml Keyword Search

SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval

LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

LitSearch: A Retrieval Benchmark for Scientific Literature Search

Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval

Post-Training Sparse Attention with Double Sparsity