Abstract: Approximate nearest neighbor search (ANNS) for embedded vector representations of texts is commonly used in information retrieval, with two important information representations being sparse and dense vectors. While it has been shown that combining these representations improves accuracy, the current method of conducting sparse and dense vector searches separately suffers from low scalability and high system complexity. Alternatively, building a unified index faces challenges with accuracy and efficiency. To address these issues, we propose a graph-based ANNS algorithm for dense-sparse hybrid vectors. Firstly, we propose a distribution alignment method to improve accuracy, which pre-samples dense and sparse vectors to analyze their distance distribution statistics, resulting in a 1%$\sim$9% increase in accuracy. Secondly, to improve efficiency, we design an adaptive two-stage computation strategy that initially computes dense distances only and later computes hybrid distances. Further, we prune the sparse vectors to speed up the calculation. Compared to a naive implementation, we achieve $\sim2.1\times$ acceleration. Thorough experiments show that our algorithm achieves $8.9\times\sim11.7\times$ throughput at equal accuracy compared to existing hybrid vector search algorithms.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is how to perform approximate nearest neighbor search (ANNS) on dense-sparse hybrid vectors efficiently and effectively in information retrieval. Specifically, the paper proposes solutions to the following two key challenges:

1. **Low accuracy caused by data distribution differences**:
   - The data distributions of dense vectors and sparse vectors differ significantly, which makes it difficult to achieve the ideal retrieval effect when combining the two. For example, the same distance difference may represent different similarity differences in the dense space and the sparse space.
   - The paper improves retrieval accuracy by introducing a distribution alignment method. Through pre-sampling analysis of the distance distributions of dense and sparse vectors, the optimal fusion weight is determined.
2. **Low computational efficiency caused by high dimensionality and sparsity**:
   - The high dimensionality and sparsity of sparse vectors make distance calculation very costly. Especially in graph indexing, the inner product (IP) distance calculation for sparse vectors takes far longer than for dense vectors.
   - The paper proposes an adaptive two-stage calculation strategy. In the initial stage, only the dense distance is calculated; in the later stage, the hybrid distance is calculated. The calculation is further accelerated by pruning sparse vectors.

### Overview of solutions

1. **Distribution alignment method**:
   - By pre-sampling dense and sparse vectors and statistically analyzing their distance distributions, the optimal fusion weight is found. The formula is as follows:
     \[ f_h(q, d)=\alpha\cdot f(q_d, d_d)+(1 - \alpha)\cdot\gamma\cdot f(q_s^{\text{norm}}, d_s^{\text{norm}}) \]
     where $\alpha$ is the dense weight, $\gamma$ is the scale factor, and $f(q_s^{\text{norm}}, d_s^{\text{norm}})$ is the normalized sparse distance.
2. **Adaptive two-stage calculation strategy**:
   - In the initial stage, only the dense distance is used to construct and search the graph index, reducing unnecessary sparse distance calculations.
   - In the subsequent stage, the sparse distance is introduced for fine-tuning to ensure the accuracy of the final result.
3. **Sparse vector pruning**:
   - Prune the small-value elements in sparse vectors to reduce the time cost of inner product calculation, thereby improving the overall retrieval efficiency.

Through these methods, the paper achieves higher retrieval accuracy and faster retrieval speed than existing methods, with an end-to-end throughput improvement of 8.9 to 11.7 times on mainstream text retrieval datasets.
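
To make the three ideas above concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes that $f$ is the inner product, that sparse vectors arrive already normalized, and that $\alpha$ and $\gamma$ are given rather than estimated from sampled distance distributions. The graph traversal is replaced by a brute-force scan: stage one shortlists candidates by dense score only, and stage two re-scores that shortlist with the hybrid score. All names (`prune`, `hybrid_score`, `two_stage_search`, `shortlist`) and the top-`keep` pruning rule are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

SparseVec = Dict[int, float]              # term id -> weight (assumed normalized)
Doc = Tuple[List[float], SparseVec]       # (dense part, sparse part)


def dense_ip(q: List[float], d: List[float]) -> float:
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(q, d))


def sparse_ip(q: SparseVec, d: SparseVec) -> float:
    """Inner product of two sparse vectors, iterating over the shorter one."""
    if len(q) > len(d):
        q, d = d, q
    return sum(w * d.get(t, 0.0) for t, w in q.items())


def prune(v: SparseVec, keep: int) -> SparseVec:
    """Keep the `keep` largest-magnitude entries (assumed pruning rule)."""
    top = sorted(v.items(), key=lambda kv: abs(kv[1]), reverse=True)[:keep]
    return dict(top)


def hybrid_score(q: Doc, d: Doc, alpha: float, gamma: float) -> float:
    """f_h = alpha * f(q_d, d_d) + (1 - alpha) * gamma * f(q_s, d_s)."""
    return alpha * dense_ip(q[0], d[0]) + (1.0 - alpha) * gamma * sparse_ip(q[1], d[1])


def two_stage_search(query: Doc, docs: List[Doc], alpha: float, gamma: float,
                     shortlist: int, k: int) -> List[int]:
    """Brute-force stand-in for the two-stage strategy.

    Stage 1 ranks every document by dense score only and keeps `shortlist`
    candidates; stage 2 re-ranks those candidates by the hybrid score.
    """
    by_dense = sorted(range(len(docs)),
                      key=lambda i: dense_ip(query[0], docs[i][0]),
                      reverse=True)[:shortlist]
    by_hybrid = sorted(by_dense,
                       key=lambda i: hybrid_score(query, docs[i], alpha, gamma),
                       reverse=True)
    return by_hybrid[:k]


docs: List[Doc] = [
    ([1.0, 0.0], {1: 1.0}),
    ([0.9, 0.1], {2: 1.0, 3: 0.1}),
    ([0.0, 1.0], {2: 1.0}),
]
query: Doc = ([1.0, 0.0], {2: 1.0})
result = two_stage_search(query, docs, alpha=0.5, gamma=1.0, shortlist=2, k=1)
```

In this toy data, document 0 has the best dense score, but document 1 also matches the query's sparse term, so the hybrid re-scoring in stage two promotes it. Document 2 matches the sparse term too but never reaches stage two, which illustrates the trade-off of the dense-only first stage: it saves sparse inner products at the risk of dropping candidates whose relevance is mostly sparse.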

Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search
