Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search

Haoyu Zhang, Jun Liu, Zhenhua Zhu, Shulin Zeng, Maojia Sheng, Tao Yang, Guohao Dai, Yu Wang
2024-10-27
Abstract: ANNS for embedded vector representations of texts is commonly used in information retrieval, with two important information representations being sparse and dense vectors. While it has been shown that combining these representations improves accuracy, the current method of conducting sparse and dense vector searches separately suffers from low scalability and high system complexity. Alternatively, building a unified index faces challenges with accuracy and efficiency. To address these issues, we propose a graph-based ANNS algorithm for dense-sparse hybrid vectors. Firstly, we propose a distribution alignment method to improve accuracy, which pre-samples dense and sparse vectors to analyze their distance distribution statistics, resulting in a 1%$\sim$9% increase in accuracy. Secondly, to improve efficiency, we design an adaptive two-stage computation strategy that initially computes dense distances only and later computes hybrid distances. Further, we prune the sparse vectors to speed up the calculation. Compared to a naive implementation, we achieve $\sim2.1\times$ acceleration. Thorough experiments show that our algorithm achieves $8.9\times\sim11.7\times$ throughput at equal accuracy compared to existing hybrid vector search algorithms.
Information Retrieval
What problem does this paper attempt to address?
This paper addresses how to perform approximate nearest neighbor search (ANNS) on dense-sparse hybrid vectors efficiently and effectively in information retrieval. It targets two key challenges:

1. **Low accuracy caused by differences in data distribution**:
   - Dense and sparse vectors have significantly different data distributions, so combining the two does not readily give the desired retrieval quality. For example, the same gap in distance can correspond to different gaps in similarity in the dense space and in the sparse space.
   - The paper improves retrieval accuracy with a distribution alignment method: it pre-samples dense and sparse vectors, analyzes their distance distributions, and uses the result to determine the optimal fusion weight.

2. **Low computational efficiency caused by high dimensionality and sparsity**:
   - The high dimensionality and sparsity of sparse vectors make distance computation costly. In a graph index in particular, computing the inner product (IP) distance of sparse vectors takes far longer than computing it for dense vectors.
   - The paper proposes an adaptive two-stage computation strategy: the initial stage computes only the dense distance, and the later stage computes the hybrid distance. Pruning the sparse vectors speeds up the computation further.

### Overview of solutions

1. **Distribution alignment method**:
   - Dense and sparse vectors are pre-sampled and their distance distributions are analyzed statistically to find the optimal fusion weight. The hybrid distance is:

   \[ f_h(q, d)=\alpha\cdot f(q_d, d_d)+(1 - \alpha)\cdot\gamma\cdot f(q_s^{\text{norm}}, d_s^{\text{norm}}) \]

   where $\alpha$ is the dense weight, $\gamma$ is the scale factor, $f(q_d, d_d)$ is the dense distance, and $f(q_s^{\text{norm}}, d_s^{\text{norm}})$ is the normalized sparse distance.

2. **Adaptive two-stage computation strategy**:
   - In the initial stage, only the dense distance is used to construct and search the graph index, which avoids unnecessary sparse distance computations.
   - In the subsequent stage, the sparse distance is introduced to refine the candidates and ensure the accuracy of the final result.

3. **Sparse vector pruning**:
   - Small-valued elements of the sparse vectors are pruned to reduce the cost of the inner product computation, which improves overall retrieval efficiency.

With these methods, the paper reports higher retrieval accuracy and faster retrieval than existing methods, with an end-to-end throughput improvement of 8.9 to 11.7 times on mainstream text retrieval datasets.
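
To make the three ideas above concrete, here is a minimal Python sketch. It is not the paper's implementation, and several things are assumptions made for illustration: $f$ is treated as an inner-product similarity (higher is better), sparse vectors are plain `dict`s that are assumed to be normalized already, `alpha` and `gamma` are passed in rather than estimated from sampled distance distributions, and a brute-force scan stands in for the graph index. All function names (`hybrid_score`, `prune_sparse`, `two_stage_search`) are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

SparseVec = Dict[int, float]          # {dimension index: value}
Doc = Tuple[Sequence[float], SparseVec]  # (dense part, sparse part)


def dense_ip(q: Sequence[float], d: Sequence[float]) -> float:
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(q, d))


def sparse_ip(q: SparseVec, d: SparseVec) -> float:
    """Inner product of two sparse vectors; iterates over the shorter one."""
    if len(q) > len(d):
        q, d = d, q
    return sum(v * d[i] for i, v in q.items() if i in d)


def hybrid_score(q_dense: Sequence[float], d_dense: Sequence[float],
                 q_sparse: SparseVec, d_sparse: SparseVec,
                 alpha: float, gamma: float) -> float:
    """f_h = alpha * f(dense) + (1 - alpha) * gamma * f(sparse).

    Assumes the sparse vectors are already normalized.
    """
    return (alpha * dense_ip(q_dense, d_dense)
            + (1.0 - alpha) * gamma * sparse_ip(q_sparse, d_sparse))


def prune_sparse(vec: SparseVec, keep: int) -> SparseVec:
    """Keep only the `keep` entries with the largest absolute value."""
    return dict(heapq.nlargest(keep, vec.items(), key=lambda kv: abs(kv[1])))


def two_stage_search(q_dense: Sequence[float], q_sparse: SparseVec,
                     docs: List[Doc], alpha: float, gamma: float,
                     n_candidates: int, k: int) -> List[Tuple[int, float]]:
    """Stage 1 ranks by dense score only; stage 2 re-scores with f_h.

    A brute-force scan replaces graph traversal here (an assumption).
    """
    # Stage 1: cheap dense-only scoring selects a candidate set.
    candidates = heapq.nlargest(
        n_candidates, range(len(docs)),
        key=lambda i: dense_ip(q_dense, docs[i][0]))
    # Stage 2: the costlier hybrid score is computed only for candidates.
    rescored = [(i, hybrid_score(q_dense, docs[i][0], q_sparse, docs[i][1],
                                 alpha, gamma))
                for i in candidates]
    return heapq.nlargest(k, rescored, key=lambda pair: pair[1])
```

In this sketch the sparse inner product is evaluated only for the `n_candidates` documents that survive the dense-only stage, and `prune_sparse` could be applied to each document's sparse part beforehand to shorten that computation. How the paper switches between stages adaptively and how it chooses the pruning threshold are not covered here.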