Approximate Vector Set Search: A Bio-Inspired Approach for High-Dimensional Spaces

Yiqi Li, Sheng Wang, Zhiyu Chen, Shangfeng Chen, Zhiyong Peng
2024-12-04
Abstract: Vector set search, an underexplored similarity search paradigm, aims to find vector sets similar to a query set. This search paradigm leverages the inherent structural alignment between sets and real-world entities to model more fine-grained and consistent relationships for diverse applications. This task, however, faces more severe efficiency challenges than traditional single-vector search due to the combinatorial explosion of pairings in set-to-set comparisons. In this work, we aim to address the efficiency challenges posed by the combinatorial explosion in vector set search, as well as the curse of dimensionality inherited from single-vector search. To tackle these challenges, we present an efficient algorithm for vector set search, BioVSS (Bio-inspired Vector Set Search). BioVSS simulates the fly olfactory circuit to quantize vectors into sparse binary codes and then designs an index based on the set membership property of the Bloom filter. The quantization and indexing strategy enables BioVSS to efficiently perform vector set search by pruning the search space. Experimental results demonstrate over 50 times speedup compared to linear scanning on million-scale datasets while maintaining a high recall rate of up to 98.9%, making it an efficient solution for vector set search.
Databases
What problem does this paper attempt to address?
The paper addresses the efficiency of vector set search in high-dimensional spaces, in particular the computational cost caused by combinatorial explosion and the curse of dimensionality. Specifically:

1. **Combinatorial explosion**: Comparing two vector sets requires many pairwise vector comparisons, so the cost of set-to-set similarity search grows sharply with set size.
2. **Curse of dimensionality**: A problem inherited from single-vector search. As the dimension increases, the distances between data points become increasingly hard to tell apart, which makes the search harder.

To address these problems, the authors propose an efficient vector set search algorithm, BioVSS (Bio-inspired Vector Set Search). The algorithm quantizes vectors by simulating the fly (Drosophila) olfactory circuit and builds an index based on the set membership property of Bloom filters. This reduces the search space and the number of aggregation operations, thereby improving search efficiency.

### Specific contributions of the paper

1. **Defined the approximate vector set search problem using Hausdorff distance in high-dimensional space**: This is described as the first vector set search formulation that explicitly uses the Hausdorff distance as a native set distance.
2. **Proposed the BioVSS algorithm**: It uses the properties of locality-sensitive hashing (LSH) to accelerate vector set search, and the paper provides theoretical analysis and proofs of the method's correctness.
3. **Introduced the enhanced BioVSS++ version**: It adopts a two-level cascaded filter (an inverted index followed by a vector set sketch) to reduce unnecessary scans.
4. **Experimental verification**: Extensive experiments show that the method is more than 50 times faster than linear scanning on million-scale datasets while maintaining a recall rate of up to 98.9%.

### Formula representation

- **Hausdorff distance**:
  \[ H_{\text{aus}}(Q, V)=\max \left(\max _{q \in Q} \min _{v \in V} \text{dist}(q, v), \max _{v \in V} \min _{q \in Q} \text{dist}(v, q)\right) \]
  where \(\text{dist}(q, v)=\|q - v\|_2\) is the Euclidean distance between vectors \(q \in \mathbb{R}^d\) and \(v \in \mathbb{R}^d\).
- **Approximate top-\(k\) vector set search**:
  \[ R = \{V_1^*, V_2^*, \ldots, V_k^*\}=\operatorname{argmin}^{k}_{V_i \in D} H_{\text{aus}}(Q, V_i) \]
  where \(D\) is the database of vector sets, \(\operatorname{argmin}^{k}_{V_i \in D}\) selects the \(k\) sets \(V_i\) with the smallest \(H_{\text{aus}}(Q, V_i)\), and \(V_k^*\) is the set whose distance to the query \(Q\) is the \(k\)-th smallest.

Through these contributions, the paper offers an effective way to handle the computational cost of vector set search in high-dimensional spaces, which is of both theoretical and practical value.
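The Hausdorff distance and top-\(k\) definitions can be checked against a brute-force linear scan, which is the baseline the reported speedup is measured against. The following is a minimal NumPy sketch of that baseline only. The function names `hausdorff` and `top_k_linear_scan` and the toy data are illustrative assumptions, not code from the paper, and the sketch does not implement BioVSS itself.

```python
import numpy as np


def hausdorff(Q, V):
    """Symmetric Hausdorff distance between vector sets Q (m x d) and V (n x d)."""
    # Pairwise Euclidean distances, shape (m, n).
    dist = np.linalg.norm(Q[:, None, :] - V[None, :, :], axis=2)
    forward = dist.min(axis=1).max()   # max over q of min over v
    backward = dist.min(axis=0).max()  # max over v of min over q
    return float(max(forward, backward))


def top_k_linear_scan(Q, database, k):
    """Exact top-k by scanning every set in the database (hypothetical baseline)."""
    scores = [hausdorff(Q, V) for V in database]
    order = np.argsort(scores, kind="stable")[:k]
    return [(int(i), scores[int(i)]) for i in order]


# Toy data (assumed for illustration): a 2-vector query and three candidate sets.
Q = np.array([[0.0, 0.0], [1.0, 0.0]])
database = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),  # identical to Q
    np.array([[0.0, 0.0], [4.0, 0.0]]),  # one vector far from Q
    np.array([[0.0, 3.0]]),              # single vector, different set size
]

print(top_k_linear_scan(Q, database, k=2))  # [(0, 0.0), (1, 3.0)]
```

Each call to `hausdorff` costs on the order of \(|Q| \cdot |V| \cdot d\) operations, and the scan repeats it for every set in the database. That per-set pairwise cost is the combinatorial explosion described above, and it is what index-based pruning aims to avoid.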