Abstract: Vector set search, an underexplored similarity search paradigm, aims to find vector sets similar to a query set. This search paradigm leverages the inherent structural alignment between sets and real-world entities to model more fine-grained and consistent relationships for diverse applications. This task, however, faces more severe efficiency challenges than traditional single-vector search due to the combinatorial explosion of pairings in set-to-set comparisons. In this work, we aim to address the efficiency challenges posed by the combinatorial explosion in vector set search, as well as the curse of dimensionality inherited from single-vector search. To tackle these challenges, we present an efficient algorithm for vector set search, BioVSS (Bio-inspired Vector Set Search). BioVSS simulates the fly olfactory circuit to quantize vectors into sparse binary codes and then designs an index based on the set membership property of the Bloom filter. The quantization and indexing strategy enables BioVSS to efficiently perform vector set search by pruning the search space. Experimental results demonstrate over 50 times speedup compared to linear scanning on million-scale datasets while maintaining a high recall rate of up to 98.9%, making it an efficient solution for vector set search.
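The quantization and sketching idea in the abstract can be illustrated with a minimal sketch. It assumes the generic fly-hashing recipe (a sparse random binary projection followed by winner-take-all) and a Bloom-filter-like set sketch formed as the union of the codes of a set's vectors. The function names and the parameters `code_len`, `fan_in`, and `active` are hypothetical and are not taken from the paper, whose actual construction may differ.

```python
import random


def make_projection(dim, code_len, fan_in, seed=0):
    """Sparse binary projection: each of `code_len` output units
    sums `fan_in` randomly chosen input dimensions."""
    rng = random.Random(seed)
    return [rng.sample(range(dim), fan_in) for _ in range(code_len)]


def fly_hash(vec, projection, active):
    """Winner-take-all: keep the `active` most strongly activated units.
    The sparse binary code is returned as the set of indices that are 1."""
    activations = [sum(vec[i] for i in idxs) for idxs in projection]
    order = sorted(range(len(activations)),
                   key=lambda j: activations[j], reverse=True)
    return frozenset(order[:active])


def set_sketch(vectors, projection, active):
    """Bloom-filter-like sketch of a vector set: the union (bitwise OR)
    of the sparse codes of all vectors in the set."""
    sketch = set()
    for v in vectors:
        sketch |= fly_hash(v, projection, active)
    return sketch


# Usage: sets whose sketches share few active bits with the query sketch
# are candidates for pruning before any exact distance is computed.
proj = make_projection(dim=8, code_len=32, fan_in=3, seed=42)
query_set = [[0.9, 0.1, 0.0, 0.2, 0.7, 0.0, 0.1, 0.3]]
data_set = [[0.8, 0.2, 0.1, 0.1, 0.6, 0.0, 0.2, 0.4],
            [0.0, 0.9, 0.8, 0.0, 0.1, 0.7, 0.0, 0.1]]
overlap = len(set_sketch(query_set, proj, 4) & set_sketch(data_set, proj, 4))
```

Storing codes as index sets rather than dense bit arrays keeps the example short; a practical index would more likely use packed bitmaps.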

What problem does this paper attempt to address?

The paper addresses the efficiency of vector set search in high-dimensional space, where two sources of computational cost compound:

1. **Combinatorial explosion**: Comparing two vector sets requires many pairwise vector comparisons, so the cost of each set-to-set distance grows with the product of the set sizes.
2. **Curse of dimensionality**: A problem inherited from single-vector search. As the dimension increases, the distances between data points become harder to distinguish, which makes the search more difficult.

To solve these problems, the authors propose an efficient vector set search algorithm, BioVSS (Bio-inspired Vector Set Search). The algorithm quantizes vectors by simulating the fly (Drosophila) olfactory circuit and builds an index based on the set membership property of Bloom filters. This reduces both the search space and the number of aggregation operations, which improves search efficiency.

### Specific contributions of the paper

1. **Defined the approximate vector set search problem using Hausdorff distance in high-dimensional space**: According to the authors, this is the first formulation of vector set search that explicitly uses the Hausdorff distance as its native set-to-set metric.
2. **Proposed the BioVSS algorithm**: It uses the properties of locality-sensitive hashing (LSH) to accelerate vector set search, and the paper provides theoretical analysis and proofs of the method's correctness.
3. **Proposed the enhanced BioVSS++ version**: It adopts a two-level cascaded filter (consisting of an inverted index and a vector set sketch) to reduce unnecessary scans.
4. **Experimental verification**: Extensive experiments show that the method is more than 50 times faster than linear scanning on million-scale datasets while maintaining a recall rate of up to 98.9%.

### Formula representation

- **Hausdorff distance**:
  \[ H_{\text{aus}}(Q, V)=\max \left(\max _{q \in Q} \min _{v \in V} \text{dist}(q, v), \max _{v \in V} \min _{q \in Q} \text{dist}(v, q)\right) \]
  where \(\text{dist}(q, v)=\|q - v\|_2\) is the Euclidean distance between vectors \(q \in \mathbb{R}^d\) and \(v \in \mathbb{R}^d\).
- **Approximate top-\(k\) vector set search**: the result set that the search aims to recover is
  \[ R = \{V_1^*, V_2^*, \ldots, V_k^*\}=\underset{V_i \in D}{\arg\min{}^{k}}\; H_{\text{aus}}(Q, V_i) \]
  where \(\arg\min^{k}_{V_i \in D}\) selects the \(k\) sets \(V_i\) in the database \(D\) that minimize \(H_{\text{aus}}(Q, V_i)\), and the distance between \(V_k^*\) and the query \(Q\) is the \(k\)-th smallest.

Through these contributions, the paper provides a method for reducing the computational cost of vector set search in high-dimensional space, with both theoretical analysis and experimental support.
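As a reference point for the two formulas above, the following minimal sketch computes the Hausdorff distance exactly and answers a top-\(k\) query by linear scan, the baseline that the reported speedup is measured against. The function names `hausdorff` and `top_k_linear_scan` are illustrative assumptions, not identifiers from the paper.

```python
import heapq
import math


def hausdorff(Q, V):
    """Hausdorff distance between two non-empty sets of equal-length vectors,
    using Euclidean distance between individual vectors."""
    def directed(A, B):
        # Largest distance from a point of A to its nearest point in B.
        return max(min(math.dist(a, b) for b in B) for a in A)
    return max(directed(Q, V), directed(V, Q))


def top_k_linear_scan(Q, database, k):
    """Exact top-k by scanning every set in the database.
    Returns (distance, index) pairs in ascending order of distance."""
    scored = ((hausdorff(Q, V), i) for i, V in enumerate(database))
    return heapq.nsmallest(k, scored)


# Worked example: the point (4, 0) in V is 3 away from its nearest
# neighbour (1, 0) in Q, and that gap dominates both directions.
Q = [(0.0, 0.0), (1.0, 0.0)]
V = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(Q, V))  # 3.0
```

Each call to `hausdorff` evaluates all |Q| x |V| pairwise distances twice, which makes the combinatorial cost of scanning a large database concrete.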

Approximate Vector Set Search: A Bio-Inspired Approach for High-Dimensional Spaces
