HAP: an Efficient Hamming Space Index Based on Augmented Pigeonhole Principle

Qiyu Liu,Yanyan Shen,Lei Chen
DOI: https://doi.org/10.1145/3514221.3517880
2022-01-01
Abstract:The emerging deep learning techniques prefer mapping complex data objects (e.g., images, documents) to compact binary vectors (i.e., hash codes) for efficient similarity search. In this paper, we study the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity queries. Existing Hamming space indices usually divide long binary vectors into short disjoint pieces and apply the Pigeonhole Principle to prune unnecessary candidates. In our work, we relax the disjoint partition constraint by allowing dimension redundancy, which yields a tighter pruning bound named Augmented Pigeonhole Principle (APP). Intuitively, APP enables more optimization opportunities by capturing the correlation between database and query workloads. Based on APP, we propose HAP, an efficient Hamming space index framework to support both Hamming range queries and k-NN queries. To guide index construction and run-time query optimization, we introduce a novel DL-base query cardinality estimator named SimCardNet. To further reduce the index space cost, we propose a learned index compression scheme by combining the piece-wise linear approximation (PLA) and Elias-Fano encoding. In addition, we also study the problem of optimizing the execution time of a batch of queries using our index framework. The experimental results on large-scale binary databases reveal that our indexing scheme outperforms the state-of-the-art baselines in terms of both space and time efficiency.
What problem does this paper attempt to address?