DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph Index using Query-sensitivity Entry Vertex

Jiongkang Ni, Xiaoliang Xu, Yuxiang Wang, Can Li, Jiajie Yao, Shihai Xiao, Xuecang Zhang
2023-11-30
Abstract: Given a vector dataset $\mathcal{X}$ and a query vector $\vec{x}_q$, graph-based Approximate Nearest Neighbor Search (ANNS) aims to build a graph index $G$ and approximately return the vectors with minimum distances to $\vec{x}_q$ by searching over $G$. The main drawback of graph-based ANNS is that a graph index can be too large to fit into memory, especially for a large-scale $\mathcal{X}$. To solve this, a Product Quantization (PQ)-based hybrid method called DiskANN was proposed; it stores a low-dimensional PQ index in memory and keeps the graph index on SSD, thus reducing memory overhead while ensuring high search accuracy. However, it suffers from two I/O issues that significantly affect overall efficiency: (1) a long routing path from the entry vertex to the query's neighborhood, which results in a large number of I/O requests, and (2) redundant I/O requests during the routing process. We propose an optimized DiskANN++ to overcome these issues. Specifically, for the first issue, we present a query-sensitive entry vertex selection strategy that replaces DiskANN's static graph-central entry vertex with a dynamically determined entry vertex that is close to the query. For the second issue, we present an isomorphic mapping on DiskANN's graph index to optimize the SSD layout, and we propose an asynchronously optimized Pagesearch based on the optimized SSD layout as an alternative to DiskANN's beamsearch. Comprehensive experimental studies on eight real-world datasets demonstrate DiskANN++'s superiority in efficiency. We achieve a notable 1.5× to 2.2× improvement in QPS compared to DiskANN, given the same accuracy constraint.
Information Retrieval, Databases
What problem does this paper attempt to address?
This paper addresses two key issues in large-scale graph-based Approximate Nearest Neighbor Search (ANNS):

1. **Long Routing Path Problem**:
   - **Background**: DiskANN uses a static graph-central vertex as its entry point. When the query is far from that vertex, the routing path becomes long, which leads to a large number of I/O requests.
   - **Solution**: A Query-Sensitive Entry Vertex Selection strategy is proposed, which dynamically determines the entry vertex at runtime to shorten the routing path.
2. **Redundant I/O Requests Problem**:
   - **Background**: DiskANN's SSD layout randomly allocates vertices, resulting in a large number of redundant I/O requests during the query refinement phase.
   - **Solution**: An Isomorphic Mapping method is proposed to optimize the SSD layout of the Vamana graph index. Based on the optimized layout, a new page-level search algorithm (Pagesearch) is designed to reduce redundant I/O requests.

With these two improvements, DiskANN++ significantly increases Queries Per Second (QPS). Experiments on multiple real-world datasets show that, compared to DiskANN, DiskANN++ improves QPS by 1.5 to 2.2 times under the same accuracy constraint.
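
The difference between a static graph-central entry vertex and a query-sensitive one can be illustrated with a small sketch. This is not the paper's implementation: the candidate set here is a hypothetical random sample of vertex ids (the summary above does not say how candidates are chosen), distances are computed on full-precision vectors, and the names `static_entry`, `sample_candidates`, and `query_sensitive_entry` are invented for illustration.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def static_entry(vectors):
    """Id of the vertex closest to the dataset centroid.

    Stands in for a static, graph-central entry vertex: the same
    starting point is used for every query.
    """
    dim = len(vectors[0])
    center = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return min(range(len(vectors)), key=lambda i: l2(vectors[i], center))


def sample_candidates(vectors, k, seed=0):
    """Hypothetical candidate set: k sampled vertex ids plus the static entry.

    Assumption for illustration only; a real system may choose its
    candidates differently.
    """
    rng = random.Random(seed)
    ids = set(rng.sample(range(len(vectors)), k))
    ids.add(static_entry(vectors))
    return sorted(ids)


def query_sensitive_entry(vectors, candidates, query):
    """Pick, per query, the candidate vertex nearest to the query."""
    return min(candidates, key=lambda i: l2(vectors[i], query))


# Toy data: 500 random 8-dimensional vectors and one random query.
rng = random.Random(42)
vectors = [[rng.uniform(0, 1) for _ in range(8)] for _ in range(500)]
candidates = sample_candidates(vectors, k=32)
query = [rng.uniform(0, 1) for _ in range(8)]

fixed = static_entry(vectors)
dynamic = query_sensitive_entry(vectors, candidates, query)
print("static entry distance: ", l2(vectors[fixed], query))
print("dynamic entry distance:", l2(vectors[dynamic], query))
```

Because the static entry is included among the candidates in this sketch, the per-query choice is never farther from the query than the static one. A closer starting point is what would be expected to shorten the routing path and reduce the number of I/O requests, though the sketch itself does not measure I/O.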