Characterizing the Dilemma of Performance and Index Size in Billion-Scale Vector Search and Breaking It with Second-Tier Memory

Rongxin Cheng,Yifan Peng,Xingda Wei,Hongrui Xie,Rong Chen,Sijie Shen,Haibo Chen
2024-05-07
Abstract:Vector searches on large-scale datasets are critical to modern online services like web search and RAG, which necessity storing the datasets and their index on the secondary storage like SSD. In this paper, we are the first to characterize the trade-off of performance and index size in existing SSD-based graph and cluster indexes: to improve throughput by 5.7$\times$ and 1.7$\times$, these indexes have to pay a 5.8$\times$ storage amplification and 7.7$\times$ with respect to the dataset size, respectively. The root cause is that the coarse-grained access of SSD mismatches the fine-grained random read required by vector indexes with small amplification.
Distributed, Parallel, and Cluster Computing,Databases,Information Retrieval
What problem does this paper attempt to address?