DIDS: Double Indices and Double Summarizations for Fast Similarity Search
Han Hu, Jiye Qiu, Hongzhi Wang, Bin Liang, Songling Zou
DOI: https://doi.org/10.14778/3665844.3665851
IF: 2.5
2024-05-01
Proceedings of the VLDB Endowment
Abstract: Data series are a significant data form in many applications, making it imperative to devise a data series index that supports both approximate and exact similarity search over large data series collections in high-dimensional metric spaces. State-of-the-art approaches employ summarizations and indices to reduce accesses to the data series. However, we identify two significant flaws that severely limit their performance. First, they often employ segment-based summarizations, whose lower-bound distances decrease significantly when a summarization represents a collection of data series, resulting in numerous invalid accesses. Second, disk-based indices for exact search mainly rely on tree-based structures, which yield low-quality approximate answers and consequently degrade exact search. To address these problems, we propose a novel solution, Double Indices and Double Summarizations (DIDS). In addition to segment-based summarizations, DIDS introduces reference-point-based summarizations that improve the pruning rate through a sorted-based representation strategy. Moreover, DIDS employs reference points and a cost model to cluster similar data series, and uses a graph-based approach to interconnect different regions, enhancing approximate search capabilities. Experiments on extensive datasets validate the superior search performance of DIDS.
computer science, information systems, theory & methods
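
The abstract refers to two kinds of summarization whose lower-bound distances let a search skip data series without reading them. The sketch below is not the DIDS implementation. It only illustrates the two generic ideas under stated assumptions: a segment-based bound using Piecewise Aggregate Approximation (PAA), and a reference-point bound using the triangle inequality. All function names are hypothetical, Euclidean distance is assumed, and the series length is assumed to be divisible by the number of segments.

```python
import math


def paa(series, segments):
    """Segment-based summary: the mean of each equal-width segment.

    Assumes len(series) is divisible by `segments`.
    """
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width
            for i in range(segments)]


def segment_lower_bound(query, candidate_paa, segments):
    """PAA lower bound on the Euclidean distance between two series:
    sqrt(width) * ||PAA(q) - PAA(c)|| <= ||q - c||.
    """
    width = len(query) // segments
    return math.sqrt(width) * math.dist(paa(query, segments), candidate_paa)


def reference_point_lower_bound(query_dists, candidate_dists):
    """Triangle-inequality bound: |d(q, p) - d(c, p)| <= d(q, c) for every
    reference point p, so the largest such gap is also a lower bound.

    Both arguments hold distances to the same reference points, in order.
    """
    return max(abs(a - b) for a, b in zip(query_dists, candidate_dists))


def can_prune(query, candidate_paa, query_dists, candidate_dists,
              segments, best_so_far):
    """Skip a candidate when either lower bound already exceeds the best
    distance found so far (hypothetical pruning rule for illustration)."""
    bound = max(segment_lower_bound(query, candidate_paa, segments),
                reference_point_lower_bound(query_dists, candidate_dists))
    return bound > best_so_far
```

In a setup like this, the PAA vector and the distances to the reference points would be computed once per stored series, so that `can_prune` needs only the summaries and never the raw candidate. Taking the maximum of the two bounds is one simple way to combine them, since each is individually a valid lower bound.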