An Efficient Framework for Exact Set Similarity Search Using Tree Structure Indexes.

Yong Zhang, Xiuxing Li, Jin Wang, Ying Zhang, Chunxiao Xing, Xiaojie Yuan
DOI: https://doi.org/10.1109/icde.2017.127
2017-01-01
Abstract: Similarity search is an essential operation in many applications. Given a collection of set records and a query, exact set similarity search aims to find all records in the collection that are similar to the query. Existing methods adopt a filter-and-verify framework that makes use of inverted indexes. However, because the complexity of verification is rather low for set-based similarity metrics, these methods fail to strike a good tradeoff between filter power and filter cost. In this paper, we propose an efficient framework for exact set similarity search based on a tree index structure. We define a hash-based ordering to effectively import data into the index structure and then make optimizations to reduce the filter cost. To further improve the filter power, we propose a dynamic algorithm that partitions the dataset into several parts, and we build a multiple-index framework on top of it. Experimental results on real-world datasets show that our method significantly outperforms the state-of-the-art algorithms.
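To make the problem statement concrete, the following is a minimal sketch of exact set similarity search in the filter-and-verify style. It is not the tree-index method proposed in the paper. It assumes Jaccard similarity as the metric (the abstract does not name one) and uses the standard length filter as the only filtering step. The names `jaccard`, `similarity_search`, and `EPS` are hypothetical and chosen for illustration.

```python
from typing import List, Set, Tuple

# Small tolerance so that floating-point rounding in the filter
# never discards a record that actually meets the threshold.
EPS = 1e-9


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (assumed metric)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_search(
    records: List[Set[str]], query: Set[str], threshold: float
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for every record with Jaccard >= threshold."""
    results = []
    for idx, rec in enumerate(records):
        # Filter step (length filter): Jaccard >= t implies
        # t * |query| <= |rec| <= |query| / t, so other sizes are skipped cheaply.
        if len(rec) < threshold * len(query) - EPS:
            continue
        if len(rec) * threshold > len(query) + EPS:
            continue
        # Verify step: compute the exact similarity for surviving candidates.
        sim = jaccard(rec, query)
        if sim >= threshold:
            results.append((idx, sim))
    return results


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"x", "y"}, {"a"}]
    print(similarity_search(data, {"a", "b", "c"}, 0.6))
```

In this sketch the verify step is a single set intersection and union, which illustrates the abstract's point that verification is cheap for set-based metrics, so a filter only pays off if it costs less than the verifications it avoids.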