Query Sampling Based High Dimensional Hybrid Index

Junqi Zhang, Xiangdong Zhou
2007-01-01
Abstract: Sparse and aggregate data exist in the same multimedia data set, so selecting an appropriate strategy to index such data is very difficult. To solve this problem, we propose a novel hybrid index to speed up the processing of high-dimensional K-nearest neighbor (KNN) queries. In the first step, cluster analysis and cluster splitting methods are applied to construct a tree-based index; the relationship between data distribution and index performance is then derived by sampling. Finally, tree branches containing sparse data are extracted for linear scan, while the aggregate data remains in the tree. The complexity of the proposed sampling algorithm is only √N, where N is the size of the data set. The proposed hybrid index improves query efficiency by adaptively selecting different index strategies for data with different distributions. Extensive experiments show that the proposed hybrid index structure performs better than iDistance, M-Tree, and linear scan, and scales better with the number of dimensions. The index is still faster than linear scan when the dimension reaches 336.
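The three steps described in the abstract (cluster, sample, split into tree and linear-scan parts) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the concrete data structures or the decision rule, so everything below is an assumption made for illustration: a tiny k-means stands in for the cluster analysis and splitting step, a flat list of bounding spheres stands in for the tree, about √N data points are reused as sample queries, and a cluster is moved to the linear-scan part when more than a hypothetical `visit_threshold` fraction of the sample queries fail to prune it. All function and parameter names are likewise hypothetical.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_clusters(points, n_clusters, iters=5, seed=0):
    """Tiny k-means (assumption: stands in for cluster analysis/splitting)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            groups[i].append(p)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    # Each non-empty cluster is summarized by a bounding sphere.
    return [
        {"center": c, "radius": max(dist(c, p) for p in g), "points": g}
        for g, c in zip(groups, centers) if g
    ]


def knn(query, k, clusters, scan_points, visits=None):
    """Exact KNN: scan the linear part, then visit clusters with pruning."""
    heap = []  # max-heap on distance, stored as (-distance, point)

    def offer(p):
        d = dist(query, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))

    for p in scan_points:
        offer(p)
    # Lower bound on the distance from the query to any point in a cluster.
    order = sorted(
        (max(0.0, dist(query, c["center"]) - c["radius"]), i)
        for i, c in enumerate(clusters)
    )
    for lower, i in order:
        if len(heap) == k and lower >= -heap[0][0]:
            break  # every remaining cluster is at least this far away
        if visits is not None:
            visits[i] += 1
        for p in clusters[i]["points"]:
            offer(p)
    return sorted((-nd, p) for nd, p in heap)


def build_hybrid(points, n_clusters, k, visit_threshold=0.5, seed=0):
    """Sample about sqrt(N) queries, then move rarely pruned clusters to a scan list."""
    clusters = build_clusters(points, n_clusters, seed=seed)
    rng = random.Random(seed)
    n_samples = max(1, math.isqrt(len(points)))
    visits = [0] * len(clusters)
    for q in rng.sample(points, n_samples):
        knn(q, k, clusters, [], visits)
    tree, scan = [], []
    for c, v in zip(clusters, visits):
        if v / n_samples > visit_threshold:
            scan.extend(c["points"])  # pruning rarely helps: scan linearly
        else:
            tree.append(c)
    return tree, scan
```

A query is then answered with `knn(q, k, tree, scan)`, where `tree, scan = build_hybrid(points, n_clusters, k)`. Because the linear part is scanned in full and the pruning test uses a valid lower bound, the sketch returns the same neighbors as a brute-force search (up to ties); only the amount of work per query depends on how the data was divided.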