A Framework for Ranking and KNN Queries in a Probabilistic Skyline Model

Jianguo Li, Gabriel Pui Cheong Fung, Wei Zhou, Wei‐Ping Huang
DOI: https://doi.org/10.12733/jcis13726
2015-01-01
Abstract: Skyline computation has gained a lot of attention in recent years. By definition, objects that belong to the skyline cannot be ranked among themselves because they are incomparable. This constraint limits the application of skylines. Fortunately, under the recently proposed probabilistic skyline model, skyline objects that contain multiple elements can now be compared with each other. Unlike the traditional skyline model, where each object either is a skyline object or is not, the probabilistic skyline model assigns each object a skyline probability that denotes its likelihood of being a skyline object. Under this model, two simple but important questions naturally arise: (1) Given an object, which objects are its K nearest neighbors based on their skyline probabilities? (2) Given an object, what is the ranking of the objects whose skyline probabilities are greater than that of the given object? To the best of our knowledge, no existing work can effectively answer these two questions. Yet answering them is not trivial. For a medium-size dataset (e.g., 10,000 objects), it may take more than an hour to compute the skyline probabilities of all objects. In this paper, we propose a novel framework for answering these two questions efficiently on the fly. Our proposed work is based on a bounding-pruning-refining strategy. We first compute the skyline probabilities of the target object and all its elements. For the remaining objects, instead of computing their exact skyline probabilities, we compute upper and lower bounds on their skyline probabilities using the elements of the target object. Based on these bounds, objects that cannot be in the result are pruned. For objects whose membership in the result is still undetermined, we refine the bounds. The refinement strategy is based on the idea of space partitioning.
Specifically, we first partition the whole data space into several subspaces based on the distribution of the elements of the target object. When we iteratively refine the bounds, we apply the partitioning strategy within each subspace. To implement this framework, a novel tree called the Space Partition Tree (SPTree) is proposed to index the objects and their elements. We evaluate our proposed work using three synthetic datasets and one real-life dataset. We report all our findings in this paper.
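
To make the model and the pruning step concrete, the following is a minimal Python sketch and not the paper's SPTree algorithm. It assumes a definition commonly used in the probabilistic skyline literature: the elements of an object are equally likely, objects are independent, smaller values are preferred in every dimension, and an object's skyline probability is the average over its elements of the probability that no other object dominates that element. The function names, the toy data, and the interval-based pruning rule for question (1) are illustrative assumptions; the abstract does not specify how its bounds are derived.

```python
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller values are preferred in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def element_skyline_probability(u: Point, owner: str,
                                objects: Dict[str, List[Point]]) -> float:
    """Probability that element u is not dominated by any other object."""
    prob = 1.0
    for name, elements in objects.items():
        if name == owner:
            continue
        dominating = sum(1 for v in elements if dominates(v, u))
        # Each element of the other object is assumed equally likely.
        prob *= 1.0 - dominating / len(elements)
    return prob


def object_skyline_probability(owner: str,
                               objects: Dict[str, List[Point]]) -> float:
    """Average of the skyline probabilities of the object's elements."""
    elements = objects[owner]
    total = sum(element_skyline_probability(u, owner, objects) for u in elements)
    return total / len(elements)


def prune_for_knn(target_prob: float,
                  bounds: Dict[str, Tuple[float, float]],
                  k: int) -> List[str]:
    """Keep only objects that can still be among the k nearest neighbors.

    bounds maps each object to (lower, upper) bounds on its skyline
    probability; distance is |probability - target_prob|.
    """
    def min_dist(lo: float, hi: float) -> float:
        if lo <= target_prob <= hi:
            return 0.0
        return min(abs(lo - target_prob), abs(hi - target_prob))

    def max_dist(lo: float, hi: float) -> float:
        return max(abs(lo - target_prob), abs(hi - target_prob))

    if len(bounds) <= k:
        return sorted(bounds)
    # At least k objects are guaranteed to lie within this distance.
    kth_max = sorted(max_dist(lo, hi) for lo, hi in bounds.values())[k - 1]
    return sorted(name for name, (lo, hi) in bounds.items()
                  if min_dist(lo, hi) <= kth_max)


# Toy data (hypothetical): three objects with two-dimensional elements.
objects = {
    "A": [(1, 1), (3, 3)],
    "B": [(2, 2), (4, 4)],
    "C": [(5, 5)],
}
print(object_skyline_probability("A", objects))  # 0.75
print(object_skyline_probability("B", objects))  # 0.25
```

The brute-force computation compares every element against every element of every other object, which is consistent with the abstract's remark that computing all skyline probabilities is expensive even for medium-size datasets. In the pruning sketch, an object is discarded only when even its best-case distance to the target exceeds the k-th smallest worst-case distance, so objects whose intervals remain ambiguous are kept for refinement.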