Ranking and Clustering in Probabilistic Databases
Jian Li, Barna Saha, Amol Deshpande
2008-01-01
Abstract: The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we address the problem of on-the-fly clustering and ranking over probabilistic databases. We begin with a systematic exploration of ranking in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that can approximate many of the previously proposed ranking functions. We present several novel algorithms for efficiently computing such ranking functions using generating functions, even over databases that exhibit complex correlation patterns modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and develop an approach to learn such parameters. We also develop a hierarchical framework for efficiently combining on-the-fly clustering and ranking (called a ClusterRank query) over probabilistic databases. Our framework is based on a general definition of clustering, called restricted soft-t clustering, where a tuple is allowed to participate in at most t clusters. We show how several of our ranking functions can be seamlessly integrated into this framework, which not only allows ranking to continue in parallel with clustering, but also enables pruning of a large portion of the search space. Finally, we present a comprehensive experimental study comparing different ranking functions, and illustrating the effectiveness of our clustering framework.
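
To make the generating-function idea mentioned in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the simplest setting: tuples exist independently, scores are distinct, and a parameterized ranking function is a weighted sum of a tuple's rank-position probabilities. The names `rank_distributions`, `prf`, and `weight` are hypothetical and chosen only for illustration. For a tuple with existence probability p, the coefficient of x^k in the product of (1 - q) + q·x over all higher-scored tuples (with probabilities q) is the probability that exactly k of them are present, so p times that coefficient is the probability that the tuple appears at rank k + 1.

```python
from typing import Callable, List, Tuple


def rank_distributions(
    tuples: List[Tuple[float, float]]
) -> List[Tuple[float, List[float]]]:
    """For (score, probability) pairs, assumed independent with distinct
    scores, return (score, dist) in descending score order, where
    dist[k] = Pr(tuple exists and has rank k + 1)."""
    ordered = sorted(tuples, key=lambda t: -t[0])
    # poly[k] = Pr(exactly k of the higher-scored tuples are present);
    # it holds the coefficients of the generating function built so far.
    poly = [1.0]
    result = []
    for score, p in ordered:
        result.append((score, [p * c for c in poly]))
        # Multiply the generating function by ((1 - p) + p * x).
        nxt = [0.0] * (len(poly) + 1)
        for k, c in enumerate(poly):
            nxt[k] += c * (1.0 - p)
            nxt[k + 1] += c * p
        poly = nxt
    return result


def prf(
    tuples: List[Tuple[float, float]], weight: Callable[[int], float]
) -> List[Tuple[float, float]]:
    """Assumed form of a parameterized ranking value: the sum over rank
    positions i of weight(i) * Pr(tuple is at rank i)."""
    return [
        (score, sum(weight(k + 1) * q for k, q in enumerate(dist)))
        for score, dist in rank_distributions(tuples)
    ]


if __name__ == "__main__":
    data = [(10.0, 0.5), (8.0, 0.5)]
    # A step weight gives the probability of being ranked first.
    print(prf(data, lambda i: 1.0 if i <= 1 else 0.0))
    # A geometric weight alpha**i with alpha = 0.5.
    print(prf(data, lambda i: 0.5 ** i))
```

Under these assumptions, different choices of `weight` yield different ranking behaviors from the same rank distributions, which is the sense in which one parameterized function can stand in for several specific ones. This sketch takes quadratic time in the number of tuples and does not handle the correlated cases (and/xor trees, Markov networks) that the abstract refers to.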