Mining The Most General Multidimensional Summarization of "Probable Groups" in Data Warehouses

Hui Yu, Jian Pei, Shiwei Tang, Dongqing Yang
2005-01-01
Abstract: Data summarization is an important data analysis task in data warehousing and online analytic processing. In this paper, we consider a novel type of summarization query, the probable group query, such as "What are the groups of patients whose chance of getting lung cancer is 50% or more above the average?" An aggregate cell satisfying the requirement is called a probable group. To make the answer succinct and effective, we propose that only the most general probable groups should be mined. For example, if both groups (smoking, drinking) and (smoking, *) are probable, then the former group should not be returned. The problem of mining the most general probable groups is challenging since the probable groups can be widely scattered in the cube lattice and do not exhibit any monotonicity in the group containment order. We extend the state-of-the-art BUC algorithm to tackle the problem, and develop techniques and heuristics to speed up the search. An extensive performance study is reported to illustrate the effect of our approach.
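To make the notion of a most general probable group concrete, the following is a minimal brute-force sketch on toy data. It is not the BUC-based method the abstract refers to: it simply enumerates every cell of a tiny cube. It assumes that "50% or more above the average" means the cell's rate is at least 1.5 times the overall rate; the function names, the `lift` and `min_support` parameters, and the patient data are all hypothetical and chosen only for illustration.

```python
from itertools import product

STAR = "*"  # wildcard meaning "any value" in a dimension


def covers(cell, row):
    """True if the aggregate cell matches the base row."""
    return all(c == STAR or c == v for c, v in zip(cell, row))


def is_more_general(a, b):
    """True if cell a is strictly more general than cell b."""
    return a != b and all(x == STAR or x == y for x, y in zip(a, b))


def most_general_probable_groups(rows, labels, lift=1.5, min_support=1):
    """Brute-force enumeration; only feasible for tiny cubes.

    Assumption: a cell is 'probable' when its positive rate is at
    least `lift` times the overall positive rate.
    """
    overall = sum(labels) / len(labels)
    if overall == 0:
        return []  # no positive cases, so no group can beat the average
    n_dims = len(rows[0])
    domains = [sorted({r[i] for r in rows}) + [STAR] for i in range(n_dims)]
    probable = []
    for cell in product(*domains):
        hits = [lab for r, lab in zip(rows, labels) if covers(cell, r)]
        if len(hits) >= min_support and sum(hits) / len(hits) >= lift * overall:
            probable.append(cell)
    # Keep a probable cell only if no other probable cell generalizes it.
    return [c for c in probable
            if not any(is_more_general(p, c) for p in probable)]


# Hypothetical toy data: (smoking habit, drinking habit), 1 = lung cancer.
rows = ([("smoking", "drinking")] * 4 + [("smoking", "no-drinking")] * 4
        + [("no-smoking", "drinking")] * 6 + [("no-smoking", "no-drinking")] * 6)
labels = [1, 1, 1, 0] + [1, 1, 1, 0] + [1, 0, 0, 0, 0, 0] + [1, 0, 0, 0, 0, 0]

result = most_general_probable_groups(rows, labels)
print(result)
```

In this toy data the overall rate is 8/20 = 0.4, so the threshold is 0.6. The cells (smoking, drinking), (smoking, no-drinking), and (smoking, *) all have rate 0.75 and are probable, but only (smoking, *) is returned because it generalizes the other two. The final filter compares every pair of probable cells, which is exactly the kind of exhaustive work a lattice-pruning search would aim to avoid.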