Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies
Xuliang Zhu, Xin Huang, Byron Choi, Jianliang Xu, William K. Cheung, Yanchun Zhang, Jiming Liu
DOI: https://doi.org/10.1109/tkde.2021.3120722
IF: 9.235
2021-01-01
IEEE Transactions on Knowledge and Data Engineering
Abstract: Data summarization, which presents a small subset of a dataset to users, has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, such as the gene ontology and the disease ontology. In this paper, we study weighted tree summarization. We motivate and formulate the $\mathsf{kWTS}$-problem as selecting a diverse set of $k$ nodes to summarize a hierarchical tree $T$ with weighted terminologies. We first propose an efficient greedy tree summarization algorithm, $\mathsf{GTS}$, which solves the problem with a $(1-1/e)$-approximation guarantee. Although $\mathsf{GTS}$ returns answers with a quality guarantee, they are not necessarily optimal. To solve the problem optimally, we further develop a dynamic programming algorithm, $\mathsf{OTS}$, which obtains optimal answers for the $\mathsf{kWTS}$-problem in $O(nhk^3)$ time, where $n$ and $h$ are the number of nodes and the height of tree $T$, respectively. The complexity and correctness of $\mathsf{OTS}$ are theoretically analyzed. In addition, we propose a tree reduction optimization that removes useless nodes with zero weights and shrinks the tree into a smaller one, which accelerates both $\mathsf{GTS}$ and $\mathsf{OTS}$ on real-world datasets. Moreover, we illustrate an application to graph visualization based on the answer of $k$-sized tree summarization and present it in a case study. Extensive experimental results on real-world datasets show the effectiveness and efficiency of our proposed approximate and optimal algorithms for tree summarization. Furthermore, we conduct a usability evaluation of attractive topic recommendation on the ACM Computing Classification System dataset to validate the usefulness of our model and algorithms.
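The abstract does not spell out the summarization objective, so the following is only a minimal sketch of the greedy selection pattern behind a $(1-1/e)$ guarantee, not the paper's $\mathsf{GTS}$ algorithm. It assumes a stand-in facility-location score: a selected node $s$ represents itself and each descendant $v$ with similarity $1/(1+d)$, where $d$ is the number of edges between them, and the score sums each node's weight times its best similarity to any selected node. Scores of this form are monotone and submodular, which is the property under which greedy selection is known to achieve a $(1-1/e)$ approximation. The function name `greedy_summary`, the similarity function, and the toy tree are all hypothetical.

```python
def greedy_summary(parent, weight, k):
    """Greedily pick up to k nodes under an assumed facility-location score.

    parent: dict mapping each node to its parent (None for the root).
    weight: dict mapping nodes to non-negative weights (missing means 0).
    """
    # Assumed similarity: sim[s][v] = 1 / (1 + d) when s is v itself
    # or an ancestor located d edges above v; otherwise s does not represent v.
    sim = {node: {} for node in parent}
    for v in parent:
        s, d = v, 0
        while s is not None:
            sim[s][v] = 1.0 / (1 + d)
            s, d = parent[s], d + 1

    best = {v: 0.0 for v in parent}  # best similarity each node receives so far
    chosen, score = [], 0.0
    for _ in range(min(k, len(parent))):
        top, top_gain = None, 0.0
        for s in parent:
            if s in chosen:
                continue
            # Marginal gain of adding s to the current selection.
            gain = sum(weight.get(v, 0.0) * max(0.0, val - best[v])
                       for v, val in sim[s].items())
            if gain > top_gain:
                top, top_gain = s, gain
        if top is None:  # no remaining node improves the score
            break
        chosen.append(top)
        score += top_gain
        for v, val in sim[top].items():
            best[v] = max(best[v], val)
    return chosen, score


# Toy tree (hypothetical): root r with children a and b;
# a has leaves a1, a2, a3 and b has leaf b1. Only leaves carry weight.
parent = {"r": None, "a": "r", "b": "r",
          "a1": "a", "a2": "a", "a3": "a", "b1": "b"}
weight = {"a1": 3, "a2": 3, "a3": 3, "b1": 4}

chosen, score = greedy_summary(parent, weight, 2)
print(chosen, score)
```

On this toy input the first pick is the internal node `a`, because it represents three weighted leaves at once, and the second is the leaf `b1`. This illustrates how a summary can mix general and specific terms. Each round scans all candidates and their descendants, so the sketch favors clarity over efficiency.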