Representing Document As Dependency Graph for Document Clustering

Yujing Wang,Xiaochuan Ni,Jian-Tao Sun,Yunhai Tong,Zheng Chen
DOI: https://doi.org/10.1145/2063576.2063920
2011-01-01
Abstract:In traditional clustering methods, a document is often represented as "bag of words" (in BOW model) or n-grams (in suffix tree document model) without considering the natural language relationships between the words. In this paper, we propose a novel approach DGDC (Dependency Graph-based Document Clustering algorithm) to address this issue. In our algorithm, each document is represented as a dependency graph where the nodes correspond to words which can be seen as meta-descriptions of the document; whereas the edges stand for the relations between pairs of words. A new similarity measure is proposed to compute the pairwise similarity of documents based on their corresponding dependency graphs. By applying the new similarity measure in the Group-average Agglomerative Hierarchial Clustering (GAHC) algorithm, the final clusters of documents can be obtained. The experiments were carried out on five public document datasets. The empirical results have indicated that the DGDC algorithm can achieve better performance in document clustering tasks compared with other approaches based on the BOW model and suffix tree document model.
What problem does this paper attempt to address?