Multi-type Features Based Web Document Clustering

Shen Huang, Gui-Rong Xue, Ben-Yu Zhang, Zheng Chen, Yong Yu, Wei-Ying Ma
DOI: https://doi.org/10.1007/978-3-540-30480-7_27
2004-01-01
Abstract: Clustering has been demonstrated to be a feasible way to explore the contents of a document collection and to organize search engine results. For this task, many features of a Web page, such as content, anchor text, URL, and hyperlinks, can be exploited, and each can yield a different clustering result. We aim to provide a unified and better result for end users. Some prior work has studied how to use several types of features together to perform clustering; most of it focuses on ensemble methods or on combining similarity measures. In this paper, we propose a novel algorithm: Multi-type Features based Reinforcement Clustering (MFRC). This algorithm does not use a single combined score across all feature spaces. Instead, it uses the intermediate clustering result in one feature space as additional information to gradually enhance clustering in the other spaces, so that a consensus is eventually reached through mutual reinforcement. Experimental results show that MFRC also provides some performance improvement.
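The mutual-reinforcement idea described in the abstract can be illustrated with a small sketch. This is not the authors' MFRC algorithm, whose details the abstract does not give. It is a minimal illustration under stated assumptions: each feature space is clustered with plain k-means, and the cluster labels from the other spaces are appended to each document's vector as weighted one-hot indicators before re-clustering. The function names (`kmeans`, `one_hot`, `same_partition`, `reinforce`), the `weight` parameter, and the toy data are all hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(list(max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def one_hot(labels, k, weight):
    """Encode cluster labels as weighted indicator vectors."""
    return [[weight if l == j else 0.0 for j in range(k)] for l in labels]

def same_partition(a, b):
    """True if two labelings group the documents identically (up to renaming)."""
    return len(set(zip(a, b))) == len(set(a)) == len(set(b))

def reinforce(spaces, k, weight=1.0, rounds=10):
    """Assumed scheme: re-cluster each space with the other spaces' labels
    appended as extra features, until all spaces agree or rounds run out."""
    labels = [kmeans(X, k) for X in spaces]
    for _ in range(rounds):
        if all(same_partition(labels[0], l) for l in labels[1:]):
            break  # consensus reached
        new_labels = []
        for i, X in enumerate(spaces):
            extra = [one_hot(l, k, weight)
                     for j, l in enumerate(labels) if j != i]
            augmented = [list(x) + [v for e in extra for v in e[d]]
                         for d, x in enumerate(X)]
            new_labels.append(kmeans(augmented, k))
        labels = new_labels
    return labels

# Toy data (hypothetical): six documents in two feature spaces.
# In the "links" space the last document is ambiguous and is
# initially grouped with the first three.
content = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
links = [[0.0], [0.1], [0.2], [1.0], [0.9], [0.45]]

initial_links = kmeans(links, 2)
result = reinforce([content, links], k=2)
print(initial_links)  # [0, 0, 0, 1, 1, 0]
print(result)         # [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1]]
```

In this toy run the link space alone misplaces the last document, and the labels borrowed from the content space pull it into the second cluster, so both spaces end with the same partition. The `weight` value controls how strongly the other spaces' labels influence each re-clustering. It is an assumption of this sketch, not a parameter taken from the paper.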