Web Document Clustering Using Hyperlink Structures
Xiaofeng He,Hongyuan Zha,Chris Ding,Horst D. Simon
2001-01-01
XIAOFENG HE*†, HONGYUAN ZHA*, CHRIS H.Q. DING† AND HORST D. SIMON†

Abstract. With the exponential growth of information on the World Wide Web, there is great demand for developing efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web and remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods exploring textual information, hyperlink structure and co-citation relations. In particular, we apply the normalized-cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections of the normalized-cut method to the K-means method. We then experiment with the normalized-cut method in the context of clustering query result sets for web search engines.

Keywords. World Wide Web, graph partitioning, Cheeger constant, clustering method, K-means method, normalized cut method, eigenvalue decomposition, power method.

1. Introduction. Currently the World Wide Web contains billions of documents and it is still growing rapidly. Finding the relevant documents to satisfy a user's information need is a very important and challenging task. Many commercial search engines have been developed and used by millions of people all over the world. However, the relevancy of documents returned in search engine result sets is still lacking, and further research and development is needed to really make search engines a ubiquitous information-seeking tool. The World Wide Web has a rich structure: it contains both textual web documents and the hyperlinks that connect them. The web documents and hyperlinks between them form a directed graph in which the web documents can be viewed as vertices and the hyperlinks as directed edges.
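The directed-graph view can be made concrete with a minimal sketch. It assumes documents are numbered 0..n-1 and hyperlinks are given as (source, target) pairs; the function names and the toy collection are hypothetical illustrations, not taken from the paper. The co-citation count used here is the usual one (the number of documents that link to both members of a pair), which may differ from the exact weighting the authors adopt.

```python
from itertools import combinations


def build_adjacency(num_docs, links):
    """Return A with A[i][j] = 1 if document i has a hyperlink to document j."""
    A = [[0] * num_docs for _ in range(num_docs)]
    for src, dst in links:
        if src != dst:  # assumption of this sketch: self-links are ignored
            A[src][dst] = 1
    return A


def cocitation(A):
    """Return C with C[i][j] = number of documents linking to both i and j."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for row in A:
        # Documents cited together by the same source are co-cited once.
        cited = [j for j, a in enumerate(row) if a]
        for i, j in combinations(cited, 2):
            C[i][j] += 1
            C[j][i] += 1
    return C


# Hypothetical toy collection: 4 pages, hyperlinks as (source, target) pairs.
links = [(0, 1), (0, 2), (3, 1), (3, 2), (1, 2)]
A = build_adjacency(4, links)
C = cocitation(A)
```

In this toy collection, pages 1 and 2 are both linked from pages 0 and 3, so their co-citation count is 2. The adjacency matrix is asymmetric because hyperlinks are directed, whereas the co-citation matrix is symmetric by construction.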
Algorithms have been developed utilizing this directed graph to extract information contained in a collection of hyperlinked web documents. Kleinberg proposed the HITS algorithm, based purely on hyperlink information, to retrieve the most relevant information: authority and hub documents for a user query [20]. However, if the hypertext collection consists of several topics, authority and hub documents may only cover the most popular topics and leave out the less popular ones. One way to remedy this situation is to first partition the hypertext collection into topical groups, and present the search results as a list of topics to the user. This leads to the need to cluster web documents based on both the textual and hyperlink information.

* Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, {xhe,zha}@cse.psu.edu. This work was supported in part by NSF grant CCR-9901986.
† NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, {xfhe,chqding,hdsimon}@lbl.gov. Supported by Department of Energy through an LBL LDRD fund.

There exists a large literature on clustering methods and algorithms [13, 19]. Generally speaking, the purpose of cluster analysis is to organize the data into meaningful groups: the data objects in the same group are highly similar and those in different groups are dissimilar. Judging the effectiveness of a clustering algorithm is difficult and usually application-dependent. In this paper, we apply a similarity-based clustering method to the problem of clustering web documents. It utilizes a graph-theoretic criterion called normalized cut which has its roots in the study of graph isoperimetric