WEB DOCUMENT CLUSTERING USING HYPERLINK STRUCTURES

XIAOFENG HE*†, HONGYUAN ZHA*, CHRIS H.Q. DING† AND HORST D. SIMON†

Abstract. With the exponential growth of information on the World Wide Web, there is great demand for developing efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web and remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods exploring textual information, hyperlink structure and co-citation relations. In particular, we apply the normalized-cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections of the normalized-cut method to the K-means method. We then experiment with the normalized-cut method in the context of clustering query result sets for web search engines.

Keywords. World Wide Web, graph partitioning, Cheeger constant, clustering method, K-means method, normalized cut method, eigenvalue decomposition, power method.

1. Introduction. Currently the World Wide Web contains billions of documents and it is still growing rapidly. Finding the relevant documents to satisfy a user's information need is a very important and challenging task. Many commercial search engines have been developed and are used by millions of people all over the world. However, the relevancy of documents returned in search engine result sets is still lacking, and further research and development is needed to really make search engines a ubiquitous information-seeking tool. The World Wide Web has a rich structure: it contains both textual web documents and the hyperlinks that connect them. The web documents and the hyperlinks between them form a directed graph in which the web documents can be viewed as vertices and the hyperlinks as directed edges.
Algorithms have been developed utilizing this directed graph to extract information contained in a collection of hyperlinked web documents. Kleinberg proposed the HITS algorithm, based purely on hyperlink information, to retrieve the most relevant information: authority and hub documents for a user query [20]. However, if the hypertext collection consists of several topics, authority and hub documents may only cover the most popular topics and leave out the less popular ones. One way to remedy this situation is to first partition the hypertext collection into topical groups, and present the search results as a list of topics to the user. This leads to the need to cluster web documents based on both the textual and hyperlink information.

There exists a large literature on clustering methods and algorithms [13, 19]. Generally speaking, the purpose of cluster analysis is to organize the data into meaningful groups: the data objects in the same group are highly similar and those in different groups are dissimilar. Judging the effectiveness of a clustering algorithm is difficult and usually application-dependent. In this paper, we apply a similarity-based clustering method to the problem of clustering web documents. It utilizes a graph-theoretic criterion called normalized cut which has its roots in the study of graph isoperimetric

* Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, {xhe,zha}@cse.psu.edu. This work was supported in part by NSF grant CCR-9901986.
† NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, {xfhe,chqding,hdsimon}@lbl.gov. Supported by Department of Energy through an LBL LDRD fund.
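
The remark in the introduction that hub and authority documents may cover only the most popular topic can be illustrated with a small sketch. The code below is a minimal hub/authority power iteration in the spirit of HITS [20], not the authors' implementation. The function name `hits`, the edge list `edges`, the page names, and the iteration count are all hypothetical choices made for illustration. The toy graph has two disconnected topics: three pages linking to `a1` and two pages linking to `b1`.

```python
from math import sqrt


def hits(edges, n_iter=50):
    """Hub/authority power iteration on a directed graph.

    `edges` is a list of (source, target) pairs, one per hyperlink.
    Returns two dicts: authority scores and hub scores, each scaled
    to unit Euclidean length. (Hypothetical helper for illustration.)
    """
    nodes = sorted({v for edge in edges for v in edge})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(n_iter):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of pages it links to.
        hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Rescale both score vectors to unit length.
        for scores in (auth, hub):
            norm = sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return auth, hub


# Assumed toy data: a larger topic (h1, h2, h3 -> a1) and a smaller,
# disconnected topic (g1, g2 -> b1).
edges = [("h1", "a1"), ("h2", "a1"), ("h3", "a1"),
         ("g1", "b1"), ("g2", "b1")]
auth, hub = hits(edges)

# The authority score of b1 shrinks toward zero relative to a1,
# so the smaller topic effectively disappears from the ranking.
print(sorted(auth.items(), key=lambda kv: -kv[1]))
```

In this toy graph the iteration concentrates almost all of the authority weight on the larger topic, which is the behavior that motivates partitioning the collection into topical groups before ranking.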

Web Documents Mining

Web Documents Clustering with Interest Links

Web Document Clustering Based on Web Log Mining

Research on Web Document Clustering Based on Sentential Maximum Frequent Word Sets

A New Document Clustering Algorithm Based on Association Rule

Clustering analysis and association mining for Web service composition

Mining frequent association tag sequences for clustering XML documents

Web Pages Clustering and Concepts Mining: an Approach Towards Intelligent Information Retrieval

Web Search Clustering and Labeling with Hidden Topics

Mining Subtopics from Text Fragments for a Web Query

Short Text Feature Extraction and Clustering for Web Topic Mining.

Web Usage Mining Based on WAN Users' Behaviors

Web mining: knowledge discovery on the Web

A Hybrid Strategy for Clustering Data Mining Documents.

RESEARCH ON WEB MINING: A SURVEY

Web document clustering algorithm based on semantic similarity

Document clustering based on Web-log mining

Efficient mining of cross-transaction web usage patterns in large database

A Novel Web Usage Mining Method Based on Web Session Clustering

Document clustering Applied to Search Results on the Web