Abstract:In this paper an approach that is using evolving, incremental (on-line) clustering to automatically group relevant Web-based documents is proposed. It is centred on a recently introduced evolving fuzzy rule-based clustering approach and borrows heavily from the Nature in the sense that it is evolution-inspired. That is, the structure of the clusters and their number is not predefined, but it self-develops, they grow and shrink when new web documents are accessed. Existing Web-based search engine technology returns long lists of web pages that contain the user's search term query but are not presented in order of contextual similarity. For example, search terms that have more than one meaning such as "cold" are presented to the user in a list containing documents relating to the Cold War and the common cold. If these results could be clustered "on the fly" then this improved presentation of results would allow the end user to find relevant documents more easily by requiring the inspection of one cluster of contextually similar documents rather than entire list of documents containing information pertaining to irrelevant contexts. An issue that is paid a special attention to is the similarity measure between the textual documents. Euclidean, Levenstein, and Cosine similarity measures have been used with cosine dissimilarity/distance performing best and addressing the problem of different number of features in each document. The proposed evolving classifier has also learning capability - it improves the result on-line with any new document that has been accessed. Finally, the proposed approach is characterized by low complexity. This paper reports the results of research that is going on for more than two years at Lancaster University on development of a novel clustering method that is suitable for real-time implementations. It is based on evolution principles and tries to address the limitations of existing clustering algorithms which cannot cope in an online mode with high dimensional datasets. This evolution-inspired and Nature-inspired approach introduces the new concept of potential values which describes the fitness of a new sample (web document) to be the prototype of a new cluster without the need to store each previously encountered documents but taking into account the contextual similarity density between all previous documents in a recursive and thus computationally efficient way (thereby reducing memory requirements and improving speed compared with existing approaches). This paper examines the clustering of documents by contextual similarity using extracted keywords represented in a vector space model.

An effective web document clustering algorithm based on bisection and merge

Document Clustering Using Locality Preserving Indexing

Correlation-Based Web Document Clustering for Adaptive Web Interface Design

On-line Evolving Clustering of Web Documents

Web Documents Clustering with Interest Links

Clustering Algorithm on Block Division of Documents

An efficient clustering algorithm for large-scale topical web pages.

Hierarchical Clustering Algorithms for Document Datasets

Use Link-Based Clustering To Improve Web Search Results

Web page clustering enhanced by summarization.

On Combining Link and Contents Information for Web Page Clustering

Design and simulation of a document clustering algorithm based on genetic algorithm

Fuzzy clustering of web documents using equivalence relations and fuzzy hierarchical clustering

Web Image Clustering by Consistent Utilization of Visual Features and Surrounding Texts.

Multi-type Features Based Web Document Clustering

Towards effective document clustering: A constrained K-means based approach

PCCS：A FAST CLUSTERING AND CLASSIFICATION METHOD FOR WEB DOCUMENT

An Improved K-Means Algorithm for Documents Clustering

An Efficient Clustering Algorithm for Small Text Documents

A Unified Framework for Clustering Heterogeneous Web Objects

Quantitative Evaluation of Performance and Validity Indices for Clustering the Web Navigational Sessions