Abstract: Document clustering, which is used for topic discovery and similarity computation, has received a great deal of attention in text data management. Traditional clustering methods are not viable for multi-topic documents in particular, because content that is distinguished by the subtopic structure may not be pertinent across the entire document. In this paper, a sub-document based framework for clustering multiple documents is proposed, in which LDA is used for document segmentation. The proposed improvised framework takes a two-part approach to the clustering problem. First, instead of applying a clustering algorithm to the entire data set, documents are partitioned into cohesive sub-documents along topic boundaries through text segmentation, establishing a two-level representation of text data, i.e., topics and words. Each sub-document is clustered within its document, and the resulting clusters are further clustered across documents. Second, the proposed framework is compared with existing clustering methods, both traditional and segment-based, across different clustering algorithms, using the F-measure as the evaluation metric. In addition, various real-time data sets that contain multi-topic documents are used to validate the clustering algorithms within the proposed sub-document based framework. Experimental results show that the proposed framework outperforms existing clustering approaches in terms of both F-measure and efficiency, by at least 73% with LDA segmentation and bisecting LDA in comparison to TextTiling.
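The abstract names the F-measure as the evaluation metric but does not spell out its definition. As an illustration only, the sketch below assumes the class-weighted clustering F-measure commonly used in the document clustering literature: each reference class is matched to the cluster that gives it the highest F-score, and those best scores are averaged with weights proportional to class size. The function name `clustering_f_measure` and its arguments are hypothetical and are not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted F-measure of a flat clustering (assumed definition).

    true_labels: reference class of each document.
    cluster_ids: cluster assigned to each document, in the same order.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    if n == 0:
        return 0.0

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
        # Weight the best-matching cluster's score by the class size.
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    reference = ["sports", "sports", "politics", "politics"]
    print(clustering_f_measure(reference, [0, 0, 1, 1]))
    print(clustering_f_measure(reference, [0, 0, 0, 0]))
```

Under this definition a clustering that reproduces the reference classes exactly scores 1.0, while placing every document in a single cluster is penalized through low precision.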

A New Partitioning Based Algorithm for Document Clustering

Document Clustering Using Locality Preserving Indexing

Adaptive Centroid-Based Clustering Algorithm for Text Document Data

Clustering Algorithm on Block Division of Documents

A Document Ensemble Clustering Approach Via Dimensionality Reduction

Variant of K-means Algorithm for Document Clustering: Optimization Initial Centers

A Survey of Document Clustering

K-Means Algorithm for Document Clustering with Optimal Initial Values

Hierarchical Clustering Algorithms for Document Datasets

An Adaptive Initial Cluster Centers Selection Algorithm for High-Dimensional Partition Clustering

An Improvised Sub-Document Based Framework for Efficient Document Clustering

An Efficient Hybrid Hierarchical Document Clustering Method

An Improved K-means Algorithm for Document Clustering

An Improved K-Means Algorithm for Documents Clustering

A Clustering Algorithm for Short Documents Based On Concept Similarity

Finding Good Initial Cluster Center by Using Maximum Average Distance

A feature selection algorithm for document clustering based on word co-occurrence frequency

Document Clustering Based on Semantic Smoothing Approach

An optimized k-means algorithm of reducing cluster intra-dissimilarity for document clustering

Text clustering based on term weights automatic partition

Text Clustering Based on Automatic Partition of Feature Item Weight