Abstract: Objective: Biomedical document conceptualization is the process of clustering biomedical documents based on ontology-represented domain knowledge. The result of this process is a representation of the biomedical documents by a set of key concepts and their relationships. Most clustering methods cluster documents based on invariant domain knowledge. The objective of this work is to develop an effective method to cluster biomedical documents based on various user-specified ontologies, so that users can exploit the concept structures of documents more effectively. Methods: We develop a flexible framework that allows users to specify the knowledge bases in the form of ontologies. Based on the user-specified ontologies, we develop a key concept induction algorithm, which uses latent semantic analysis to identify key concepts and cluster documents. A corpus-related ontology generation algorithm is developed to generate the concept structures of documents. Results: We evaluate the proposed method and five other clustering algorithms on two biomedical datasets. The proposed method outperforms the five other algorithms in terms of key concept identification. On the first biomedical dataset, our method achieves F-measure values of 0.7294 and 0.5294 based on the MeSH ontology and the Gene Ontology (GO), respectively. On the second biomedical dataset, our method achieves F-measure values of 0.6751 and 0.6746 based on the MeSH ontology and GO, respectively. Both sets of results outperform the five other algorithms in terms of F-measure. Based on the MeSH ontology and GO, the generated corpus-related ontologies show informative conceptual structures. Conclusions: The proposed method enables users to specify the domain knowledge used to exploit the conceptual structures of biomedical document collections. In addition, the proposed method is able to extract the key concepts and cluster the documents with relatively high precision.
(C) 2010 Elsevier B.V. All rights reserved.
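
The results in the abstract are reported as F-measure values. The abstract does not say which variant is used, so the following is only a minimal sketch, assuming the common balanced F-measure (the harmonic mean of precision and recall) computed over sets of key concepts. The function names and the example concept sets are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted: set, relevant: set) -> tuple:
    """Return (precision, recall) of a predicted concept set against a reference set."""
    hits = len(predicted & relevant)  # concepts found in both sets
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (balanced F-measure when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical key concepts, invented purely for illustration.
predicted = {"neoplasms", "apoptosis", "gene expression", "proteins"}
relevant = {"neoplasms", "apoptosis", "signal transduction"}

p, r = precision_recall(predicted, relevant)
score = f_measure(p, r)
print(f"precision={p:.4f} recall={r:.4f} F={score:.4f}")
```

With these made-up sets, two of the four predicted concepts are correct and two of the three reference concepts are recovered, so the score combines a precision of 1/2 with a recall of 2/3.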

Ontology Based Document Clustering Using MapReduce

Document Clustering Using Locality Preserving Indexing

Distributed Affinity Propagation Clustering Based on MapReduce

Ontology Mapping Method Based on Ontology Partition

Constructing virtual documents for ontology matching using mapreduce

A Semantic approach for effective document clustering using WordNet

VDoc+: a Virtual Document Based Approach for Matching Large Ontologies Using MapReduce.

Medical Document Clustering Using Ontology-Based Term Similarity Measures

An optimized SVM-RFE based feature selection and weighted entropy K-means approach for big data clustering in mapreduce

Concept-Enhanced Multi-view Co-clustering of Document Data

A knowledge-driven approach to biomedical document conceptualization

Automatic Incremental Clustering Using Bat-Grey Wolf Optimizer-Based MapReduce Framework for Effective Management of High-Dimensional Data

An Efficient K-Means Clustering Algorithm On Mapreduce

Information Retrieval in long documents: Word clustering approach for improving Semantics

Big data clustering using fractional sail fish-sparse fuzzy C-means and particle whale optimization based MapReduce framework

Hierarchical Clustering Algorithms for Document Datasets

Enhancing Cluster Quality of Numerical Datasets with Domain Ontology

Exploiting Wikipedia As External Knowledge For Document Clustering

Research and implementation of user clustering based on MapReduce in multimedia big data

An efficient PAM spatial clustering algorithm based on MapReduce

Parallel Subspace Clustering Using MapReduce