Abstract:

MOTIVATION: Clustering MEDLINE documents is usually performed with the vector space model, which computes the content similarity between two documents essentially as the inner product of their word vectors. Recently, the semantic information of the MeSH (Medical Subject Headings) thesaurus has been applied to clustering MEDLINE documents by mapping documents into MeSH concept vectors, which are then clustered. However, current approaches that use the MeSH thesaurus have two serious limitations: first, important semantic information may be lost when generating MeSH concept vectors, and second, the content information of the original text is discarded.

METHODS: Our new strategy has three key points. First, we develop a sound method for measuring the semantic similarity between two documents over the MeSH thesaurus. Second, we combine the semantic and content similarities to generate an integrated similarity matrix between documents. Third, we apply a spectral approach to cluster documents over the integrated similarity matrix.

RESULTS: Using 100 different datasets of MEDLINE records, we conduct extensive experiments, varying the alternative measures and parameters. Experimental results show that integrating the semantic and content similarities outperforms using only one of the two similarities, and that the improvement is statistically significant. We further find a best parameter setting that is consistent across all experimental conditions tested. Finally, we show a typical example of the resulting clusters, confirming the effectiveness of our strategy in improving MEDLINE document clustering.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
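The three steps listed under METHODS can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state how the two similarities are combined or which spectral algorithm is used, so the weighted sum with a hypothetical weight `alpha`, the hand-written `semantic_sim` matrix standing in for a MeSH-based similarity, and the two-cluster split on the Fiedler vector are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (documents x terms)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T

def integrate(content_sim, semantic_sim, alpha=0.5):
    """Weighted sum of the two matrices (an assumed combination rule)."""
    return alpha * semantic_sim + (1.0 - alpha) * content_sim

def spectral_bipartition(S):
    """Split documents into two clusters by the sign of the eigenvector that
    belongs to the second-smallest eigenvalue of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    laplacian = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return (eigvecs[:, 1] > 0).astype(int)

# Toy word-count vectors: documents 0 and 1 share terms, as do 2 and 3.
word_counts = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 1.0, 2.0, 1.0],
])

# Hypothetical semantic similarities (made-up values, not computed from MeSH).
semantic_sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])

content_sim = cosine_similarity_matrix(word_counts)
integrated = integrate(content_sim, semantic_sim, alpha=0.5)
labels = spectral_bipartition(integrated)
print(labels)
```

Setting `alpha` to 0 or 1 in this sketch reduces it to clustering on content similarity or semantic similarity alone, which mirrors the comparison described under RESULTS. Which cluster receives label 0 or 1 is arbitrary, because the sign of an eigenvector is not fixed.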

Information Retrieval in long documents: Word clustering approach for improving Semantics

Document Clustering Using Locality Preserving Indexing

A Semantic approach for effective document clustering using WordNet

Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach

WordNet and Semantic similarity based approach for document clustering

Semantic smoothing of document models for agglomerative clustering

Concept-Enhanced Multi-view Co-clustering of Document Data

Document Clustering Based on Word Sense Cluster

Clustering-based Semantic Retrieval Algorithm

Enhancing MEDLINE Document Clustering by Incorporating MeSH Semantic Similarity

Representing Documents and Queries as Sets of Word Embedded Vectors for Information Retrieval

Semantic Smoothing for Model-based Document Clustering

Document Clustering Based on Semantic Smoothing Approach

Improving search result clustering using nature inspired approach

A Clustering Algorithm for Short Documents Based On Concept Similarity

Clustering articles based on semantic similarity

State of the art document clustering algorithms based on semantic similarity

Identifying Bengali Multiword Expressions using Semantic Clustering

An End-to-End Efficient Lucene-Based Framework of Document/Information Retrieval

Concept-based indexing in text information retrieval

Semantic Term "Blurring" and Stochastic "Barcoding" for Improved Unsupervised Text Classification