Abstract:
Objective: Biomedical document conceptualization is the process of clustering biomedical documents based on ontology-represented domain knowledge. The result of this process is a representation of the biomedical documents as a set of key concepts and their relationships. Most clustering methods cluster documents based on invariant domain knowledge. The objective of this work is to develop an effective method for clustering biomedical documents based on various user-specified ontologies, so that users can exploit the concept structures of documents more effectively.
Methods: We develop a flexible framework that allows users to specify the knowledge bases in the form of ontologies. Based on the user-specified ontologies, we develop a key concept induction algorithm, which uses latent semantic analysis to identify key concepts and cluster documents. A corpus-related ontology generation algorithm is developed to generate the concept structures of documents.
Results: We evaluate the proposed method and five other clustering algorithms on two biomedical datasets. The proposed method outperforms the five other algorithms in terms of key concept identification. On the first biomedical dataset, our method achieves F-measure values of 0.7294 and 0.5294 based on the MeSH ontology and the Gene Ontology (GO), respectively. On the second biomedical dataset, our method achieves F-measure values of 0.6751 and 0.6746 based on the MeSH ontology and GO, respectively. On both datasets, these results outperform the five other algorithms in terms of F-measure. The corpus-related ontologies generated from the MeSH ontology and GO show informative conceptual structures.
Conclusions: The proposed method enables users to specify the domain knowledge used to exploit the conceptual structures of biomedical document collections. In addition, the proposed method is able to extract the key concepts and cluster the documents with relatively high precision.
(C) 2010 Elsevier B.V. All rights reserved.
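
The abstract reports F-measure values for key concept identification but does not say which variant of the measure is used. As a rough illustration only, the sketch below assumes the common set-based form, where identified key concepts are compared against a reference set and precision and recall are combined with a weight `beta` (balanced F1 when `beta` is 1). The function name `f_measure` and the concept labels are hypothetical and are not taken from the paper.

```python
def f_measure(predicted, gold, beta=1.0):
    """Set-based F-measure of predicted key concepts against a reference set.

    Assumption: the balanced (beta = 1) form is a common default; the abstract
    does not state which variant the authors used.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # No overlap (or an empty set): define the score as 0 to avoid 0/0.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical concept labels, for illustration only (not data from the paper).
predicted_concepts = {"Neoplasms", "Apoptosis", "Genes, p53", "Mice"}
reference_concepts = {"Neoplasms", "Apoptosis", "Genes, p53",
                      "Signal Transduction", "Cell Cycle"}

score = f_measure(predicted_concepts, reference_concepts)
print(f"F-measure: {score:.4f}")
```

In this toy case three of the four predicted concepts are in the reference set of five, so precision is 0.75 and recall is 0.6.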

Functional Structure Identification of Scientific Documents in Computer Science.

Document Clustering Using Locality Preserving Indexing

Research on the Structure Recognition of Academic Texts Under Different Characteristics

Object Recognition from Scientific Document based on Compartment Refinement Framework

A Rule-Based Information Extraction System for Human-Readable Semi-Structured Scientific Documents

DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding

Structure extraction from PDF-based book documents.

DWSA: An Intelligent Document Structural Analysis Model for Information Extraction and Data Mining

Hierarchical Logical Structure Extraction of Book Documents by Analyzing Tables of Contents

Identification of secondary functional areas and functional structure analysis based on multisource geographic data

Optimized Hierarchy Clustering Based Extraction for Logical Document Structures

A New Method to Query Document Database by Content and Structure

A Method to Query Document Database by Content and Structure

Multi-documents Automatic Abstracting Based on Text Clustering and Semantic Analysis

Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis

Unsupervised discovery of information structure in biomedical documents

Enhancing Identification of Structure Function of Academic Articles Using Contextual Information

A Study on Construction and Analysis of Discipline Knowledge Structure of Chinese LIS Based on CSSCI

Logical Structure Based Semantic Relationship Extraction from Semi-Structured Documents

Query-focused Summarisation in Research Articles Based on Semantic Function of Sentences

A knowledge-driven approach to biomedical document conceptualization