Abstract:

Motivation: Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to the development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English and biomedical literature is freely available; however, its validity as a substitute for clinical text in representing the semantics of clinical terms remains to be demonstrated.

Results: We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (Clinical Notes, PubMed Central articles and Wikipedia) were compared using the benchmark as reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho = 0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho = 0.60 for similarity and 0.57 for relatedness). We also evaluated the use of neural network-based relatedness measures for query expansion in a clinical document retrieval task and in a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms and that distributional semantic methods are useful for clinical and biomedical natural language processing applications.

Availability and implementation: The software and reference standards used in this study to evaluate semantic similarity and relatedness measures are publicly available as detailed in the article.
Contact: pakh0002@umn.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
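
The evaluation described in the Results can be illustrated with a minimal sketch. It assumes that term similarity is taken as the cosine between term vectors and that the reported rho is a Spearman rank correlation against the human ratings; the abstract states neither explicitly. The vectors, term pairs and ratings below are invented toy values, not data from the study, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks of the values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Toy 3-dimensional term vectors (invented for illustration only).
vectors = {
    "heart": [0.9, 0.1, 0.0],
    "myocardium": [0.8, 0.2, 0.1],
    "aspirin": [0.1, 0.9, 0.2],
    "ibuprofen": [0.2, 0.8, 0.3],
    "fracture": [0.0, 0.2, 0.9],
}

# Term pairs with invented human similarity ratings (hypothetical scale).
pairs = [
    ("heart", "myocardium"),
    ("aspirin", "ibuprofen"),
    ("heart", "aspirin"),
    ("aspirin", "fracture"),
    ("heart", "fracture"),
]
human_ratings = [3.8, 3.5, 1.5, 1.2, 1.0]

# Score each pair with the model, then correlate with the human ratings.
model_scores = [cosine(vectors[a], vectors[b]) for a, b in pairs]
rho = spearman_rho(model_scores, human_ratings)
print(f"rho = {rho:.2f}")
```

Under these assumptions, repeating the same scoring with vectors trained on different corpora and comparing the resulting rho values would mirror the cross-domain comparison the abstract reports.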

Corpus domain effects on distributional semantic modeling of medical terms
