Abstract:

Background: For selection and evaluation of potential biomarkers, inclusion of already published information is of utmost importance. In spite of significant advancements in text- and data-mining techniques, the vast knowledge space of biomarkers in biomedical text has remained unexplored. Existing named entity recognition approaches are not sufficiently selective for the retrieval of biomarker information from the literature. The purpose of this study was to identify textual features that enhance the effectiveness of biomarker information retrieval for different indication areas and diverse end user perspectives.

Methods: A biomarker terminology was created and further organized into six concept classes. Performance of this terminology was optimized towards balanced selectivity and specificity. The information retrieval performance using the biomarker terminology was evaluated based on various combinations of the terminology's six classes. Further validation of these results was performed on two independent corpora representing two different neurodegenerative diseases.

Results: The current state of the biomarker terminology contains 119 entity classes supported by 1890 different synonyms. The information retrieval results show an improved retrieval rate of informative abstracts, which is achieved by including clinical management terms and evidence of gene/protein alterations (e.g. gene/protein expression status or certain polymorphisms) in combination with disease and gene name recognition. When additional filtering through other classes (e.g. diagnostic or prognostic methods) is applied, the typically high number of unspecific search results is significantly reduced. The evaluation results suggest that this approach enables the automated identification of biomarker information in the literature. A demo version of the search engine SCAIView, including the biomarker retrieval, is made available to the public through http://www.scaiview.com/scaiview-academia.html.

Conclusions: The approach presented in this paper demonstrates that using a dedicated biomarker terminology for automated analysis of the scientific literature may be helpful as an aid to finding biomarker information in text. Successful extraction of candidate biomarker information from published resources can be considered the first step towards developing novel hypotheses. These hypotheses will be valuable for early decision-making in the drug discovery and development process.
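
The retrieval idea described in the abstract (keeping only abstracts that mention terms from a chosen combination of concept classes) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names, the synonyms, and the helper functions `classes_found` and `retrieve` are hypothetical and are not taken from the published terminology or from SCAIView, and plain dictionary matching stands in for the named entity recognition used in the study.

```python
import re

# Hypothetical mini-terminology. Class names and synonyms are illustrative
# only; the real terminology has 119 entity classes and 1890 synonyms.
TERMINOLOGY = {
    "clinical_management": ["diagnosis", "prognosis", "treatment response"],
    "evidence": ["overexpression", "polymorphism", "downregulated"],
    "disease": ["alzheimer's disease", "parkinson's disease"],
    "gene": ["apoe", "snca"],
}


def classes_found(text, terminology=TERMINOLOGY):
    """Return the set of concept classes with at least one synonym in text."""
    lowered = text.lower()
    found = set()
    for cls, synonyms in terminology.items():
        for syn in synonyms:
            # Word-boundary match so short gene symbols do not match
            # inside longer words.
            if re.search(r"\b" + re.escape(syn) + r"\b", lowered):
                found.add(cls)
                break
    return found


def retrieve(abstracts, required_classes):
    """Keep abstracts that mention every one of the required classes."""
    required = set(required_classes)
    return [a for a in abstracts if required <= classes_found(a)]
```

Under these assumptions, calling `retrieve(abstracts, ["disease", "gene"])` corresponds to plain disease and gene co-occurrence, while adding `"clinical_management"` and `"evidence"` to the required classes narrows the result to abstracts that also carry the kind of context the abstract describes as informative.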

Mining Disease-Specific Molecular Association Profiles from Biomedical Literature: A Case Study

Discovering breast cancer drug candidates from biomedical literature.

A MeSH-based Biomedical Literature Mining Method for Exploring Associations Between Genes and Clinical Terms

Large-scale literature mining to assess the relation between anti-cancer drugs and cancer types

A Probabilistic Model for Mining Implicit 'Chemical Compound-Gene' Relations from Literature

Application of a New Probabilistic Model for Mining Implicit Associated Cancer Genes from OMIM and MEDLINE

Literature mining discerns latent disease–gene relationships

Building Disease-Specific Drug-Protein Connectivity Maps from Molecular Interaction Networks and PubMed Abstracts

Mining Relational Paths in Integrated Biomedical Data

The research on gene-disease association based on text-mining of PubMed

DisGeReExT: a knowledge discovery system for exploration of disease–gene associations through large-scale literature-wide analysis study

Mining Functional Relationships in Feature Subspaces from Gene Expression Profiles and Drug Activity Profiles

Text mining for finding functional community of related genes using TCM knowledge

Clinic-genomic Association Mining for Colorectal Cancer Using Publicly Available Datasets.

Quantitative measurement of clinic-genomic association for colorectal cancer using literature mining and Google-distance algorithm

Extending the boundaries of cancer therapeutic complexity with literature text mining

A Mixture Language Model for Class-Attribute Mining from Biomedical Literature Digital Library

Predicting implicit associated cancer genes from OMIM and MEDLINE by a new probabilistic model

Mining biomarker information in biomedical literature

Target discovery from data mining approaches

Automatic extraction of gene-disease associations from literature using joint ensemble learning