Abstract: Biomedical and life science literature is an essential channel for publishing experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, to aid scientists in discovering new relationships between biological entities and answering biological questions. Using the word2vec approach, we generated word vector representations from a corpus of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the replacement of synonymous terms with their preferred terms from biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities, such as known PPIs, shared signaling pathways and cellular functions, or narrower disease ontology groups, correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to those trained with other networks. This performance was sufficient to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
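As a rough illustration of how cosine similarity between word vectors can be turned into a gene-gene network, the following minimal sketch scores every pair of terms and keeps the pairs above a cutoff. It is not the pipeline described in the abstract: the three-dimensional vectors are made-up toy values, the gene symbols are chosen only for illustration, and both the `build_network` helper and the thresholding rule are assumptions, since the abstract does not state how edges were selected.

```python
import math
from itertools import combinations

# Toy vectors with made-up values (assumption for illustration only).
# Real word2vec embeddings are learned from a corpus and usually have
# hundreds of dimensions.
EMBEDDINGS = {
    "BRCA1": [0.9, 0.1, 0.2],
    "BRCA2": [0.8, 0.2, 0.1],
    "TP53": [0.5, 0.5, 0.4],
    "INS": [0.0, 0.9, 0.1],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_network(embeddings, threshold):
    """Hypothetical helper: keep term pairs whose similarity >= threshold.

    Returns a list of (term_a, term_b, similarity) edges, with terms
    visited in alphabetical order so the result is deterministic.
    """
    edges = []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine_similarity(embeddings[a], embeddings[b])
        if sim >= threshold:
            edges.append((a, b, sim))
    return edges


if __name__ == "__main__":
    # Print each retained edge with its similarity score.
    for a, b, sim in build_network(EMBEDDINGS, threshold=0.9):
        print(f"{a} -- {b}: {sim:.3f}")
```

The cutoff directly controls network density: a lower threshold admits more, weaker edges, so in practice it would have to be tuned against a reference such as a known PPI network.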

Annotating publicly-available samples and studies using interpretable modeling of unstructured metadata

Enhancing Biomedical Knowledge Discovery for Diseases: An Open-Source Framework Applied on Rett Syndrome and Alzheimer's Disease

Ontology-based Annotation and Retrieval for Large-Scale VCF Data

Automated Harmonization and Large-Scale Integration of Heterogeneous Biomedical Sample Metadata Using Large Language Models

FasTag: Automatic text classification of unstructured medical narratives

A Metadata Extraction Approach for Clinical Case Reports to Enable Advanced Understanding of Biomedical Concepts

Automated annotation of disease subtypes

Accelerating Clinical Text Annotation in Underrepresented Languages: A Case Study on Text De-Identification

Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation

The text2term tool to map free-text descriptions of biomedical terms to ontologies

Two Approaches for Biomedical Text Classification

OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction

Natural Language Processing to extract SNOMED-CT codes from pathological reports

Prioritization, clustering and functional annotation of MicroRNAs using latent semantic indexing of MEDLINE abstracts

TeamTat: a collaborative text annotation tool

Annotating and detecting phenotypic information for chronic obstructive pulmonary disease

OntoSem: an Ontology Semantic Representation Methodology for Biomedical Domain

Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks

Advancing equity in breast cancer care: natural language processing for analysing treatment outcomes in under-represented populations

Using text embedding models as text classifiers with medical data

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB