Beyond associations: A benchmark Causal Relation Extraction Dataset (CRED) of disease-causing genes, its comparative evaluation, interpretation and application

Nency Bansal, Sri Dhinesh R C, Ayush Pathak, Manikandan Narayanan
DOI: https://doi.org/10.1101/2024.09.17.613424
2024-09-21
Abstract: Information on causal relationships is essential to many sciences (including biomedical science, where knowing if a gene-disease relation is causal vs. merely associative can lead to better treatments), and can foster research on causal side-information-based machine learning as well. Automatically extracting causal relations from large text corpora remains less explored though, despite much work on Relation Extraction (RE). The few existing CRE (Causal RE) studies are limited to extracting causality within a sentence or for a particular disease, mainly due to the lack of a diverse benchmark dataset. Here, we carefully curate a new CRE Dataset (CRED) of 3553 (causal and non-causal) gene-disease pairs, spanning 284 diseases and 500 genes, within or across sentences of 267 published abstracts. CRED is assembled in two phases to reduce class imbalance and its inter-annotator agreement is 89%. To assess CRED's utility in classifying causal vs. non-causal pairs, we compared multiple classifiers and found SVM to perform the best (F1 score 0.70). Both in terms of classifier performance and model interpretability (i.e., whether the model focuses importance/attention on words with causal connotations in abstracts), CRED outperformed a state-of-the-art RE dataset. To move from benchmarks to real-world settings, our CRED-trained classification model was applied on all PubMed abstracts on Parkinson's disease (PD). Genes predicted to be causal for PD by our model in at least 50 abstracts got validated in textbook sources. Besides these well-studied genes, our model revealed less-studied genes that could be explored further. Our systematically curated and evaluated CRED, and its associated classification model and CRED-wide gene-disease causality scores, thus offer concrete resources for advancing future research in CRE from biomedical literature.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the problem of automatically extracting causal relationships from biomedical literature, and in particular of distinguishing causal from merely associative gene-disease relations. The authors curate a new benchmark dataset, CRED (Causal Relation Extraction Dataset), containing 3,553 causal and non-causal gene-disease pairs that span 284 diseases and 500 genes. The pairs are drawn from 267 published abstracts and include relations expressed both within a sentence and across sentences. Comparing several classifiers on the task of separating causal from non-causal pairs, the authors found that a Support Vector Machine (SVM) performed best, with an F1 score of 0.70. CRED also outperformed a state-of-the-art RE dataset in both classifier performance and model interpretability. The CRED-trained model was then applied to all PubMed abstracts on Parkinson's disease: genes it predicted to be causal in at least 50 abstracts were validated against textbook sources, and it also surfaced less-studied genes that could be explored further. Together, the dataset, the associated classification model, and the CRED-wide gene-disease causality scores are offered as resources for future research on causal relation extraction from biomedical literature.
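The Parkinson's disease step, in which a gene is retained when the model predicts it as causal in at least 50 abstracts, can be illustrated with a minimal sketch. It assumes per-abstract predictions are available as (abstract ID, gene, causal flag) triples. The function name, the input layout, and the toy gene labels are hypothetical and are not taken from the paper's code or data.

```python
from collections import defaultdict


def genes_supported_by_abstracts(predictions, min_abstracts=50):
    """Count, per gene, the distinct abstracts with a causal prediction.

    predictions: iterable of (abstract_id, gene, is_causal) triples
                 (hypothetical layout, assumed for illustration).
    min_abstracts: minimum number of supporting abstracts; 50 mirrors
                   the threshold mentioned in the abstract.
    Returns a dict mapping each retained gene to its abstract count.
    """
    supporting = defaultdict(set)
    for abstract_id, gene, is_causal in predictions:
        if is_causal:
            # A set ensures each abstract is counted once per gene.
            supporting[gene].add(abstract_id)
    counts = {gene: len(ids) for gene, ids in supporting.items()}
    return {gene: n for gene, n in counts.items() if n >= min_abstracts}


# Toy input with placeholder gene labels (not real predictions).
toy_predictions = [
    ("a1", "GENE_A", True),
    ("a2", "GENE_A", True),
    ("a2", "GENE_A", True),   # repeated mention in the same abstract
    ("a3", "GENE_A", False),  # non-causal prediction, ignored
    ("a1", "GENE_B", True),
]

# A small threshold is used here only so the toy data yields a result.
result = genes_supported_by_abstracts(toy_predictions, min_abstracts=2)
print(result)
```

Counting distinct abstracts rather than individual mentions keeps a single abstract that names a gene repeatedly from inflating its support. Whether the authors aggregate in exactly this way is not stated in the abstract.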