EnzChemRED, a rich enzyme chemistry relation extraction dataset

Po-Ting Lai, Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Chih-Hsuan Wei, Robert Leaman, Ling Luo, Zhiyong Lu, Alan Bridge

2024-04-22

Abstract: Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at

Computation and Language

What problem does this paper attempt to address?

The paper presents EnzChemRED, a dataset for enzyme chemistry relation extraction, built to help automate the extraction of enzyme function knowledge from the scientific literature and so assist expert curation. Expert curation alone cannot keep pace with new discoveries and publications, and natural language processing techniques such as pre-trained language models can help close that gap. EnzChemRED consists of 1,210 expert-curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated with UniProtKB and ChEBI identifiers. Fine-tuning pre-trained language models on this dataset improves their ability to identify mentions of proteins and chemicals (NER) and to extract the chemical conversions in which they participate (RE). The paper also combines the best-performing fine-tuned methods into an end-to-end knowledge extraction pipeline and applies it to abstracts at PubMed scale, producing a draft map of enzyme functions to guide curation in UniProtKB and the reaction knowledgebase Rhea.
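The F1 scores quoted in the abstract combine precision and recall. As a minimal sketch of how such a score could be computed for chemical conversion pairs, the Python below treats gold and predicted relations as sets of tuples and compares them. The `Conversion` schema, the `precision_recall_f1` helper, and all identifiers (`PMID_1`, `CHEBI:A`, and so on) are illustrative placeholders, not the paper's evaluation code or real database identifiers.

```python
from typing import NamedTuple


class Conversion(NamedTuple):
    """One chemical conversion pair in one abstract (hypothetical schema)."""
    pmid: str       # placeholder document identifier
    substrate: str  # placeholder standing in for a ChEBI identifier
    product: str    # placeholder standing in for a ChEBI identifier


def precision_recall_f1(
    gold: set[Conversion], predicted: set[Conversion]
) -> tuple[float, float, float]:
    """Score predictions against gold annotations by exact tuple match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: two of three predictions match two of three gold pairs.
gold = {
    Conversion("PMID_1", "CHEBI:A", "CHEBI:B"),
    Conversion("PMID_1", "CHEBI:C", "CHEBI:D"),
    Conversion("PMID_2", "CHEBI:E", "CHEBI:F"),
}
predicted = {
    Conversion("PMID_1", "CHEBI:A", "CHEBI:B"),
    Conversion("PMID_2", "CHEBI:E", "CHEBI:F"),
    Conversion("PMID_2", "CHEBI:E", "CHEBI:G"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```

Exact-match scoring over identifier tuples is only one possible convention; the paper may score relations differently, for example by also requiring the linked enzyme to match.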

BioRED: a rich biomedical relation extraction dataset

Enzyme annotation in UniProtKB using Rhea

OpenChemIE: An Information Extraction Toolkit For Chemistry Literature

ReactZyme: A Benchmark for Enzyme-Reaction Prediction

Automated Chemical Reaction Extraction from Scientific Literature

BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets

Larval Growth in Postlarvae of Penaeus indicus on Exposure to Lead

FoodChem: A food-chemical relation extraction model

End-to-End Models for Chemical-Protein Interaction Extraction: Better Tokenization and Span-Based Pipeline Strategies

A large-scale evaluation of NLP-derived chemical-gene/protein relationships from the scientific literature: Implications for knowledge graph construction

CEAR: Automatic construction of a knowledge graph of chemical entities and roles from scientific literature

Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach

DocRED: A Large-Scale Document-Level Relation Extraction Dataset

NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature

The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII

CARE: a Benchmark Suite for the Classification and Retrieval of Enzymes

Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach

Descriptor-augmented machine learning for enzyme-chemical interaction predictions

ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents