Abstract: Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex, thus robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1,378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.

What problem does this paper attempt to address?

The paper addresses the problem of **automatically parsing reaction diagrams in the chemistry literature and extracting the reaction information they contain**. Specifically, the task is to identify the reactants, conditions, and products in a reaction diagram given as an image and to convert this information into a structured data format.

### Problem Background

In the chemistry literature, new reactions and synthetic pathways are usually presented as diagrams. These diagrams can be very complex and come in single-line, multi-line, tree-like, graph-like, and other styles (as shown in Figure 1). Robustly parsing them into structured data is therefore an open challenge. Traditional methods are usually rule-based pipelines: molecular images are segmented first, their chemical structures are then recognized, and finally reactants, conditions, and products are assigned according to their positional relationships. Such pipelines are sensitive to changes in drawing style, which leads to poor performance.

### Paper Solution

The paper proposes a machine learning model named **RxnScribe** for parsing reaction diagrams of different styles. The authors formulate this structured prediction task as a sequence generation problem and design an end-to-end model that replaces the traditional multi-stage pipeline. Specifically:

- **Input**: The image of the reaction diagram.
- **Output**: A sequence representing the reaction structure, where each entity (molecule, text, etc.) is described by its bounding-box coordinates and type.
- **Model Architecture**: An encoder-decoder structure. The encoder uses a convolutional neural network (CNN) to encode the input image, and the decoder uses a Transformer network to generate the output sequence autoregressively.
- **Training Data**: 1,378 reaction diagrams covering four common styles, annotated with reactants, conditions, and products.

### Main Contributions

1. **Sequence Generation Model**: Representing the reaction structure as a sequence avoids the error propagation problem of traditional multi-stage methods.
2. **Data Augmentation**: The paper proposes a data augmentation strategy that combines simple diagrams into more complex ones to improve the model's generalization ability.
3. **Significant Performance Improvement**: In the cross-validation experiment, RxnScribe achieves a soft match F1 score of 80.0%, significantly outperforming existing models.

### Summary

RxnScribe aims to provide a general and robust machine learning solution that can parse reaction diagrams of various styles in the chemistry literature and convert them into structured reaction information. Compared with traditional rule-based methods, RxnScribe achieves higher accuracy and better generalization through an end-to-end sequence generation model.
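The output format described under Paper Solution (each entity given by its bounding-box coordinates and a type, all emitted as one sequence) could be linearized roughly as follows. This is a minimal Python sketch and not the authors' implementation. The names `Entity`, `Reaction`, `quantize`, and `serialize`, the special tokens such as `<rxn>` and `<reactants>`, and the choice of 1000 coordinate bins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumed number of coordinate bins; not taken from the paper


@dataclass(frozen=True)
class Entity:
    # (x_min, y_min, x_max, y_max), normalized to [0, 1] by image width/height
    box: Tuple[float, float, float, float]
    kind: str  # hypothetical type label, e.g. "molecule" or "text"


@dataclass
class Reaction:
    reactants: List[Entity]
    conditions: List[Entity]
    products: List[Entity]


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def entity_tokens(entity: Entity) -> List[str]:
    """Four quantized coordinate tokens followed by one type token."""
    return [str(quantize(c)) for c in entity.box] + [f"<{entity.kind}>"]


def serialize(reactions: List[Reaction]) -> List[str]:
    """Flatten a list of reactions into a single token sequence."""
    tokens = ["<bos>"]
    for rxn in reactions:
        tokens.append("<rxn>")
        groups = (
            ("<reactants>", rxn.reactants),
            ("<conditions>", rxn.conditions),
            ("<products>", rxn.products),
        )
        for role_token, entities in groups:
            tokens.append(role_token)
            for entity in entities:
                tokens.extend(entity_tokens(entity))
        tokens.append("</rxn>")
    tokens.append("<eos>")
    return tokens


# Example: one reaction with a single reactant, condition, and product.
example = Reaction(
    reactants=[Entity((0.0, 0.25, 0.25, 0.75), "molecule")],
    conditions=[Entity((0.25, 0.375, 0.5, 0.5), "text")],
    products=[Entity((0.5, 0.25, 1.0, 0.75), "molecule")],
)
tokens = serialize([example])
print(" ".join(tokens))
```

A sequence of this kind is what an autoregressive decoder would be trained to emit token by token. Because reactant, condition, and product roles are encoded directly in the sequence, no separate rule-based step is needed to assign them from positions.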

RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing
