RxnScribe: A Sequence Generation Model for Reaction Diagram Parsing

Yujie Qian, Jiang Guo, Zhengkai Tu, Connor W. Coley, Regina Barzilay
2023-05-20
Abstract: Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex, thus robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1,378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of **automatically parsing reaction diagrams in the chemistry literature to extract chemical reaction information**. Specifically, the task is to identify the reactants, conditions, and products in a reaction diagram given as an image, and to convert them into structured, machine-readable data.

### Problem Background

In the chemistry literature, new reactions and synthetic pathways are usually presented as diagrams. These diagrams can be very complex and come in several styles, including single-line, multi-line, tree-like, and graph-like layouts (as shown in Figure 1). Robustly parsing them into structured data is therefore an open challenge.

Traditional methods are usually rule-based pipelines. They first segment the molecular images, then recognize their chemical structures, and finally assign reactant, condition, and product roles according to the positions of the entities. Such pipelines are sensitive to changes in drawing style, which leads to poor performance.

### Paper Solution

The paper proposes a machine learning model named **RxnScribe** for parsing reaction diagrams of varying styles. The authors formulate this structured prediction task as a sequence generation problem and design an end-to-end model that replaces the traditional multi-stage pipeline. Specifically:

- **Input**: The image of the reaction diagram.
- **Output**: A sequence representing the reaction structure, where each entity (molecule, text, etc.) is described by its bounding-box coordinates and type.
- **Model Architecture**: An encoder-decoder structure. The encoder uses a convolutional neural network (CNN) to extract features from the input image, and the decoder uses a Transformer network to generate the output sequence autoregressively.
- **Training Data**: 1,378 reaction diagrams covering four common styles, annotated with reactants, conditions, and products.

### Main Contributions

1. **Sequence Generation Model**: Representing the reaction structure as a single sequence avoids the error propagation between stages that affects traditional pipelines.
2. **Data Augmentation**: The authors propose an augmentation strategy that combines simple diagrams into more complex ones to improve the model's generalization.
3. **Significant Performance Improvement**: In cross-validation experiments, RxnScribe achieves a soft match F1 score of 80.0%, significantly outperforming previous models.

### Summary

RxnScribe aims to provide a general and robust machine learning solution that parses reaction diagrams of various styles in the chemistry literature and converts them into structured reaction information. Compared with traditional rule-based methods, its end-to-end sequence generation model achieves higher accuracy and better generalization.
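
To make the output format more concrete, the following minimal sketch shows one way a set of reactions could be flattened into a token sequence in which each entity is described by its bounding-box coordinates and type. It is an illustration under stated assumptions, not the authors' implementation: the number of coordinate bins (`N_BINS`), the special tokens (`[Rxn]`, `[Rct]`, `[Cnd]`, `[Prd]`, `[EOS]`), the type labels (`Mol`, `Txt`), and the token ordering are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed number of discrete coordinate bins (hypothetical value).
N_BINS = 1000


@dataclass
class Entity:
    # Bounding box (x1, y1, x2, y2), normalized to [0, 1].
    box: Tuple[float, float, float, float]
    # Entity type; "Mol" and "Txt" are illustrative labels.
    kind: str


@dataclass
class Reaction:
    reactants: List[Entity]
    conditions: List[Entity]
    products: List[Entity]


def quantize(v: float, n_bins: int = N_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    v = min(max(v, 0.0), 1.0)
    return min(int(v * n_bins), n_bins - 1)


def entity_tokens(e: Entity) -> List[str]:
    """Four coordinate tokens followed by one type token."""
    return [str(quantize(c)) for c in e.box] + [f"[{e.kind}]"]


def serialize(reactions: List[Reaction]) -> List[str]:
    """Flatten reactions into one token sequence (hypothetical layout)."""
    tokens: List[str] = []
    for r in reactions:
        tokens.append("[Rxn]")  # marks the start of a reaction
        groups = (
            ("[Rct]", r.reactants),
            ("[Cnd]", r.conditions),
            ("[Prd]", r.products),
        )
        for role_token, group in groups:
            for e in group:
                tokens += entity_tokens(e)
            tokens.append(role_token)  # closes the group for this role
    tokens.append("[EOS]")
    return tokens


if __name__ == "__main__":
    rxn = Reaction(
        reactants=[Entity((0.0, 0.25, 0.25, 0.5), "Mol")],
        conditions=[Entity((0.25, 0.25, 0.5, 0.5), "Txt")],
        products=[Entity((0.5, 0.25, 0.75, 0.5), "Mol")],
    )
    print(" ".join(serialize([rxn])))
```

A decoder trained on such sequences would emit one token at a time, so the entity boxes and their roles are predicted jointly rather than by separate segmentation and role-assignment stages.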