Abstract: Transformer-based deep neural networks have revolutionized the field of molecular-related prediction tasks by treating molecules as symbolic sequences. These models have been successfully applied in various organic chemical applications by pretraining them with extensive compound libraries and subsequently fine-tuning them with smaller in-house datasets for specific tasks. However, many conventional methods primarily focus on single molecules, with limited exploration of pretraining for reactions involving multiple molecules. In this paper, we propose ReactionT5, a novel model that leverages pretraining on the Open Reaction Database (ORD), a publicly available large-scale resource. We further fine-tune this model for yield prediction and product prediction tasks, demonstrating its impressive performance even with limited fine-tuning data compared to traditional models. The pre-trained ReactionT5 model is publicly accessible on the Hugging Face platform.

What problem does this paper attempt to address?

The paper aims to address two key tasks in organic chemistry reaction prediction: product prediction and yield prediction, particularly improving model performance when fine-tuning data is limited. To achieve this goal, the research team developed a novel pre-trained model called ReactionT5.

### Research Background

Traditional methods for predicting organic chemistry reactions rely on the experience and knowledge of chemists, making it difficult to handle the complexity and diversity of reactions. In recent years, deep learning methods have shown great potential in handling large chemical reaction datasets, supporting chemists through data-driven analysis. Although significant progress has been made with deep learning models based on single molecules, there is relatively little research on pre-trained models for chemical reactions involving multiple molecules.

### Main Contributions

1. **ReactionT5 Model**: This is a model based on the Transformer architecture. It is first pre-trained using a large compound library and then further pre-trained using the publicly available Open Reaction Database (ORD). The pre-trained model can be fine-tuned on smaller datasets to meet specific task requirements.
2. **Two-Stage Pre-Training Process**:
   - **Compound Pre-Training**: Pre-training is performed using compounds from the ZINC database to enable the model to understand and represent single molecular structures.
   - **Reaction Pre-Training**: Building on this, the model is further pre-trained using reaction data from the ORD database to learn the interactions between multiple molecules.
3. **Addressing Unclassified Compounds in the ORD Database**: The ORD database contains many compounds with unclear roles (e.g., reactants, reagents). The researchers developed an auxiliary model called RestorationT5 to classify these unclassified compounds as reactants or reagents, thereby expanding the amount of data available for training.
4. **Product Prediction and Yield Prediction Tasks**: For the product prediction task, the model aims to predict the SMILES representation of the product from given reactants, catalysts, reagents, and solvent information. For the yield prediction task, the goal is to predict the yield of the chemical reaction.

### Experimental Results

- **Product Prediction**: Experiments on the USPTO dataset show that even with minimal fine-tuning data, the restored ReactionT5 achieves performance comparable to other models trained on the full dataset.
- **Yield Prediction**: Experimental results on the Buchwald-Hartwig C-N coupling reaction dataset indicate that ReactionT5 performs well not only on randomly split datasets but also demonstrates excellent generalization ability on more challenging external test sets.

In summary, the proposed ReactionT5 model effectively leverages large-scale public data for pre-training and significantly improves the performance of organic chemistry reaction prediction tasks with minimal fine-tuning data.
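To make the product prediction setup described above more concrete, here is a minimal sketch of how a multi-molecule reaction might be flattened into a single text sequence for a sequence-to-sequence model. The `Reaction` class, the function names, and the role labels (`REACTANT:`, `CATALYST:`, `REAGENT:`, `SOLVENT:`) are illustrative assumptions; the summary does not state the exact input format ReactionT5 uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reaction:
    """Hypothetical container for one reaction record (not the ORD schema)."""
    reactants: List[str]
    catalysts: List[str] = field(default_factory=list)
    reagents: List[str] = field(default_factory=list)
    solvents: List[str] = field(default_factory=list)


def join_smiles(smiles: List[str]) -> str:
    # "." is the SMILES separator for disconnected components.
    # Sorting makes the string independent of the input order.
    return ".".join(sorted(smiles))


def to_model_input(rxn: Reaction) -> str:
    """Serialize a reaction into one role-tagged string.

    The role labels are assumptions made for illustration only.
    """
    parts = [
        ("REACTANT:", rxn.reactants),
        ("CATALYST:", rxn.catalysts),
        ("REAGENT:", rxn.reagents),
        ("SOLVENT:", rxn.solvents),
    ]
    return " ".join(label + join_smiles(s) for label, s in parts)


# Acid-catalyzed esterification of acetic acid with ethanol.
example = Reaction(
    reactants=["CCO", "CC(=O)O"],
    catalysts=["OS(=O)(=O)O"],
)
print(to_model_input(example))
```

In a product prediction setting, a string like this would be the source sequence and the product SMILES (here ethyl acetate, `CCOC(C)=O`) the target. For yield prediction, the same kind of input could be paired with a numeric yield instead of a SMILES target.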

ReactionT5: a large-scale pre-trained model towards application of limited reaction data
