ReactionT5: a large-scale pre-trained model towards application of limited reaction data

Tatsuya Sagawa, Ryosuke Kojima
2023-11-12
Abstract: Transformer-based deep neural networks have revolutionized the field of molecular-related prediction tasks by treating molecules as symbolic sequences. These models have been successfully applied in various organic chemical applications by pretraining them with extensive compound libraries and subsequently fine-tuning them with smaller in-house datasets for specific tasks. However, many conventional methods primarily focus on single molecules, with limited exploration of pretraining for reactions involving multiple molecules. In this paper, we propose ReactionT5, a novel model that leverages pretraining on the Open Reaction Database (ORD), a publicly available large-scale resource. We further fine-tune this model for yield prediction and product prediction tasks, demonstrating its impressive performance even with limited fine-tuning data compared to traditional models. The pre-trained ReactionT5 model is publicly accessible on the Hugging Face platform.
Chemical Physics, Machine Learning
What problem does this paper attempt to address?
The paper aims to address two key tasks in organic chemistry reaction prediction: product prediction and yield prediction, particularly improving model performance when fine-tuning data is limited. To achieve this goal, the research team developed a novel pre-trained model called ReactionT5.

### Research Background

Traditional methods for predicting organic chemistry reactions rely on the experience and knowledge of chemists, making it difficult to handle the complexity and diversity of reactions. In recent years, deep learning methods have shown great potential in handling large chemical reaction datasets, supporting chemists through data-driven analysis. Although significant progress has been made with deep learning models based on single molecules, there is relatively little research on pre-trained models for chemical reactions involving multiple molecules.

### Main Contributions

1. **ReactionT5 Model**: This is a model based on the Transformer architecture. It is first pre-trained using a large compound library and then further pre-trained using the publicly available Open Reaction Database (ORD). The pre-trained model can be fine-tuned on smaller datasets to meet specific task requirements.
2. **Two-Stage Pre-Training Process**:
   - **Compound Pre-Training**: Pre-training is performed using compounds from the ZINC database to enable the model to understand and represent single molecular structures.
   - **Reaction Pre-Training**: Building on this, the model is further pre-trained using reaction data from the ORD database to learn the interactions between multiple molecules.
3. **Addressing Unclassified Compounds in the ORD Database**: The ORD database contains many compounds with unclear roles (e.g., reactants, reagents). The researchers developed an auxiliary model called RestorationT5 to classify these unclassified compounds as reactants or reagents, thereby expanding the amount of data available for training.
4. **Product Prediction and Yield Prediction Tasks**: For the product prediction task, the model aims to predict the SMILES representation of the product from given reactants, catalysts, reagents, and solvent information. For the yield prediction task, the goal is to predict the yield of the chemical reaction.

### Experimental Results

- **Product Prediction**: Experiments on the USPTO dataset show that even with minimal fine-tuning data, the restored ReactionT5 achieves performance comparable to other models trained on the full dataset.
- **Yield Prediction**: Experimental results on the Buchwald-Hartwig C-N coupling reaction dataset indicate that ReactionT5 performs well not only on randomly split datasets but also demonstrates excellent generalization ability on more challenging external test sets.

In summary, the proposed ReactionT5 model effectively leverages large-scale public data for pre-training and significantly improves the performance of organic chemistry reaction prediction tasks with minimal fine-tuning data.
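
As an illustration of the product-prediction input described in the contributions (reactants, catalysts, reagents, and solvents given as SMILES), the sketch below shows one way such a multi-molecule reaction might be flattened into a single text sequence for a sequence-to-sequence model. The `Reaction` class, the `to_model_input` function, and the role tags (`REACTANT:`, `REAGENT:`, `CATALYST:`, `SOLVENT:`) are assumptions made for this example, not the format confirmed by the paper; consult the published model card for the actual convention.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reaction:
    """SMILES strings grouped by role. Field names are illustrative only."""
    reactants: List[str]
    reagents: List[str] = field(default_factory=list)
    catalysts: List[str] = field(default_factory=list)
    solvents: List[str] = field(default_factory=list)


def to_model_input(rxn: Reaction) -> str:
    """Serialize a reaction into one role-tagged string.

    The tag names are hypothetical; this sketch simply omits empty roles.
    """
    parts = [
        ("REACTANT:", rxn.reactants),
        ("REAGENT:", rxn.reagents),
        ("CATALYST:", rxn.catalysts),
        ("SOLVENT:", rxn.solvents),
    ]
    # Molecules that share a role are joined with '.', the SMILES
    # separator for disconnected species.
    return "".join(tag + ".".join(mols) for tag, mols in parts if mols)


if __name__ == "__main__":
    # Ethanol and acetic acid as reactants, water as solvent.
    rxn = Reaction(reactants=["CCO", "CC(=O)O"], solvents=["O"])
    print(to_model_input(rxn))
```

A string built this way would be tokenized and passed to the fine-tuned model, which would then generate the product SMILES (product prediction) or a numeric yield (yield prediction). Keeping the role tags explicit is also what makes a role-restoration step such as RestorationT5 useful: compounds with unknown roles cannot be placed under a tag until they are classified.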