Abstract: The reaction dataset from the US Patent Office (USPTO), which is used broadly for training computer-assisted synthesis planning (CASP) retrosynthesis models, is biased towards a few over-represented reaction types such as palladium couplings and protecting group operations. Here we applied 14,325 reaction templates extracted from USPTO reactions to 1,505,837 USPTO molecules and used a transformer-based approach derived from our recently reported triple transformer loop (TTL) retrosynthesis model to test and validate up to 5,000 reactions per template. This approach yielded 25.7 million fictive reactions, from which we selected up to 90 reactions per template to form an equilibrated augmented dataset of 1,000,245 reactions. Combining the original USPTO dataset with this augmented dataset by multitask transfer learning produced a new TTL model with increased performance in terms of overall and template-averaged single-step round-trip accuracy. Further performance increases were obtained by applying a new disconnection-aware forward validation transformer.

What problem does this paper attempt to address?

The paper addresses dataset bias in computer-assisted synthesis planning (CASP). The chemical reaction dataset from the United States Patent and Trademark Office (USPTO), which is widely used for training CASP models, over-represents a few common reaction types (such as palladium-coupling reactions and protecting group manipulations), while other reaction types are rarely recorded. This bias limits model performance on less common reactions.

The proposed solution is data augmentation built on the previously reported Triple Transformer Loop (TTL) model. The authors extracted 14,325 reaction templates from the USPTO dataset and applied them to 1,505,837 molecules from the same dataset, generating approximately 25.7 million fictive reactions. Transformer T2 of the TTL model was then used to predict possible reagents, and the modified transformer T3* was used to verify each fictive reaction: a reaction was considered valid if the predicted product matched the original product with a confidence score higher than 95%. Up to 90 valid fictive reactions per template were then selected, forming a balanced dataset of 1,000,245 valid fictive reactions.

The authors then retrained the TTL model using multitask transfer learning (MTL), combining the original USPTO dataset with the augmented dataset. The TTL model trained with the fictive reactions (TTL MT) improved the accuracy of single-step retrosynthesis prediction. The improvement was larger still with the modified transformer T3*, which can identify reaction sites (TTL MT*).

In summary, this study aims to correct the bias in the USPTO dataset through data augmentation, thereby improving the CASP model's ability to handle a wider range of reaction types.
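The validation and balancing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `FictiveReaction` record, its field names, and both function names are hypothetical, the forward-model outputs are assumed to be precomputed, and the first 90 valid reactions per template are kept in input order because the summary does not say how the subset is chosen. The second function assumes that "template-averaged" accuracy means the unweighted mean of per-template accuracies, which is one plausible reading of the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FictiveReaction:
    """Hypothetical record for one template-generated reaction."""
    template_id: str        # which template produced the reaction
    reaction_smiles: str    # reactants>reagents>product
    product_matches: bool   # forward model reproduced the original product
    confidence: float       # forward-model confidence, in [0, 1]


def select_balanced(reactions, min_confidence=0.95, max_per_template=90):
    """Keep validated reactions, capped at max_per_template per template.

    Assumption: the first valid reactions in input order are kept.
    """
    kept = defaultdict(list)
    for rxn in reactions:
        is_valid = rxn.product_matches and rxn.confidence > min_confidence
        if is_valid and len(kept[rxn.template_id]) < max_per_template:
            kept[rxn.template_id].append(rxn)
    return [rxn for template_id in sorted(kept) for rxn in kept[template_id]]


def overall_and_template_averaged_accuracy(results):
    """Compute both accuracies from (template_id, is_correct) pairs.

    Assumption: template-averaged accuracy is the unweighted mean of
    per-template accuracies, so rare templates count as much as common ones.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for template_id, is_correct in results:
        correct[template_id] += int(is_correct)
        total[template_id] += 1
    if not total:
        raise ValueError("results must not be empty")
    overall = sum(correct.values()) / sum(total.values())
    averaged = sum(correct[t] / total[t] for t in total) / len(total)
    return overall, averaged
```

On a biased test set the two metrics can diverge sharply: a model that is right on nine examples of a common template and wrong on the single example of a rare one scores 0.9 overall but only 0.5 template-averaged, which is why a balanced augmented dataset is expected to matter most for the second metric.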

Data augmentation in a triple transformer loop retrosynthesis model

State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis

Harnessing Data Augmentation and Normalization Preprocessing to Improve the Performance of Chemical Reaction Predictions of Data-Driven Model

RetroPrime: A Diverse, Plausible and Transformer-based Method for Single-Step Retrosynthesis Predictions

RetroPrime: A Chemistry-Inspired and Transformer-based Method for Retrosynthesis Predictions

Reagent prediction with a molecular transformer improves reaction data quality

Retrosynthesis prediction enhanced by in-silico reaction data augmentation

Chemoenzymatic Multistep Retrosynthesis with Transformer Loops

Dissecting Errors in Machine Learning for Retrosynthesis: A Granular Metric Framework and Transformer-Based Model for More Informative Predictions

Transformer Performance for Chemical Reactions: Analysis of Different Predictive and Evaluation Scenarios

Permutation invariant graph-to-sequence model for template-free retrosynthesis and reaction prediction

Predicting Enzymatic Reactions with a Molecular Transformer

ReactionT5: a large-scale pre-trained model towards application of limited reaction data

Virtual data augmentation method for reaction prediction

RetroCaptioner: beyond attention in end-to-end retrosynthesis transformer via contrastively captioned learnable graph representation

CTsynther: Contrastive Transformer model for end-to-end retrosynthesis prediction

Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis

Beyond the Typical: Modeling Rare Plausible Patterns in Chemical Reactions by Leveraging Sequential Mixture-of-Experts

Retroformer: Pushing the Limits of Interpretable End-to-end Retrosynthesis Transformer