Data augmentation in a triple transformer loop retrosynthesis model

Jean-Louis Reymond, Yves Grandjean, David Kreutter
DOI: https://doi.org/10.26434/chemrxiv-2024-r3x05
2024-08-01
Abstract: The reaction dataset from the US Patent Office (USPTO), which is used broadly for training computer-assisted synthesis planning (CASP) retrosynthesis models, is biased towards a few over-represented reaction types such as palladium couplings and protecting group operations. Here we applied 14,325 reaction templates extracted from USPTO reactions to 1,505,837 USPTO molecules and used a transformer-based approach derived from our recently reported triple transformer loop (TTL) retrosynthesis model to test and validate up to 5,000 reactions per template. This approach yielded 25.7 million fictive reactions, from which we selected up to 90 reactions per template to form an equilibrated augmented dataset of 1,000,245 reactions. Combining the original USPTO dataset with this augmented dataset by multitask transfer learning produced a new TTL model with increased performance in terms of overall and template-averaged single-step round-trip accuracy. Further performance increases were obtained by applying a new disconnection-aware forward validation transformer.
Chemistry
What problem does this paper attempt to address?
The paper addresses dataset bias in computer-assisted synthesis planning (CASP). The reaction dataset from the US Patent Office (USPTO), which is widely used to train CASP retrosynthesis models, is dominated by a few over-represented reaction types, such as palladium couplings and protecting group operations, while other reaction types are rarely recorded. This bias limits model performance on less common reactions.
The proposed solution is data augmentation built on the authors' previously reported triple transformer loop (TTL) model. The authors extracted 14,325 reaction templates from USPTO reactions and applied them to 1,505,837 molecules from the same dataset, testing up to 5,000 reactions per template. Transformer T2 of the TTL model predicted possible reagents for each candidate reaction, and a modified transformer T3* checked its validity: a fictive reaction was accepted only if the predicted product matched the original product with a confidence score above 95%. This procedure yielded approximately 25.7 million fictive reactions, from which up to 90 per template were selected to form a balanced dataset of 1,000,245 reactions.
The authors then retrained the TTL model by multitask transfer learning (MTL), combining the original USPTO dataset with the augmented dataset. The resulting model (TTL MT) achieved higher single-step retrosynthesis accuracy. The gain was larger still with the improved transformer T3*, which can identify reaction sites (TTL MT*).
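The validation and balancing steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the `Candidate` record, the function names, and the first-come selection order are all assumptions (the summary does not say how the 90 reactions per template were chosen), and the SMILES strings are assumed to be canonicalized already so that string equality is a fair product comparison.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One fictive reaction produced by applying a template (hypothetical record)."""
    template_id: str
    reactants: str          # SMILES of the template-generated starting materials
    original_product: str   # SMILES of the molecule the template was applied to
    predicted_product: str  # SMILES returned by the forward-validation model
    confidence: float       # forward-model confidence, between 0.0 and 1.0


def is_valid(c, min_confidence=0.95):
    """Accept a candidate only if the forward prediction reproduces the
    original product with confidence strictly above the threshold."""
    return c.predicted_product == c.original_product and c.confidence > min_confidence


def select_balanced(candidates, max_per_template=90, min_confidence=0.95):
    """Return at most max_per_template valid candidates per template.

    Assumption: candidates are kept in the order they are encountered.
    """
    selected = defaultdict(list)
    for c in candidates:
        if not is_valid(c, min_confidence):
            continue
        bucket = selected[c.template_id]
        if len(bucket) < max_per_template:
            bucket.append(c)
    return dict(selected)
```

Capping every template at the same number is what keeps a frequent template from dominating the augmented set, while rare templates contribute whatever valid reactions they have.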
In summary, the study uses data augmentation to counter the bias in the USPTO dataset, thereby improving the CASP model's ability to handle a wider range of reaction types.
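The abstract reports both overall and template-averaged accuracy, and the difference between the two shows why balancing matters. The sketch below assumes that template-averaged accuracy is the plain mean of per-template accuracies; the exact definition used in the paper is not given here, and the function names and toy data are made up for illustration.

```python
from collections import defaultdict


def overall_accuracy(results):
    """Fraction of all test reactions predicted correctly.

    results: iterable of (template_id, is_correct) pairs.
    """
    results = list(results)
    return sum(ok for _, ok in results) / len(results)


def template_averaged_accuracy(results):
    """Mean of per-template accuracies, so every template counts equally."""
    per_template = defaultdict(list)
    for template_id, ok in results:
        per_template[template_id].append(ok)
    accuracies = [sum(v) / len(v) for v in per_template.values()]
    return sum(accuracies) / len(accuracies)


# Toy data: a common template that is always right, a rare one that fails.
toy = [("common", True)] * 9 + [("rare", False)]
print(overall_accuracy(toy))            # 0.9
print(template_averaged_accuracy(toy))  # 0.5
```

In this toy case a model can score 90% overall while failing completely on the rare template, which only the template-averaged figure exposes.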