Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Jiayun Pang, Ivan Vulić

2024-05-17

Abstract: Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive, while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
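To make the tokenisation question concrete, the sketch below contrasts two ways of splitting a SMILES string: an atom-level split using a regular expression of the kind commonly used in reaction-prediction work, and a byte-level split of the kind a byte-level model such as ByT5 consumes. This is an illustration only: the pattern, the function names `atom_tokens` and `byte_tokens`, and the example molecule are assumptions, not the paper's actual preprocessing.

```python
import re
from typing import List

# Atom-level SMILES pattern (assumption: the paper's tokeniser may differ).
# Bracket atoms such as [NH4+] and two-letter halogens (Cl, Br) stay whole.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atom_tokens(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens."""
    return SMILES_PATTERN.findall(smiles)


def byte_tokens(smiles: str) -> List[int]:
    """Split a SMILES string into its raw UTF-8 byte values."""
    return list(smiles.encode("utf-8"))


if __name__ == "__main__":
    smiles = "CC(=O)Cl"  # acetyl chloride, chosen only as an example
    print(atom_tokens(smiles))       # atom-level view
    print(len(byte_tokens(smiles)))  # byte-level sequence length
```

Byte-level input needs no chemistry-specific vocabulary but yields longer sequences, since multi-character atoms such as `Cl` become several tokens. This is one reason tokenisation choice can affect training and inference speed.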

Computation and Language, Artificial Intelligence, Machine Learning, Biomolecules

What problem does this paper attempt to address?

This paper addresses two questions: 1) whether FlanT5 and ByT5, encoder-decoder models pretrained only on language data, can be effectively applied to organic reaction prediction through task-specific fine-tuning; 2) how to optimize key steps in chemical reaction prediction, such as input preprocessing, training data efficiency, the choice of pretrained model, and the decoding algorithm. The research found that although FlanT5 and ByT5 were not pretrained on chemical data, they can serve as a basis for fine-tuning and become "chemically compatible" models, indicating that large-scale GPU-intensive pretraining on molecules may not be necessary for using language models in chemical tasks. The paper also compares different tokenization methods, vocabulary pruning, training sample efficiency, and decoding algorithms at inference, and finds that a simple greedy decoding strategy is very effective, while more complex decoding algorithms yield only minor performance improvements. In addition, pretraining has a significant impact on model initialization, and cross-domain pretraining may provide a better starting point. In conclusion, the paper systematically evaluates FlanT5 and ByT5 on organic reaction prediction through empirical research, and provides guidance for more effective future use of these language models.
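The Top-1 and Top-5 accuracy figures mentioned above can be read as follows: a prediction counts as correct at Top-k if the reference product appears among the model's k highest-ranked candidates. A minimal sketch is given below, assuming candidates and targets are already canonicalised SMILES strings compared by exact match; the function name `top_k_accuracy` and the toy data are hypothetical, and the paper's evaluation pipeline may differ.

```python
from typing import Sequence


def top_k_accuracy(
    candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose target is among the first k candidates.

    Assumes each candidate list is ranked best-first and that all strings
    are already canonicalised, so exact string match is meaningful.
    """
    if len(candidates) != len(targets):
        raise ValueError("candidates and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(candidates, targets) if target in ranked[:k]
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Toy ranked predictions for three reactions (made-up data).
    ranked_predictions = [
        ["CCO", "CCN"],
        ["CC=O", "CCO"],
        ["c1ccccc1", "CCC"],
    ]
    references = ["CCO", "CCO", "CCCC"]
    print(top_k_accuracy(ranked_predictions, references, k=1))
    print(top_k_accuracy(ranked_predictions, references, k=2))
```

Greedy decoding produces a single candidate per input, so it can only be scored at Top-1; beam search or sampling is needed to obtain the ranked lists used for Top-5.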


Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Transformer Performance for Chemical Reactions: Analysis of Different Predictive and Evaluation Scenarios

ReactionT5: a large-scale pre-trained model towards application of limited reaction data

Harnessing Data Augmentation and Normalization Preprocessing to Improve the Performance of Chemical Reaction Predictions of Data-Driven Model

Unified Deep Learning Model for Multitask Reaction Predictions with Explanation

Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry

"Found in Translation": Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models

Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction

Prediction of chemical reaction yields using deep learning

State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis

ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining

Learning Chemical Reaction Representation with Reactant-Product Alignment

Bridging the Gap between Chemical Reaction Pretraining and Conditional Molecule Generation with a Unified Model

Molecular Transformer - A Model for Uncertainty-Calibrated Chemical Reaction Prediction

Discovering Photoswitchable Molecules for Drug Delivery with Large Language Models and Chemist Instruction Training

Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Fine-tuning Large Language Models for Chemical Text Mining

Developing Deep Learning-Based Large-Scale Organic Reaction Classification Model Via Sigma-Profiles