Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Jiayun Pang, Ivan Vulić
2024-05-17
Abstract: Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, although pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive, while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
Computation and Language, Artificial Intelligence, Machine Learning, Biomolecules
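The abstract names tokenisation as one of the factors studied. As a rough illustration of why it matters for sequence length, the sketch below contrasts byte-level input (ByT5 operates directly on UTF-8 bytes) with an atom-level split of a SMILES string. The regular expression, the function names, and the example molecules are assumptions made for illustration only; the pattern follows a common atom-level convention and is not taken from the paper.

```python
import re

# Atom-level SMILES pattern (an illustrative assumption, not the paper's tokeniser).
# Bracket atoms and two-letter halogens are tried before single characters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/.:~@?>*$])"
)


def atom_level_tokens(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Refuse inputs containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


def byte_level_tokens(smiles: str) -> list[int]:
    """Return the raw UTF-8 byte values, i.e. one token per byte."""
    return list(smiles.encode("utf-8"))


if __name__ == "__main__":
    benzoyl_chloride = "ClC(=O)c1ccccc1"
    print(atom_level_tokens(benzoyl_chloride))
    print(len(atom_level_tokens(benzoyl_chloride)), "atom-level tokens")
    print(len(byte_level_tokens(benzoyl_chloride)), "byte-level tokens")
```

Multi-character units such as `Cl` or `[NH4+]` become a single atom-level token but several byte-level tokens, so byte-level inputs are longer. A real byte-level tokeniser would also reserve ids for special tokens, which this sketch omits.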
What problem does this paper attempt to address?
This paper addresses two questions: 1) whether FlanT5 and ByT5, encoder-decoder models pretrained only on language data, can be effectively specialized for organic reaction prediction through task-specific fine-tuning; and 2) how choices such as input preprocessing (tokenization), training data efficiency, the pretrained starting model, and the decoding algorithm affect reaction prediction. The study finds that, although FlanT5 and ByT5 were not pretrained on chemical data, they provide a solid basis for fine-tuning and become "chemistry domain compatible" in the process, which indicates that large-scale, GPU-intensive pretraining on unlabelled molecules may be useful but is not essential for applying language models to chemistry tasks. The paper also compares tokenization methods, vocabulary trimming, fine-tuning sample efficiency, and decoding algorithms at inference. Tokenization and vocabulary trimming change final accuracy only slightly but can speed up training and inference, and simple greedy decoding is very competitive, with more sophisticated decoding algorithms yielding only marginal gains. In addition, the pretraining used to initialize the model has a significant impact, and cross-domain pretraining may provide a better starting point. Overall, the paper systematically evaluates FlanT5 and ByT5 on organic reaction prediction and offers guidance for using these language models more effectively on chemistry-related tasks.
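The Top-1 and Top-5 figures mentioned in the abstract are Top-k accuracies: a prediction counts as correct if the reference product appears among the first k ranked candidates. A minimal sketch follows, assuming exact string matching between candidates and references. Evaluations of this kind often canonicalise SMILES before comparing, which is omitted here; the function name and the toy data are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose reference is among the first k candidates.

    Assumes exact string equality; SMILES canonicalisation is left out.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(
        reference in candidates[:k]
        for candidates, reference in zip(ranked_predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    # Toy data: each inner list is ranked from most to least likely.
    predictions = [
        ["CCO", "CCN"],
        ["CC=O", "CCO"],
        ["c1ccccc1", "C1CCCCC1"],
    ]
    targets = ["CCO", "CCO", "CCC"]
    print("Top-1:", top_k_accuracy(predictions, targets, k=1))
    print("Top-2:", top_k_accuracy(predictions, targets, k=2))
```

Greedy decoding returns a single candidate per input, so on its own it supports only Top-1; scoring Top-5 requires a decoder that returns several ranked candidates, such as beam search.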