Transferring a molecular foundation model for polymer property predictions

Pei Zhang, Logan Kearney, Debsindhu Bhowmik, Zachary Fox, Amit K. Naskar, John Gounley
2023-10-26
Abstract: Transformer-based large language models have remarkable potential to accelerate design optimization for applications such as drug development and materials discovery. Self-supervised pretraining of transformer models requires large-scale datasets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers conduct data augmentation to generate additional samples but unavoidably incur extra computational costs. In contrast, large-scale open-source datasets are available for small molecules and provide a potential solution to data scarcity through transfer learning. In this work, we show that transformers pretrained on small molecules and fine-tuned on polymer properties achieve comparable accuracy to those trained on augmented polymer datasets for a series of benchmark prediction tasks.
Machine Learning, Chemical Physics
What problem does this paper attempt to address?
The paper addresses data scarcity in polymer property prediction, in particular the lack of the large-scale data needed to pre-train models. To overcome this, the researchers take a foundation model pre-trained on small-molecule datasets and transfer it to polymer property prediction. The paper's main points are as follows:

1. **Background and Challenges**:
   - Polymer informatics is a rapidly developing field that uses machine learning methods to discover and design new polymers.
   - Existing methods for polymer property prediction are often limited by the lack of labeled structure-property data.
   - Large language models such as TransPolymer and polyBERT achieve state-of-the-art results in polymer property prediction, but they require large amounts of training data, which are not readily available for polymers.
2. **Solution**:
   - The researchers evaluated whether models pre-trained on large-scale small-molecule datasets (such as Enamine REAL) could improve polymer property prediction through transfer learning.
   - The hypothesis is that during pre-training, whether on polymer data or on small-molecule data, the model primarily learns a general chemical language, which serves as a good starting point for subsequent property prediction tasks.
   - In the experiments, the researchers used a BERT model trained in two stages: first, unsupervised pre-training on Enamine REAL; second, supervised fine-tuning on an open-source density functional theory (DFT) dataset.
3. **Experimental Results**:
   - The model pre-trained on Enamine REAL (SML-MT) performed comparably to or better than TransPolymer and polyBERT on a series of benchmark prediction tasks.
   - Specifically, among 8 DFT properties, SML-MT achieved the best prediction accuracy on 5 properties, performed comparably on 2 properties, and performed worse on 1 property (crystallization tendency Xc).
   - This indicates that transfer learning can be effective for polymer property prediction even without a large-scale pre-training dataset specific to polymers.

In summary, the paper presents a foundation-model approach that uses large-scale small-molecule datasets to overcome data scarcity in polymer property prediction, and shows that the method is effective and feasible.
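To make the first training stage more concrete, the sketch below shows how unlabeled molecules can be turned into training examples for BERT-style masked language modeling, the standard BERT pre-training objective. It is a minimal illustration under stated assumptions, not the authors' pipeline: the summary does not say how molecules are encoded, so SMILES strings are assumed; `tokenize` and `mask_tokens` are hypothetical helpers; the tokenizer is a simplified one; and the 15% masking rate is the default from the original BERT setup rather than a value taken from this paper. Full BERT masking also sometimes substitutes a random token or leaves the token unchanged, which is omitted here.

```python
import random

MASK = "[MASK]"  # placeholder token the model must learn to fill in


def tokenize(smiles):
    """Split a SMILES string into tokens (simplified, hypothetical tokenizer).

    Bracket atoms such as [C@H] or [*] and the two-letter halogens Cl/Br
    are kept whole; every other character becomes its own token.
    """
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i] == "[":
            j = smiles.index("]", i)  # raises ValueError on an unclosed bracket
            tokens.append(smiles[i:j + 1])
            i = j + 1
        elif smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens and return (inputs, labels).

    labels holds the original token at masked positions and None elsewhere,
    so a training loss would be computed only on the masked positions.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_mask = max(1, round(mask_prob * len(tokens)))  # always mask at least one
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, labels


if __name__ == "__main__":
    # Illustrative strings only: a small molecule and a polymer repeat unit
    # written with [*] marking the connection points.
    for smiles in ("C[C@H](Cl)Br", "[*]CC[*]"):
        inputs, labels = mask_tokens(tokenize(smiles), rng=random.Random(0))
        print(smiles, inputs, labels)
```

Because the same tokenizer and masking step apply to small molecules and to polymer repeat units alike, the sketch also illustrates why a model pre-trained on one can plausibly be fine-tuned on the other: both are sequences over a largely shared vocabulary. The second stage would replace the masked-token targets with labeled property values.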