Transfer Learning for Molecular Property Predictions from Small Data Sets

Thorren Kirschbaum, Annika Bande
2024-10-13
Abstract: Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small data sets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small data sets, for which the best results are obtained with the message passing neural network PaiNN, as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large data sets to pre-train the respective models and yields more accurate models after fine-tuning on the original data sets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both data sets are normalized to mean zero and standard deviation one to align the labels' distributions. This study covers two small chemistry data sets: the Harvard Organic Photovoltaics data set (HOPV, HOMO-LUMO gaps), for which excellent results are obtained, and the Freesolv data set (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV data set, the final training results do not improve monotonically with the size of the pre-training data set; instead, pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
Machine Learning, Chemical Physics
What problem does this paper attempt to address?
The paper asks how to improve the accuracy of machine learning models that predict molecular properties from small data sets. Machine learning can replace expensive experiments or quantum-chemical calculations, for example in high-throughput screening, but many studies have only small data sets available, which makes it difficult to train powerful deep learning architectures such as message passing neural networks (MPNNs) effectively. The paper approaches the problem in two steps:

1. **Benchmarking**: The authors benchmark common machine learning models for molecular property prediction on two small data sets, the Harvard Organic Photovoltaics (HOPV) data set and the Freesolv data set. HOPV contains photovoltaic properties of 350 organic molecules, and Freesolv contains solvation energies of 643 small organic molecules.
2. **Transfer learning strategy**: To improve predictions further, the authors pre-train the model on a large data set and then fine-tune it on the original small data set. The pre-training labels come from computationally cheap ab initio or semi-empirical methods, and both data sets are normalized to mean 0 and standard deviation 1 to align the label distributions.

For the HOPV data set, transfer learning gives excellent results. For the Freesolv data set it is less successful, probably because the underlying learning task is complex and the pre-training and fine-tuning labels are obtained with dissimilar methods. The study also finds that, for HOPV, accuracy after fine-tuning does not improve monotonically with the size of the pre-training data set: pre-training on fewer data points can produce a more biased pre-trained model and still yield higher accuracy after fine-tuning.
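
The label normalization mentioned above can be illustrated with a short sketch. It is not the authors' code: the function names `standardize` and `destandardize` and the example label values are hypothetical, and it assumes that each data set is standardized with its own mean and standard deviation, which is one plausible reading of "both data sets are normalized to mean zero and standard deviation one".

```python
import numpy as np


def standardize(labels):
    """Z-score a 1-D array of labels; return the scaled labels, mean, and std."""
    labels = np.asarray(labels, dtype=float)
    mean = labels.mean()
    std = labels.std()  # population standard deviation (ddof=0)
    if std == 0.0:
        raise ValueError("labels are constant; cannot standardize")
    return (labels - mean) / std, mean, std


def destandardize(scaled, mean, std):
    """Map standardized predictions back to the original label units."""
    return np.asarray(scaled, dtype=float) * std + mean


# Hypothetical label values, for illustration only.
# Pre-training labels: many molecules, labels from a cheap method.
pretrain_labels = np.array([1.8, 2.4, 3.1, 2.9, 2.2, 3.6])
# Fine-tuning labels: few molecules, labels from the target method.
finetune_labels = np.array([2.0, 2.7, 3.3])

# Assumption: each data set is scaled with its own statistics, so both
# label distributions end up with mean 0 and standard deviation 1.
z_pre, mu_pre, sd_pre = standardize(pretrain_labels)
z_fine, mu_fine, sd_fine = standardize(finetune_labels)

# After fine-tuning, model outputs would be converted back with the
# fine-tuning statistics; here the round trip is shown on the labels.
recovered = destandardize(z_fine, mu_fine, sd_fine)
```

Scaling each set separately removes a systematic offset or scale difference between the cheap pre-training labels and the target labels, so the pre-trained model starts fine-tuning on targets with the same overall distribution. Predictions have to be mapped back with the fine-tuning statistics before they are compared with reference values.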