Abstract: Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small data sets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small data sets, for which the best results are obtained with the message passing neural network PaiNN, as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large data sets to pre-train the respective models and yields more accurate models after fine-tuning on the original data sets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both data sets are normalized to mean zero and standard deviation one to align the labels' distributions. This study covers two small chemistry data sets: the Harvard Organic Photovoltaics data set (HOPV, HOMO-LUMO gaps), for which excellent results are obtained, and the Freesolv data set (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV data set, the final training results do not improve monotonically with the size of the pre-training data set; instead, pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.

What problem does this paper attempt to address?

The paper asks how to improve the accuracy of machine learning models that predict molecular properties when only small data sets are available. Machine learning can bypass expensive experiments or quantum-chemical calculations, for example in high-throughput screening, but many studies rely on small data sets, which makes it difficult to train powerful deep learning architectures such as message passing neural networks (MPNNs) effectively. The paper approaches the problem in two steps:

1. **Benchmark Testing**: The authors benchmark common machine learning models for molecular property prediction on two small data sets, the Harvard Organic Photovoltaics (HOPV) data set and the Freesolv data set. The HOPV data set contains the photovoltaic properties of 350 organic molecules, while the Freesolv data set contains the solvation energies of 643 small organic molecules.
2. **Transfer Learning Strategy**: To improve predictions further, the authors propose a transfer learning strategy that pre-trains the model on a large data set and then fine-tunes it on the original small data set. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both data sets are normalized to a mean of 0 and a standard deviation of 1 to align the label distributions.

For the HOPV data set, the transfer learning strategy gives excellent results. For the Freesolv data set it is less successful, probably because the underlying learning task is complex and because the pre-training and fine-tuning labels are obtained with dissimilar methods. The study also finds that, for HOPV, accuracy after fine-tuning does not improve monotonically with the size of the pre-training data set: pre-training on fewer data points can produce a more biased pre-trained model and yet a more accurate model after fine-tuning.
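The label alignment and the pre-train/fine-tune order described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the toy features and labels are made up, a one-dimensional linear model fitted by gradient descent stands in for PaiNN, and the helper names (`standardize`, `fit`, `mse`) are hypothetical rather than taken from the paper.

```python
from statistics import fmean, pstdev


def standardize(labels):
    """Scale labels to mean 0 and standard deviation 1.

    Returns the scaled labels and the (mean, std) needed to undo the scaling.
    """
    mu = fmean(labels)
    sigma = pstdev(labels)
    if sigma == 0:
        raise ValueError("labels have zero variance")
    return [(y - mu) / sigma for y in labels], (mu, sigma)


def unstandardize(values, stats):
    """Map standardized values back to the original label scale."""
    mu, sigma = stats
    return [v * sigma + mu for v in values]


def mse(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    return fmean((w * x + b - y) ** 2 for x, y in zip(xs, ys))


def fit(xs, ys, w=0.0, b=0.0, lr=0.1, epochs=500):
    """Full-batch gradient descent, starting from the given (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: a larger set with cheap labels, a smaller target set.
x_pre = [0.0, 0.25, 0.5, 0.75, 1.0]
y_pre = [1.0, 2.0, 3.0, 4.0, 5.0]      # stand-in for cheap computed labels
x_fine = [0.0, 0.5, 1.0]
y_fine = [10.0, 12.0, 14.0]            # stand-in for the small data set labels

# Each data set is standardized separately so the label distributions align.
z_pre, stats_pre = standardize(y_pre)
z_fine, stats_fine = standardize(y_fine)

# Pre-train from scratch, then fine-tune starting from the pre-trained weights.
w_pre, b_pre = fit(x_pre, z_pre)
w_ft, b_ft = fit(x_fine, z_fine, w=w_pre, b=b_pre)

# Predictions are mapped back to the original units of the small data set.
predictions = unstandardize([w_ft * x + b_ft for x in x_fine], stats_fine)
```

Standardizing each data set with its own statistics is what lets labels computed at different levels of theory share one output scale, so the fine-tuning step starts from weights that are already in the right range.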

Transfer Learning for Molecular Property Predictions from Small Data Sets

Fast and Effective Molecular Property Prediction with Transferability Map

Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting

Understanding the Limitations of Deep Models for Molecular Property Prediction: Insights and Solutions.

Scalable Multi-Task Transfer Learning for Molecular Property Prediction

Transfer learning based on atomic feature extraction for the prediction of experimental ¹³C chemical shifts

Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT

Transferable Multilevel Attention Neural Network for Accurate Prediction of Quantum Chemistry Properties via Multitask Learning

Transfer learning for chemically accurate interatomic neural network potentials

Advanced deep learning methods for molecular property prediction

Transfer learning on large datasets for the accurate prediction of material properties

Transferring chemical and energetic knowledge between molecular systems with machine learning

Transfer learning with graph neural networks for optoelectronic properties of conjugated oligomers

Transferability of Atom-Based Neural Networks

Building Chemical Property Models for Energetic Materials from Small Datasets Using a Transfer Learning Approach

Transfer learning for small molecule retention predictions

From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction

Transferring a molecular foundation model for polymer property predictions

Improving neural network predictions of material properties with limited data using transfer learning

Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction

Supervised Pretraining for Molecular Force Fields and Properties Prediction