Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction

Johan Broberg, Maria Bånkestad, Erik Ylipää
DOI: https://doi.org/10.48550/arXiv.2207.02724
2022-07-06
Abstract: Molecular property prediction is essential in chemistry, especially for drug discovery applications. However, available molecular property data is often limited, encouraging the transfer of information from related data. Transfer learning has had a tremendous impact in fields like Computer Vision and Natural Language Processing, signaling its potential for molecular property prediction. We present a pre-training procedure for molecular representation learning using reaction data and use it to pre-train a SMILES Transformer. We fine-tune and evaluate the pre-trained model on 12 molecular property prediction tasks from MoleculeNet within physical chemistry, biophysics, and physiology and show a statistically significant positive effect on 5 of the 12 tasks compared to a non-pre-trained baseline model.
Machine Learning, Chemical Physics, Biomolecules
What problem does this paper attempt to address?
This paper addresses data scarcity in molecular property prediction. Molecular property prediction is crucial in chemistry, especially in drug discovery, but the available property data is usually limited, which restricts how well models train and how well they generalize. To address this, the authors propose a pre-training strategy based on reaction prediction: a Transformer learns molecular representations from reaction data and is then applied to molecular property prediction tasks.

### Main problems

1. **Data scarcity**: The labeled data available for molecular property prediction is usually insufficient, which makes models difficult to train.
2. **Application of transfer learning**: Although transfer learning has been highly successful in fields such as computer vision and natural language processing, its application in chemistry is still immature.

### Solutions

The authors propose a new pre-training method that uses chemical reaction data to pre-train a SMILES Transformer. The procedure has two stages:

- **Pre-training stage**: The model takes the reactants and reagents of a chemical reaction as input and generates the SMILES representation of the product molecules. In this way, the model can learn the relationships and transformation rules between molecules.
- **Fine-tuning stage**: The pre-trained model is fine-tuned on 12 molecular property prediction tasks from MoleculeNet and compared with a non-pre-trained baseline model.

### Experimental results

The authors ran experiments on 12 molecular property prediction tasks spanning physical chemistry, biophysics, and physiology. The pre-trained model showed a statistically significant improvement over the non-pre-trained baseline on 5 of the 12 tasks.
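To make the pre-training stage more concrete, the following minimal Python sketch shows how one reaction record could be turned into an input/target pair for a sequence-to-sequence model. It assumes the reactions are stored as standard reaction SMILES (`reactants>reagents>products`) and that reactants and reagents are simply joined with `.` on the input side. The function name `reaction_to_pair`, the joining scheme, and the example reaction are illustrative assumptions, not details taken from the paper.

```python
from typing import Tuple


def reaction_to_pair(reaction_smiles: str) -> Tuple[str, str]:
    """Split a reaction SMILES into (model input, model target).

    Assumes the standard 'reactants>reagents>products' layout.
    Joining reactants and reagents with '.' is an assumption made
    for this sketch, not necessarily the paper's exact input format.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {reaction_smiles!r}")
    reactants, reagents, products = parts
    # The reagent field may be empty (e.g. 'A.B>>C'), so skip empty pieces.
    source = ".".join(piece for piece in (reactants, reagents) if piece)
    return source, products


# Illustrative example (not from the paper's dataset):
# acetic acid + ethanol, sulfuric acid as reagent, ethyl acetate as product.
example = "CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC"
source, target = reaction_to_pair(example)
print(source)  # CC(=O)O.OCC.OS(=O)(=O)O
print(target)  # CC(=O)OCC
```

In a setup like this, `source` would be tokenized and fed to the encoder while `target` is what the decoder learns to generate. For fine-tuning, only the molecule of interest would be encoded and a property prediction head would replace the decoder.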
### Formula representation

Some of the formulas in the paper, written in Markdown format:

- The aggregated reaction vector \( h_R \):

  \[ h_R = \text{Aggregate}(H) = \text{Mean}(\text{Sum}(H)) \]

- The significance level after Bonferroni correction, with \( m = 12 \) tasks:

  \[ \alpha_1 = \alpha_2 = \cdots = \alpha_{12} = \frac{\alpha}{m} = \frac{0.05}{12} \approx 0.00417 \]

With these results, the authors demonstrate the potential of a pre-training strategy based on reaction prediction for molecular property prediction and provide a useful reference for further research.
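The Bonferroni-corrected threshold above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `bonferroni_threshold` and `significant_tasks` and the p-values in the example are hypothetical and are not results from the paper.

```python
from typing import Dict, List


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level when m hypotheses are tested together."""
    if m <= 0:
        raise ValueError("m must be a positive number of tests")
    return alpha / m


def significant_tasks(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return the tasks whose p-value falls below the corrected threshold.

    The number of tests m is taken to be the number of entries in p_values.
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [task for task, p in p_values.items() if p < threshold]


# The setting described in the summary: alpha = 0.05 spread over 12 tasks.
print(round(bonferroni_threshold(0.05, 12), 5))  # 0.00417

# Hypothetical p-values for illustration only (not taken from the paper).
made_up = {"task_a": 0.001, "task_b": 0.02, "task_c": 0.3}
print(significant_tasks(made_up))  # ['task_a']
```

Note that `task_b` would count as significant at an uncorrected level of 0.05 but not after the correction, which is why the corrected threshold matters when 12 tasks are compared at once.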