Abstract: Molecular property prediction is essential in chemistry, especially for drug discovery applications. However, available molecular property data is often limited, encouraging the transfer of information from related data. Transfer learning has had a tremendous impact in fields like Computer Vision and Natural Language Processing, signaling its potential in molecular property prediction. We present a pre-training procedure for molecular representation learning using reaction data and use it to pre-train a SMILES Transformer. We fine-tune and evaluate the pre-trained model on 12 molecular property prediction tasks from MoleculeNet within physical chemistry, biophysics, and physiology and show a statistically significant positive effect on 5 of the 12 tasks compared to a non-pre-trained baseline model.

What problem does this paper attempt to address?

This paper addresses data scarcity in molecular property prediction. Molecular property prediction is crucial in chemistry, especially in drug discovery, but the available molecular property data is usually limited, which restricts how well a model can be trained and how well it generalizes. To address this, the authors propose a pre-training strategy based on reaction prediction: a Transformer learns molecular representations from reaction data and is then applied to molecular property prediction tasks.

### Main problems

1. **Data scarcity**: The amount of labeled data available for molecular property prediction is usually insufficient, which makes model training difficult.
2. **Application of transfer learning**: Although transfer learning has achieved great success in fields such as computer vision and natural language processing, its application in chemistry is still immature.

### Solutions

The authors propose a new pre-training method that uses chemical reaction data to pre-train a SMILES Transformer. The steps are as follows:

- **Pre-training stage**: Use chemical reaction data (reactants and reagents) as input and generate the SMILES representation of the product molecules. In this way, the model can learn the relationships and transformation rules between molecules.
- **Fine-tuning stage**: Fine-tune the pre-trained model on 12 molecular property prediction tasks from MoleculeNet and compare it with a non-pre-trained baseline model.

### Experimental results

The authors ran experiments on 12 molecular property prediction tasks. The pre-trained model showed a statistically significant positive effect over the non-pre-trained baseline on 5 of the 12 tasks. The tasks span physical chemistry, biophysics, and physiology.
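The pre-training stage described above amounts to turning each reaction into a (source, target) sequence pair. The following is a minimal sketch, assuming the reactions are stored as standard reaction SMILES of the form `reactants>reagents>products`. The function name `reaction_to_pair` is hypothetical, and joining reactants and reagents with `.` is an assumption for illustration, not necessarily the input format used in the paper.

```python
from typing import Tuple


def reaction_to_pair(reaction_smiles: str) -> Tuple[str, str]:
    """Split a reaction SMILES into a (source, target) pair.

    Assumed input format: 'reactants>reagents>products'.
    The reagents field may be empty, as in 'reactants>>products'.
    """
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>reagents>products', got: {reaction_smiles!r}")
    reactants, reagents, products = parts
    # Assumption: the model input is reactants and reagents merged into one
    # dot-separated SMILES string; empty fields are skipped.
    source = ".".join(field for field in (reactants, reagents) if field)
    # The target is the SMILES of the product molecule(s).
    return source, products


if __name__ == "__main__":
    # Illustrative esterification with an acid catalyst written as a reagent.
    src, tgt = reaction_to_pair("CC(=O)O.OCC>[H+]>CC(=O)OCC.O")
    print("source:", src)
    print("target:", tgt)
```

A sequence-to-sequence Transformer would then be trained to generate `target` from `source`, and its encoder reused during fine-tuning.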
### Formula representation

Some formulas from the paper, written in Markdown format:

- The aggregated reaction vector \( h_R \):

\[ h_R = \text{Aggregate}(H) = \text{Mean}(\text{Sum}(H)) \]

- The significance level for each of the \( m = 12 \) tasks after Bonferroni correction:

\[ \alpha_1 = \alpha_2 = \cdots = \alpha_{12} = \frac{\alpha}{m} = \frac{0.05}{12} \approx 0.00417 \]

With this method, the authors demonstrate the potential of a pre-training strategy based on reaction prediction for molecular property prediction and provide a useful reference for further research.
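The Bonferroni-corrected threshold above can be reproduced with a few lines of Python. This is a small sketch, not the authors' evaluation code: the helper names `bonferroni_alpha` and `significant_tasks` are hypothetical, and the p-values in the usage example are made-up placeholders, not results from the paper.

```python
from typing import Dict, List


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test significance level when m hypotheses are tested at family-wise level alpha."""
    if m <= 0:
        raise ValueError("m must be a positive number of tests")
    return alpha / m


def significant_tasks(p_values: Dict[str, float], alpha: float = 0.05) -> List[str]:
    """Return the tasks whose p-value falls below the Bonferroni-corrected threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [task for task, p in p_values.items() if p < threshold]


if __name__ == "__main__":
    # Threshold for the 12 tasks: 0.05 / 12
    print(round(bonferroni_alpha(0.05, 12), 5))

    # Placeholder p-values for three hypothetical tasks (threshold is 0.05 / 3 here).
    example = {"task_a": 0.001, "task_b": 0.01, "task_c": 0.03}
    print(significant_tasks(example))
```

Dividing the family-wise level by the number of tests is conservative: it keeps the probability of at least one false positive across all 12 comparisons at or below 0.05.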

Pre-training Transformers for Molecular Property Prediction Using Reaction Prediction
