Triple Generative Self-Supervised Learning Method for Molecular Property Prediction

Lei Xu, Leiming Xia, Shourun Pan, Zhen Li
DOI: https://doi.org/10.3390/ijms25073794
IF: 5.6
2024-03-29
International Journal of Molecular Sciences
Abstract: Molecular property prediction is an important task in drug discovery, and with the help of self-supervised learning methods, its performance can be improved by utilizing large-scale unlabeled datasets. In this paper, we propose a triple generative self-supervised learning method for molecular property prediction, called TGSS. Three encoders, a bi-directional long short-term memory recurrent neural network (BiLSTM), a Transformer, and a graph attention network (GAT), are pre-trained on molecular sequence and graph structure data to extract molecular features. A variational autoencoder (VAE) is used to reconstruct features from the three models. In the downstream task, to balance the information between different molecular features, a feature fusion module is added to assign different weights to each feature. In addition, to improve the interpretability of the model, atomic similarity heat maps are introduced to demonstrate the effectiveness and rationality of molecular feature extraction. We demonstrate the accuracy of the proposed method on chemical and biological benchmark datasets through comparative experiments.
biochemistry & molecular biology,chemistry, multidisciplinary
What problem does this paper attempt to address?
The paper addresses several key issues in molecular property prediction, particularly in the context of drug discovery. Its main objectives are:

1. **Integrating multiple molecular representation methods**: Existing self-supervised learning methods typically use either molecular sequence information or graph structure information. The proposed method (TGSS) combines molecular sequence and topological structure information to increase the diversity of molecular representations.
2. **Introducing multiple models for self-supervised learning**: Unlike prior methods that use only one or two models for self-supervised learning, TGSS uses three (BiLSTM, Transformer, and GAT) during the pre-training phase. These models are trained with a generative self-supervised learning approach to improve the accuracy and generalization ability of the final feature representations.
3. **Designing an effective feature fusion mechanism**: To integrate information from the different models, TGSS uses an attention module that fuses the output features of the three pre-trained models and assigns a different weight to each feature, so that critical information is not lost.

With these components, TGSS aims to improve performance on molecular property prediction tasks. The paper reports that it outperforms existing supervised and self-supervised learning methods on multiple benchmark datasets, particularly on classification tasks. The paper also conducts ablation experiments to evaluate the impact of different factors on model performance, and improves model interpretability through feature visualization.
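
The feature fusion idea in point 3 can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each encoder yields a fixed-length feature vector, scores each vector with a dot product against a hypothetical learned `query` vector, and uses a softmax to turn the scores into fusion weights. The names `fuse_features` and `query`, and all numeric values, are illustrative assumptions.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, query):
    """Attention-style weighted sum of per-encoder feature vectors.

    features: list of equal-length vectors, one per encoder.
    query:    hypothetical learned scoring vector (an assumption of this
              sketch, not something specified in the summary above).
    Returns (fused_vector, weights).
    """
    # One scalar relevance score per encoder output (dot product with query).
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum across encoders, computed dimension by dimension.
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


# Toy stand-ins for the BiLSTM, Transformer, and GAT outputs (made-up values).
bilstm_feat = [0.2, 0.1, 0.4]
transformer_feat = [0.5, 0.3, 0.1]
gat_feat = [0.1, 0.9, 0.2]
query = [1.0, 0.5, -0.5]

fused, weights = fuse_features([bilstm_feat, transformer_feat, gat_feat], query)
print("weights:", weights)
print("fused:", fused)
```

Because the weights come from a softmax, every encoder contributes something to the fused vector, and the fused vector stays within the per-dimension range of the inputs. In a trained model the scoring parameters would be learned jointly with the downstream predictor rather than fixed by hand.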