MolPROP: Molecular Property prediction with multimodal language and graph fusion

Zachary A. Rollins, Alan C. Cheng, Essam Metwally
DOI: https://doi.org/10.1186/s13321-024-00846-9
2024-05-24
Journal of Cheminformatics
Abstract: Pretrained deep learning models self-supervised on large datasets of language, image, and graph representations are often fine-tuned on downstream tasks and have demonstrated remarkable adaptability in a variety of applications including chatbots, autonomous driving, and protein folding. Additional research aims to improve performance on downstream tasks by fusing high-dimensional data representations across multiple modalities. In this work, we explore a novel fusion of a pretrained language model, ChemBERTa-2, with graph neural networks for the task of molecular property prediction. We benchmark the MolPROP suite of models on seven scaffold-split MoleculeNet datasets and compare with state-of-the-art architectures. We find that (1) multimodal property prediction for small molecules can match or significantly outperform modern architectures on hydration free energy (FreeSolv), experimental water solubility (ESOL), lipophilicity (Lipo), and clinical toxicity (ClinTox) tasks, (2) the MolPROP multimodal fusion is predominantly beneficial on regression tasks, (3) the ChemBERTa-2 masked language model (MLM) pretraining task outperformed the multitask regression (MTR) pretraining task when fused with graph neural networks for multimodal property prediction, and (4) despite improvements from multimodal fusion on regression tasks, MolPROP significantly underperforms on some classification tasks. MolPROP has been made available at https://github.com/merck/MolPROP.
chemistry, multidisciplinary, computer science, interdisciplinary applications, information systems
What problem does this paper attempt to address?
The paper explores a multimodal fusion method that combines a pretrained language model with graph neural networks to predict small-molecule properties. The researchers developed MolPROP, a suite of models that integrates ChemBERTa-2 (a pretrained language model) with graph neural networks, including graph convolutional networks (GCN) and graph attention networks (GATv2), with the aim of improving prediction accuracy. The main contributions of the paper can be summarized as follows:

1. **Multimodal Fusion**: The authors explore a novel approach that combines a pretrained language model (ChemBERTa-2) with graph neural networks for a supervised task, molecular property prediction. This fusion significantly improves performance on certain regression tasks and leaves room to explore different fusion strategies for multimodal classification tasks.
2. **Performance Evaluation**: MolPROP was benchmarked on seven MoleculeNet datasets, spanning regression tasks (such as hydration free energy, experimental water solubility, and lipophilicity) and classification tasks (such as inhibition of human β-secretase, blood-brain barrier permeability, and clinical toxicity). MolPROP matches or surpasses modern architectures on the regression tasks. Its classification results are mixed: it performs well on clinical toxicity prediction but underperforms on some other classification tasks.
3. **Key Findings**:
   - Multimodal property prediction for small-molecule regression tasks can match or significantly surpass modern architectures.
   - The fusion of language and graph models is mainly beneficial for regression tasks.
   - When fused with graph neural networks, the masked language model (MLM) pretraining task of ChemBERTa-2 performs better than the multitask regression (MTR) pretraining task.
   - Despite improvements on regression tasks, MolPROP significantly underperforms on some classification tasks.

In summary, this study demonstrates the potential of combining language models and graph neural networks for small-molecule property prediction, with good results on regression tasks, while also revealing the challenges that remain on classification tasks.
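
To make the idea of language and graph fusion concrete, here is a minimal pure-Python sketch. It is not MolPROP's implementation: the summary above does not say how the two representations are combined, so this sketch assumes simple concatenation (late fusion) of a pooled token embedding and a pooled graph embedding, followed by a linear regression head. All function names (`mean_pool`, `graph_embedding`, `fuse`, `predict`) and the toy inputs are hypothetical, and a single round of mean neighbor aggregation stands in for a real GCN or GATv2 layer.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def graph_embedding(node_feats: List[Vector], bonds: List[Tuple[int, int]]) -> Vector:
    """One round of mean neighbor aggregation (with self-loops), then a mean readout.

    This is a stand-in for a trained graph neural network layer, not GCN or GATv2 itself.
    """
    neighbors: Dict[int, List[int]] = {i: [i] for i in range(len(node_feats))}
    for a, b in bonds:  # bonds are treated as undirected edges
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = [
        mean_pool([node_feats[j] for j in neighbors[i]])
        for i in range(len(node_feats))
    ]
    return mean_pool(updated)


def fuse(token_embs: List[Vector], node_feats: List[Vector],
         bonds: List[Tuple[int, int]]) -> Vector:
    """Assumed late fusion: concatenate the language and graph embeddings."""
    return mean_pool(token_embs) + graph_embedding(node_feats, bonds)


def predict(fused: Vector, weights: Vector, bias: float) -> float:
    """Linear regression head over the fused representation."""
    return sum(w * x for w, x in zip(weights, fused)) + bias


# Toy example for ethanol (SMILES "CCO"); every number below is made up.
token_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # pretend language-model outputs
node_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one-hot [is_carbon, is_oxygen]
bonds = [(0, 1), (1, 2)]                           # C-C and C-O

fused = fuse(token_embs, node_feats, bonds)
print(len(fused), predict(fused, [1.0, 1.0, 1.0, 1.0], 0.0))
```

In a real setting the token embeddings would come from a pretrained model such as ChemBERTa-2, and the aggregation step and linear head would be learned layers trained end to end on the property labels.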