Explainable Molecular Property Prediction: Aligning Chemical Concepts with Predictions via Language Models

Zhenzhong Wang, Zehui Lin, Wanyu Lin, Ming Yang, Minggang Zeng, Kay Chen Tan
2024-10-02
Abstract: Providing explainable molecular property predictions is critical for many scientific domains, such as drug discovery and material science. Though transformer-based language models have shown great potential in accurate molecular property prediction, they neither provide chemically meaningful explanations nor faithfully reveal the molecular structure-property relationships. In this work, we develop a framework for explainable molecular property prediction based on language models, dubbed Lamole, which can provide chemical concepts-aligned explanations. We take a string-based molecular representation -- Group SELFIES -- as input tokens to pretrain and fine-tune our Lamole, as it provides chemically meaningful semantics. By disentangling the information flows of Lamole, we propose combining self-attention weights and gradients for better quantification of each chemically meaningful substructure's impact on the model's output. To make the explanations more faithfully respect the structure-property relationship, we then carefully craft a marginal loss to explicitly optimize the explanations to be able to align with the chemists' annotations. We bridge the manifold hypothesis with the elaborated marginal loss to prove that the loss can align the explanations with the tangent space of the data manifold, leading to concept-aligned explanations. Experimental results over six mutagenicity datasets and one hepatotoxicity dataset demonstrate Lamole can achieve comparable classification accuracy and boost the explanation accuracy by up to 14.3%, being the state-of-the-art in explainable molecular property prediction.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the issue of interpretability in molecular property prediction. Although Transformer-based language models show great potential in accurately predicting molecular properties, they neither provide chemically meaningful explanations nor faithfully reveal the relationship between molecular structure and properties. Specifically, existing methods have the following shortcomings:

1. **Molecular Representation**: Common molecular representation methods (such as SMILES) fail to explicitly encode chemically meaningful substructures, so existing interpretability methods can only highlight individual atoms and bonds as explanations.
2. **Interpretability Techniques**: Existing interpretability methods have two main limitations:
   - They cannot effectively capture the interactions between functional groups within the molecular structure.
   - The generated explanations do not align with chemists' intuition, thus failing to faithfully reflect the structure-property relationship.

To address these issues, the authors propose Lamole, a language model-based framework for interpretable molecular property prediction. The framework takes Group SELFIES strings as input to provide chemically aligned explanations. By decoupling the information flow of the Transformer, Lamole combines self-attention weights and gradients to better quantify the impact of each chemically meaningful substructure on the model output. Additionally, the authors design a marginal loss function to align the explanations with chemists' annotations, thereby improving the accuracy of the explanations.

### Main Contributions

1. **Chemically Meaningful Explanations**: By using Group SELFIES to pre-train and fine-tune the language model, Lamole can more easily understand chemically meaningful semantics and generate more accurate explanations by decoupling the information flow.
2. **Improved Explanation Accuracy**: By designing a marginal loss function, Lamole can significantly improve the accuracy of explanations, increasing explanation accuracy by 5% with only a small number of molecules with ground-truth annotations.
3. **Theoretical Analysis**: The authors are the first to link the manifold hypothesis with interpretable molecular property prediction, theoretically proving that the designed marginal loss function can align explanations with the data manifold, respecting the structure-property relationship.

### Experimental Results

Experimental results show that Lamole achieves comparable classification accuracy on six mutagenicity datasets and one hepatotoxicity dataset and improves explanation accuracy by up to 14.3%. Compared to alternative baselines, Lamole's explanation rationality improves by up to 9%. Extensive experimental studies demonstrate that Lamole achieves state-of-the-art performance in interpretable molecular property prediction.
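
The two mechanisms summarized above (scoring substructures from attention weights and gradients, then pushing those scores toward chemists' annotations with a margin-based loss) can be illustrated with a minimal Python sketch. This is not Lamole's actual formulation, which the summary does not spell out. It assumes a generic attention-times-gradient relevance read from a class-token row, plus a generic hinge-style ranking loss. The function names, the `cls_index` convention, the `margin` parameter, and the toy token labels are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def attention_gradient_scores(
    attn: List[Matrix], grad: List[Matrix], cls_index: int = 0
) -> List[float]:
    """Token relevance from one layer (illustrative, not the paper's formula).

    attn[h][i][j]: attention weight from token i to token j in head h.
    grad[h][i][j]: gradient of the model output w.r.t. that attention weight.
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])
    relevance = [0.0] * n_tokens
    for h in range(n_heads):
        for j in range(n_tokens):
            # Keep only positive attention * gradient evidence, averaged over heads.
            product = attn[h][cls_index][j] * grad[h][cls_index][j]
            relevance[j] += max(0.0, product) / n_heads
    total = sum(relevance)
    # Normalize so the scores sum to 1 (left unchanged if all are zero).
    return [r / total for r in relevance] if total > 0 else relevance


def margin_alignment_loss(
    scores: List[float], annotated: List[bool], margin: float = 0.1
) -> float:
    """Hinge-style penalty: annotated tokens should outscore the rest by `margin`."""
    pos = [s for s, a in zip(scores, annotated) if a]
    neg = [s for s, a in zip(scores, annotated) if not a]
    if not pos or not neg:
        return 0.0  # nothing to rank against
    penalties = [max(0.0, margin - (p - n)) for p in pos for n in neg]
    return sum(penalties) / len(penalties)


if __name__ == "__main__":
    # Hypothetical substructure-level tokens (stand-ins for Group SELFIES tokens).
    tokens = ["<cls>", "benzene", "nitro", "methyl"]
    attn = [[[0.25] * 4 for _ in range(4)]]               # one head, uniform attention
    grad = [[[0.0, 1.0, 3.0, -2.0] for _ in range(4)]]    # made-up gradients
    scores = attention_gradient_scores(attn, grad)
    annotated = [False, False, True, False]               # toy annotation: nitro group
    for token, score in zip(tokens, scores):
        print(f"{token}: {score:.2f}")
    print("loss:", margin_alignment_loss(scores, annotated, margin=1.0))
```

In a real pipeline the attention weights and gradients would come from the fine-tuned language model, and a loss of this kind would be added to the classification objective only for the small set of molecules that carry ground-truth annotations.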