Abstract: Providing explainable molecular property predictions is critical for many scientific domains, such as drug discovery and material science. Though transformer-based language models have shown great potential in accurate molecular property prediction, they neither provide chemically meaningful explanations nor faithfully reveal the molecular structure-property relationships. In this work, we develop a framework for explainable molecular property prediction based on language models, dubbed as Lamole, which can provide chemical concepts-aligned explanations. We take a string-based molecular representation -- Group SELFIES -- as input tokens to pretrain and fine-tune our Lamole, as it provides chemically meaningful semantics. By disentangling the information flows of Lamole, we propose combining self-attention weights and gradients for better quantification of each chemically meaningful substructure's impact on the model's output. To make the explanations more faithfully respect the structure-property relationship, we then carefully craft a marginal loss to explicitly optimize the explanations to be able to align with the chemists' annotations. We bridge the manifold hypothesis with the elaborated marginal loss to prove that the loss can align the explanations with the tangent space of the data manifold, leading to concept-aligned explanations. Experimental results over six mutagenicity datasets and one hepatotoxicity dataset demonstrate Lamole can achieve comparable classification accuracy and boost the explanation accuracy by up to 14.3%, being the state-of-the-art in explainable molecular property prediction.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issue of interpretability in molecular property prediction. Although Transformer-based language models show great potential in accurately predicting molecular properties, they neither provide chemically meaningful explanations nor faithfully reveal the relationship between molecular structure and properties. Specifically, existing methods have the following shortcomings:

1. **Molecular Representation**: Common molecular representation methods (such as SMILES) fail to explicitly encode chemically meaningful substructures, so existing interpretability methods can only highlight individual atoms and bonds as explanations.
2. **Interpretability Techniques**: Existing interpretability methods have two main limitations:
   - They cannot effectively capture the interactions between functional groups within the molecular structure.
   - The generated explanations do not align with chemists' intuition, thus failing to faithfully reflect the structure-property relationship.

To address these issues, the authors propose Lamole, a language model-based framework for interpretable molecular property prediction. This framework uses Group SELFIES strings as input to provide chemically aligned explanations. By decoupling the information flow of the Transformer, Lamole combines self-attention weights and gradients to better quantify the impact of each chemically meaningful substructure on the model output. Additionally, the authors design a marginal loss function to align the explanations with chemists' annotations, thereby improving the accuracy of the explanations.

### Main Contributions

1. **Chemically Meaningful Explanations**: By using Group SELFIES to pre-train and fine-tune the language model, Lamole can more easily understand chemically meaningful semantics and generate more accurate explanations by decoupling the information flow.
2. **Improved Explanation Accuracy**: By designing a marginal loss function, Lamole can significantly improve the accuracy of explanations, increasing explanation accuracy by 5% with only a small number of molecules with ground-truth annotations.
3. **Theoretical Analysis**: The authors are the first to link the manifold hypothesis with interpretable molecular property prediction, theoretically proving that the designed marginal loss function can align explanations with the data manifold, respecting the structure-property relationship.

### Experimental Results

Experimental results show that Lamole achieves comparable classification accuracy on six mutagenicity and one hepatotoxicity datasets and improves explanation accuracy by up to 14.3%. Compared to alternative baselines, Lamole's explanation rationality improves by up to 9%. Extensive experimental studies demonstrate that Lamole achieves state-of-the-art performance in interpretable molecular property prediction.
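
### Illustrative Sketch

The two mechanisms named in the summary, scoring substructure tokens from self-attention weights and gradients, and a marginal loss that pulls those scores toward chemists' annotations, can be illustrated with a small pure-Python sketch. This is not the paper's formulation, which the summary does not spell out. The function names, the choice to read scores from the first token's attention row (assumed here to be a classification token), the clipping of negative contributions, and the pairwise hinge form of the loss are all assumptions made for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]  # one attention head: tokens x tokens


def attention_gradient_scores(
    attn_heads: Sequence[Matrix],
    grad_heads: Sequence[Matrix],
    query_index: int = 0,
) -> List[float]:
    """Score each input token by attention weight times gradient.

    Hypothetical helper: grad_heads holds the gradient of the model output
    with respect to each attention weight. Only the row of query_index
    (assumed to be the classification token) is used.
    """
    n_tokens = len(attn_heads[0])
    scores = [0.0] * n_tokens
    for attn, grad in zip(attn_heads, grad_heads):
        for j in range(n_tokens):
            # Keep only positive contributions (an assumption of this sketch).
            scores[j] += max(attn[query_index][j] * grad[query_index][j], 0.0)
    # Average over heads, then normalize so the scores sum to 1 when possible.
    scores = [s / len(attn_heads) for s in scores]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


def margin_alignment_loss(
    scores: Sequence[float],
    annotated: Sequence[bool],
    margin: float = 0.1,
) -> float:
    """Pairwise hinge loss (assumed form of a margin-based alignment loss).

    Every annotated token should outscore every unannotated token by at
    least `margin`; pairs that violate this contribute to the loss.
    """
    pos = [s for s, a in zip(scores, annotated) if a]
    neg = [s for s, a in zip(scores, annotated) if not a]
    if not pos or not neg:
        return 0.0
    hinge = [max(margin - (p - n), 0.0) for p in pos for n in neg]
    return sum(hinge) / len(hinge)


# Toy example: one head, three tokens (e.g. three Group SELFIES tokens).
attn = [[[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]]
grad = [[[0.0, 2.0, -1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
scores = attention_gradient_scores(attn, grad)

# Loss is zero when the highlighted token matches the annotation,
# and positive when the annotation points at a token the model ignored.
loss_aligned = margin_alignment_loss(scores, [False, True, False])
loss_misaligned = margin_alignment_loss(scores, [True, False, False])
```

In a real pipeline the attention weights and their gradients would come from the fine-tuned model rather than hand-written lists, and the loss would be added to the classification objective only for the molecules that have ground-truth annotations.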

Explainable Molecular Property Prediction: Aligning Chemical Concepts with Predictions via Language Models
