Abstract: Pre-training machine learning models on molecular properties has proven effective for generating robust and generalizable representations, which is critical for advancements in drug discovery and materials science. While recent work has primarily focused on data-driven approaches, the KANO model introduces a novel paradigm by incorporating knowledge-enhanced pre-training. In this work, we expand upon KANO by integrating the large-scale ChEBI knowledge graph, which includes 2,840 functional groups, significantly more than the original 82 used in KANO. We explore two approaches, Replace and Integrate, to incorporate this extensive knowledge into the KANO framework. Our results demonstrate that including ChEBI leads to improved performance on 9 out of 14 molecular property prediction datasets. This highlights the importance of utilizing a larger and more diverse set of functional groups to enhance molecular representations for property predictions. Code: http://github.com/Yasir-Ghunaim/KANO-ChEBI

What problem does this paper attempt to address?

The paper asks how integrating a large-scale knowledge graph, such as ChEBI, can improve molecular property prediction. Specifically, it aims to overcome the limitations of existing methods, which rely mainly on data-driven approaches and neglect scientific knowledge, and to improve the quality of molecular representations by introducing a broader and more diverse set of functional groups, thereby improving molecular property prediction in drug discovery and materials science.

### Specific background of the problem

1. **Limitations of existing methods**:
   - Most existing molecular property prediction models are data-driven. These methods are effective, but their generalization ability is limited by specific datasets.
   - KANO introduced a knowledge-enhanced pre-training paradigm, but it uses a relatively small set of functional groups (only 82), which limits the diversity of chemical structures.
2. **Introducing a larger-scale knowledge graph**:
   - The ChEBI knowledge graph contains 2,840 functional groups, far more than the 82 used in KANO.
   - The paper hypothesizes that integrating a larger and more diverse set of functional groups can further improve the chemical diversity and prediction performance of molecular representations.

### Solutions

To achieve this goal, the paper proposes two methods for integrating the ChEBI knowledge graph into the KANO framework:

1. **Replace**: Remove the original functional group sub-graph and replace it with the ChEBI functional group sub-graph.
2. **Integrate**: Add the ChEBI functional group sub-graph without removing the original data.

The paper evaluates both methods on multiple benchmark datasets and analyzes their improvements over the original KANO model.
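The difference between Replace and Integrate can be illustrated with a minimal sketch that treats a knowledge graph as a set of (head, relation, tail) triples. This is not the authors' implementation: the function names, the triple representation, and all toy triples below are hypothetical and chosen only for illustration.

```python
from typing import Set, Tuple

# Assumption: a knowledge graph is modeled as a set of (head, relation, tail) triples.
Triple = Tuple[str, str, str]


def replace_functional_groups(
    kg: Set[Triple], original_fg: Set[Triple], chebi_fg: Set[Triple]
) -> Set[Triple]:
    """Replace: drop the original functional-group sub-graph, then add the ChEBI one."""
    return (kg - original_fg) | chebi_fg


def integrate_functional_groups(kg: Set[Triple], chebi_fg: Set[Triple]) -> Set[Triple]:
    """Integrate: keep the original graph intact and add the ChEBI sub-graph."""
    return kg | chebi_fg


# Hypothetical toy triples, not taken from KANO or ChEBI.
other_triples: Set[Triple] = {("C", "hasState", "solid")}
original_fg: Set[Triple] = {
    ("hydroxyl", "containsElement", "O"),
    ("carboxyl", "containsElement", "C"),
}
chebi_fg: Set[Triple] = {
    ("hydroxy group", "containsElement", "O"),
    ("sulfo group", "containsElement", "S"),
}
base_kg = other_triples | original_fg

replaced = replace_functional_groups(base_kg, original_fg, chebi_fg)
integrated = integrate_functional_groups(base_kg, chebi_fg)

print(len(replaced), len(integrated))
```

In this toy setting, Replace yields a graph with no original functional-group triples, while Integrate yields a superset of the base graph. Which variant helps more on a given dataset is an empirical question that the paper addresses experimentally.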
### Experimental results

- **Classification tasks**: The KANO variants using ChEBI functional groups improve performance on 6 of 8 classification datasets.
- **Regression tasks**: Integrating the ChEBI functional groups yields slight improvements on the ESOL, FreeSolv, and QM8 datasets, but no significant improvements on Lipophilicity, QM7, and QM9.

### Conclusions

The paper shows that integrating a larger-scale knowledge graph can significantly improve molecular property prediction on certain tasks. It also notes that the method is sensitive to task type and emphasizes the need for further research.
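The per-task counts in the experimental results can be cross-checked against the abstract's headline figure of 9 out of 14 datasets. The short tally below uses only the counts stated in this summary; counting the "slight improvements" on ESOL, FreeSolv, and QM8 as improvements is an assumption made here to reconcile the two numbers.

```python
# Counts taken from the summary; "slight improvements" are counted as improvements (assumption).
improved = {
    "classification": 6,  # 6 of 8 classification datasets
    "regression": 3,      # ESOL, FreeSolv, QM8
}
total = {
    "classification": 8,
    "regression": 6,      # the three above plus Lipophilicity, QM7, QM9
}

improved_all = sum(improved.values())
total_all = sum(total.values())

print(f"{improved_all} of {total_all} datasets improved")  # prints "9 of 14 datasets improved"
```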

Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction
