A Knowledge-Driven Self-Supervised Approach for Molecular Generation

Maotao Liu, Yifan Yang, Qun Liu, Li Liu, Guoyin Wang
DOI: https://doi.org/10.1109/TCBB.2024.3406600
2024-05-28
Abstract: Owing to the success of Graph Neural Networks (GNNs) in numerous fields, growing research interest has been devoted to applying GNNs to molecular learning tasks. A molecular structure can be naturally represented as a graph in which atoms and bonds correspond to nodes and edges, respectively. However, atoms are not haphazardly stacked together but combined into various spatial geometries. Meanwhile, since chemical reactions mainly occur in substructures such as functional groups, substructures play a decisive role in a molecule's properties. Therefore, directly applying GNNs to molecular representation learning could ignore the molecular spatial structure and the substructure properties, which in turn degrades the performance of downstream tasks. In this paper, we propose the Knowledge-Driven Self-Supervised Model for Molecular Representation Learning (KSMRL) to address the above problems. KSMRL consists of two major pathways: (1) the Spatial Information (SI) based pathway, which preserves the spatial information of the molecular structure, and (2) the Subgraph Constraint (SC) based pathway, which retains the properties of substructures in the molecular representation. In this manner, both atomic-level and substructure-level information can be included in modeling. According to experimental results on multiple datasets, the proposed KSMRL can generate discriminative molecular representations. In molecular generation tasks, KSMRL combined with Autoregressive Flow (AF) models or Discrete Flow (DF) models outperforms the state-of-the-art baselines on all datasets. In addition, we demonstrate the effectiveness of KSMRL with property optimization experiments. To indicate its ability to predict specified potential Drug-Target Interactions (DTIs), a case study discriminating the interactions between molecules generated by KSMRL and targets is also given.