Abstract: Owing to the success of Graph Neural Networks (GNNs) in numerous fields, growing research interest has been devoted to applying GNNs to molecular learning tasks. A molecular structure can be naturally represented as a graph in which atoms and bonds correspond to nodes and edges, respectively. However, atoms are not haphazardly stacked together but are combined into various spatial geometries. Moreover, since chemical reactions mainly occur at substructures such as functional groups, substructures play a decisive role in a molecule's properties. Directly applying GNNs to molecular representation learning can therefore ignore both the molecular spatial structure and the substructure properties, which in turn degrades performance on downstream tasks. In this paper, we propose the Knowledge-Driven Self-Supervised Model for Molecular Representation Learning (KSMRL) to address these problems. KSMRL consists of two major pathways: (1) the Spatial Information (SI) based pathway, which preserves the spatial information of the molecular structure, and (2) the Subgraph Constraint (SC) based pathway, which retains the properties of substructures in the molecular representation. In this manner, both atomic-level and substructure-level information are included in the model. Experimental results on multiple datasets show that the proposed KSMRL can generate discriminative molecular representations. In molecular generation tasks, KSMRL combined with Autoregressive Flow (AF) models or Discrete Flow (DF) models outperforms the state-of-the-art baselines on all datasets. In addition, we demonstrate the effectiveness of KSMRL with property optimization experiments. To illustrate its ability to predict specified potential Drug-Target Interactions (DTIs), we also give a case study that discriminates the interactions between molecules generated by KSMRL and targets.
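The graph view of a molecule described in the abstract, with atoms as nodes, bonds as edges, and a functional group as a small subgraph pattern, can be illustrated with a minimal Python sketch. This is not the KSMRL model and not the authors' code: the `MolGraph` class, the `find_carbonyls` helper, and the choice of acetic acid (heavy atoms only, hydrogens omitted) are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Hypothetical toy container: atoms are nodes, bonds are edges."""
    atoms: list                                 # element symbol per node, e.g. ["C", "O"]
    bonds: dict = field(default_factory=dict)   # (i, j) with i < j -> bond order

    def add_bond(self, i, j, order=1):
        # Store each undirected edge once, keyed by the sorted index pair.
        self.bonds[(min(i, j), max(i, j))] = order

    def neighbors(self, i):
        # All atoms that share a bond with atom i.
        return [b if a == i else a for (a, b) in self.bonds if i in (a, b)]


def find_carbonyls(mol):
    """Return (carbon, oxygen) index pairs joined by a double bond.

    A C=O group is used here as a stand-in for the 'substructure'
    (functional group) mentioned in the abstract.
    """
    hits = []
    for (a, b), order in mol.bonds.items():
        if order == 2 and {mol.atoms[a], mol.atoms[b]} == {"C", "O"}:
            hits.append((a, b) if mol.atoms[a] == "C" else (b, a))
    return hits


# Acetic acid, heavy atoms only: CH3-C(=O)-OH
acetic = MolGraph(atoms=["C", "C", "O", "O"])
acetic.add_bond(0, 1)            # methyl carbon to carboxyl carbon
acetic.add_bond(1, 2, order=2)   # C=O
acetic.add_bond(1, 3)            # C-OH

carbonyls = find_carbonyls(acetic)
```

A plain graph like this carries no 3D coordinates and no explicit marking of which atoms form a functional group, which is the gap the abstract attributes to directly applying GNNs; the SI and SC pathways are described as addressing those two omissions respectively.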
