AdaMR: Adaptable Molecular Representation for Unified Pre-training Strategy

Yan Ding, Hao Cheng, Ziliang Ye, Ruyi Feng, Wei Tian, Peng Xie, Juan Zhang, Zhongze Gu

2024-04-27

Abstract: We propose Adaptable Molecular Representation (AdaMR), a new large-scale unified pre-training strategy for small-molecule drugs. Unlike recent large-scale molecular models, AdaMR uses a granularity-adjustable molecular encoding strategy, enabled by a pre-training task termed molecular canonicalization. This adjustable granularity enriches the model's learning at multiple levels and improves its performance in multi-task scenarios. Specifically, the substructure-level molecular representation preserves information about specific atom groups or arrangements that influence chemical properties and functions, which is advantageous for tasks such as property prediction. Meanwhile, the atomic-level representation, combined with the generative molecular canonicalization pre-training task, enhances validity, novelty, and uniqueness in generative tasks. Together, these features give AdaMR strong performance on a range of downstream tasks. We fine-tuned the pre-trained model on six molecular property prediction tasks (MoleculeNet datasets) and two generative tasks (ZINC250K dataset), achieving state-of-the-art (SOTA) results on five of the eight tasks.

Biomolecules, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses encoding deficiencies in existing computational models of small-molecule drugs, in particular the information lost when molecular structures are turned into model inputs. Existing large-scale molecular models typically represent small molecules at a single, fixed encoding granularity, which leads to inconsistent performance across downstream tasks: a model may do well on generation tasks but poorly on property prediction, or vice versa. Existing models also often overlook the synonymy of SMILES (Simplified Molecular Input Line Entry System) strings, that is, the fact that one molecule can be written as several equivalent SMILES strings. Ethanol, for example, can be written as CCO, OCC, or C(O)C. Ignoring this limits how well a model learns the intrinsic relationships within molecular sequences.

To address these issues, the paper proposes AdaMR (Adaptable Molecular Representation), a unified large-scale pre-training strategy that uses adjustable molecular encoding granularity to enrich the model's learning and improve its performance in multi-task scenarios. AdaMR introduces a molecular canonicalization pre-training task, which helps the model learn the relationships inherent in molecular sequences and improves its performance on generation tasks. The pre-trained model is evaluated on six molecular property prediction tasks (MoleculeNet datasets) and two generation tasks (ZINC250K dataset), and achieves state-of-the-art (SOTA) results on five of the eight tasks.
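To make the idea of encoding granularity concrete, the sketch below tokenizes one SMILES string at two levels. It is only an illustration under stated assumptions, not the authors' tokenizer: the atom-level regex is a simplified, commonly used SMILES pattern, while `tokenize_substructure_level` and its small `vocab` of fragments are hypothetical and work purely by string matching, with no chemical validation.

```python
import re
from typing import List, Set

# Simplified atom-level SMILES pattern: bracket atoms, two-letter halogens,
# two-digit ring closures, organic-subset atoms, bond/branch symbols, digits.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFIbcnosp]|[=#\-\+\\/\(\)\.:~@\?\*\$]|\d)"
)


def tokenize_atom_level(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = ATOM_PATTERN.findall(smiles)
    # Fail loudly if the pattern skipped any character.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def tokenize_substructure_level(smiles: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match merge of atom tokens into fragments from `vocab`.

    `vocab` is a hypothetical fragment vocabulary; matching is string-based.
    """
    atoms = tokenize_atom_level(smiles)
    # Longest fragment in the vocabulary, measured in atom-level tokens.
    max_span = max((len(tokenize_atom_level(v)) for v in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(atoms):
        # Try the longest candidate span first, down to two tokens.
        for span in range(min(max_span, len(atoms) - i), 1, -1):
            candidate = "".join(atoms[i:i + span])
            if candidate in vocab:
                tokens.append(candidate)
                i += span
                break
        else:
            # No fragment matched: fall back to the single atom-level token.
            tokens.append(atoms[i])
            i += 1
    return tokens


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
fragments = {"C(=O)O", "c1ccccc1"}  # hypothetical substructure vocabulary

fine = tokenize_atom_level(aspirin)
coarse = tokenize_substructure_level(aspirin, fragments)
print(len(fine), fine)      # 21 atom-level tokens
print(len(coarse), coarse)  # 4 tokens: C, C(=O)O, c1ccccc1, C(=O)O
```

The same molecule yields a long sequence of fine-grained tokens or a short sequence of fragment tokens. This is the kind of trade-off the summary describes, where coarser tokens keep functional groups intact and finer tokens give a generative model full control over each atom.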
