Abstract: Self-supervised learning has recently gained growing interest in molecular modeling for scientific tasks such as AI-assisted drug discovery. Current studies consider leveraging both 2D and 3D molecular structures for representation learning. However, relying on straightforward alignment strategies that treat each modality separately, these methods fail to exploit the intrinsic correlation between 2D and 3D representations that reflect the underlying structural characteristics of molecules, and only perform coarse-grained molecule-level alignment. To derive fine-grained alignment and promote structural molecule understanding, we introduce an atomic-relation level "blend-then-predict" self-supervised learning approach, MoleBLEND, which first blends atom relations represented by different modalities into one unified relation matrix for joint encoding, then recovers modality-specific information for 2D and 3D structures individually. By treating atom relationships as anchors, MoleBLEND organically aligns and integrates visually dissimilar 2D and 3D modalities of the same molecule at fine-grained atomic level, painting a more comprehensive depiction of each molecule. Extensive experiments show that MoleBLEND achieves state-of-the-art performance across major 2D/3D molecular benchmarks. We further provide theoretical insights from the perspective of mutual-information maximization, demonstrating that our method unifies contrastive, generative (cross-modality prediction) and mask-then-predict (single-modality prediction) objectives into one single cohesive framework.

What problem does this paper attempt to address?

The paper asks how to exploit the intrinsic correlation between 2D and 3D molecular structures so that the two modalities are aligned at a fine-grained atomic level rather than only at the molecule level. Existing methods typically use straightforward alignment strategies that treat the 2D and 3D modalities separately. As a result, they align molecules only at a coarse, molecule-level granularity and fail to exploit the underlying structural characteristics that both modalities reflect.

To address this, the paper proposes MoleBLEND, a "blend-then-predict" self-supervised learning approach that operates at the atomic-relation level. It first blends the atom relations represented by different modalities into one unified relation matrix for joint encoding, and then recovers the modality-specific information for the 2D and 3D structures individually. By treating atom relationships as anchors, the method aligns and integrates the 2D and 3D modalities of the same molecule at a fine-grained atomic level, giving a more comprehensive description of each molecule.

Specifically, MoleBLEND consists of two main steps:

1. **Modality Blending Encoding**: Blending the atom relations from different modalities into one unified relation matrix that serves as the model input.
2. **Modality Target Prediction**: Recovering the original 2D and 3D input information from the blended representation.

In this way, MoleBLEND aligns the 2D and 3D modalities at the level of individual atom relations, which improves structural understanding of molecules and yields state-of-the-art performance across major 2D/3D molecular benchmarks.

Additionally, the paper provides theoretical insights from the perspective of mutual-information maximization, showing that the method unifies contrastive, generative (cross-modality prediction), and mask-then-predict (single-modality prediction) objectives in a single framework.
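The blending step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation, and several details are assumptions made for illustration only: the 2D relation is taken to be a shortest-path (bond hop) matrix, the 3D relation a Euclidean distance matrix, and each atom pair is assigned to one modality by an independent coin flip with probability `p_2d`. The function name `blend_relations` and the toy molecule are hypothetical. A real model would presumably embed each relation type before mixing, since hop counts and distances have different units; the raw values are mixed here only to show the per-pair selection.

```python
import numpy as np


def blend_relations(rel_2d, rel_3d, p_2d=0.5, rng=None):
    """Blend two (n, n) atom-relation matrices into one unified matrix.

    For every unordered atom pair, one modality is chosen at random
    (assumption: independent coin flip with probability p_2d for 2D).
    Returns the blended matrix and a boolean mask that is True where
    the entry was taken from the 2D relation.
    """
    rng = np.random.default_rng() if rng is None else rng
    rel_2d = np.asarray(rel_2d, dtype=float)
    rel_3d = np.asarray(rel_3d, dtype=float)
    if rel_2d.ndim != 2 or rel_2d.shape[0] != rel_2d.shape[1]:
        raise ValueError("rel_2d must be a square matrix")
    if rel_2d.shape != rel_3d.shape:
        raise ValueError("rel_2d and rel_3d must have the same shape")

    n = rel_2d.shape[0]
    # Sample on the upper triangle (diagonal included), then mirror it so
    # that pairs (i, j) and (j, i) always come from the same modality.
    upper = np.triu(rng.random((n, n)) < p_2d)
    from_2d = upper | upper.T
    blended = np.where(from_2d, rel_2d, rel_3d)
    return blended, from_2d


# Toy 3-atom chain (hypothetical): hop counts as the 2D relation,
# pairwise Euclidean distances from made-up coordinates as the 3D relation.
rel_2d = np.array([[0, 1, 2],
                   [1, 0, 1],
                   [2, 1, 0]], dtype=float)
coords = np.array([[0.0, 0.0, 0.0],
                   [0.96, 0.0, 0.0],
                   [1.2, 0.93, 0.0]])
rel_3d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

blended, from_2d = blend_relations(rel_2d, rel_3d, rng=np.random.default_rng(0))

# Prediction targets for the second step: the full, unblended matrices.
# The encoder sees only `blended` and is trained to recover both.
targets = {"2d": rel_2d, "3d": rel_3d}
```

In the second step, a model that receives only `blended` would be trained to reconstruct both entries of `targets`. The mask `from_2d` records which entries of each modality were hidden from the input, so it could be used to weight or restrict the reconstruction loss; whether the original method does so is not stated in the summary above.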

Multimodal Molecular Pretraining via Modality Blending

Multimodal Fusion with Relational Learning for Molecular Property Prediction

Improving Molecular Pretraining with Complementary Featurizations

MolBind: Multimodal Alignment of Language, Molecules, and Proteins

Complementary multi-modality molecular self-supervised learning via non-overlapping masking for property prediction

MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining

MoleMCL: a multi-level contrastive learning framework for molecular pre-training

Multilingual Molecular Representation Learning via Contrastive Pre-training

Pretraining Graph Transformer for Molecular Representation with Fusion of Multimodal Information

MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures

Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph

MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning

Unified 2D and 3D Pre-Training of Molecular Representations

Cross‐Modal Graph Contrastive Learning with Cellular Images

Bidirectional generation of structure and properties through a single molecular foundation model

Integrating Chemical Language and Molecular Graph in Multimodal Fused Deep Learning for Drug Property Prediction

MMCL-CPI: A multi-modal compound-protein interaction prediction model incorporating contrastive learning pre-training