3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Rui Yan

2024-06-09

Abstract: The integration of molecule and language has garnered increasing attention in molecular science. Recent advancements in Language Models (LMs) have demonstrated potential for the comprehensive modeling of molecule and language. However, existing works exhibit notable limitations. Most overlook the modeling of 3D information, which is crucial for understanding molecular structures and functions. While some attempts have been made to leverage external structure encoding modules to inject 3D molecular information into LMs, obvious difficulties hinder the integration of molecular structure and language text, such as modality alignment and separate tuning. To bridge this gap, we propose 3D-MolT5, a unified framework designed to model both 1D molecular sequence and 3D molecular structure. The key innovation lies in our methodology for mapping fine-grained 3D substructure representations (based on 3D molecular fingerprints) to a specialized 3D token vocabulary for 3D-MolT5. This 3D structure token vocabulary enables the seamless combination of 1D sequence and 3D structure representations in a tokenized format, allowing 3D-MolT5 to encode molecular sequence (SELFIES), molecular structure, and text sequences within a unified architecture. In addition, we introduce 1D and 3D joint pre-training to enhance the model's comprehension of these diverse modalities in a joint representation space and to generalize better to various tasks for our foundation model. Through instruction tuning on multiple downstream datasets, our proposed 3D-MolT5 outperforms existing methods in molecular property prediction, molecule captioning, and text-based molecule generation tasks. Our code will be available on GitHub soon.

Biomolecules; Artificial Intelligence; Computational Engineering, Finance, and Science; Computation and Language; Machine Learning

What problem does this paper attempt to address?

This paper addresses the problem of integrating molecule and language modeling in molecular science. Most existing methods overlook three-dimensional (3D) information, which is crucial for understanding molecular structure and function, and methods that inject 3D information through external structure encoders face difficulties such as modality alignment and separate tuning. To address this, the paper proposes 3D-MolT5, a unified framework that models 1D molecular sequences (SELFIES), 3D molecular structures, and text within a single architecture. 3D-MolT5 uses the 3D molecular fingerprint algorithm E3FP to map fine-grained 3D substructure representations to a dedicated 3D token vocabulary, enabling seamless integration of 1D sequence and 3D structure representations in tokenized form. The model is also jointly pre-trained on 1D and 3D inputs to improve its understanding of the different modalities and its generalization across tasks. After instruction tuning on multiple downstream datasets, 3D-MolT5 is reported to outperform existing methods in molecular property prediction, molecule captioning, and text-based molecule generation.
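The tokenization idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary size, the `<3d_k>` token format, the modulo folding step, and the function names are all assumptions made for illustration, and the integer identifiers are hypothetical stand-ins for the hashed substructure identifiers a 3D fingerprint such as E3FP would produce.

```python
from typing import Dict, List

# Assumed size of the 3D token vocabulary (illustrative, not taken from the paper).
VOCAB_SIZE = 4096


def fold_identifier(identifier: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Fold a hashed substructure identifier into a fixed vocabulary index."""
    # Python's % returns a non-negative result for a positive modulus,
    # so signed identifiers are handled as well.
    return identifier % vocab_size


def to_3d_tokens(atom_identifiers: Dict[int, List[int]],
                 vocab_size: int = VOCAB_SIZE) -> Dict[int, List[str]]:
    """Map each atom's substructure identifiers to hypothetical 3D token strings."""
    tokens: Dict[int, List[str]] = {}
    for atom_index, identifiers in sorted(atom_identifiers.items()):
        tokens[atom_index] = [
            f"<3d_{fold_identifier(i, vocab_size)}>" for i in identifiers
        ]
    return tokens


# Hypothetical identifiers: atom index -> substructure identifiers centered on that atom.
example = {0: [5000, 8200], 1: [4096, 12]}
tokens = to_3d_tokens(example)
print(tokens)
```

In this sketch every atom carries a short list of discrete 3D tokens, which is what would let structure information sit alongside 1D sequence tokens in a single tokenized input. How the actual model aligns and combines the two token streams is not specified in the summary above.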

Towards 3D Molecule-Text Interpretation in Language Models

Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations

3D-Transformer: Molecular Representation with Transformer in 3D Space

Sculpting Molecules in Text-3D Space: A Flexible Substructure Aware Framework for Text-Oriented Molecular Optimization

UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

MolTC: Towards Molecular Relational Modeling In Language Models

Translation between Molecules and Natural Language

Uni-Mol: A Universal 3D Molecular Representation Learning Framework

Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Token-Mol 1.0: Tokenized drug design with large language model

MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning

3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information