Abstract: Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information in complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in property prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.

What problem does this paper attempt to address?

The paper addresses multimodal data fusion and processing in molecular science, particularly in applications such as drug discovery, molecular property prediction, and molecule generation. The researchers developed a multimodal large language model named GIT-Mol, which integrates molecular graph structures, images, and textual information to improve performance on a range of molecular science tasks. At the core of GIT-Mol is GIT-Former, a novel architecture that aligns data from the different modalities (molecular graphs, images, and text) into a unified latent space through self-attention and cross-attention mechanisms. This design improves the model's ability to process and integrate multimodal data and is intended to address the scalability challenges of molecular representation and generation models.

The main contributions of the paper are as follows:

1. The development of GIT-Mol, a multimodal large language model built specifically for molecular science. It covers the field's three main modalities (graphs, images, and text) and performs well on molecule generation, molecule captioning, molecular image recognition, and molecular property prediction.
2. The introduction of GIT-Former, a modality mixer with a cross-attention mechanism that fuses the three modalities at the molecular level while remaining flexible and scalable.
3. Experimental results showing that, compared to baseline models, GIT-Mol improves validity in molecule generation by 20.2% and accuracy in molecular property prediction by 5%-10%.

In summary, GIT-Mol aims to leverage a large amount of unlabeled multimodal data and, through its cross-modal fusion capabilities, provide a powerful tool for applications in molecular science, especially in terms of speed, accuracy, and scalability.
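
The idea of aligning several modalities into one latent space with cross-attention can be illustrated with a small sketch. This is not the authors' implementation of GIT-Former: the shared query vectors, the single attention head, the dimensions, the random weights, and all names below are assumptions made for illustration, and the self-attention step mentioned above is omitted. It only shows how inputs of different lengths and widths can each be mapped to an output of the same shape.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or encoder output."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attention(queries, features, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries:  (n_queries, d_latent) vectors shared across modalities
    features: (n_tokens, d_modality) output of one modality encoder
    Returns the attended output (n_queries, d_latent) and the attention weights.
    """
    q = matmul(queries, w_q)           # (n_queries, d_latent)
    k = matmul(features, w_k)          # (n_tokens, d_latent)
    v = matmul(features, w_v)          # (n_tokens, d_latent)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))   # (n_queries, n_tokens)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
D_LATENT = 8     # hypothetical width of the shared latent space
N_QUERIES = 4    # hypothetical number of query vectors

queries = random_matrix(N_QUERIES, D_LATENT, rng)
w_q = random_matrix(D_LATENT, D_LATENT, rng)

# Hypothetical (token count, feature width) for each modality encoder's output.
modalities = {"graph": (5, 6), "image": (9, 12), "text": (7, 10)}

latents = {}
attention = {}
for name, (n_tokens, d_mod) in modalities.items():
    features = random_matrix(n_tokens, d_mod, rng)
    w_k = random_matrix(d_mod, D_LATENT, rng)
    w_v = random_matrix(d_mod, D_LATENT, rng)
    latents[name], attention[name] = cross_attention(queries, features, w_q, w_k, w_v)

for name, latent in latents.items():
    print(name, len(latent), "x", len(latent[0]))
```

Each modality ends up as an `N_QUERIES` by `D_LATENT` matrix regardless of its input size, which is what makes the representations directly comparable. In a trained model the weights would be learned so that matching graph, image, and text inputs land close together in that space.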

GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

The Future of Molecular Studies Through the Lens of Large Language Models

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Large language model for molecular chemistry

Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations

G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

Can Large Language Models Understand Molecules?

MolTC: Towards Molecular Relational Modeling In Language Models

MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures

MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter

Pretraining Graph Transformer for Molecular Representation with Fusion of Multimodal Information

Can Large Language Models Empower Molecular Property Prediction?

MolScribe: Robust Molecular Structure Recognition with Image-To-Graph Generation

Towards 3D Molecule-Text Interpretation in Language Models