GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

Pengfei Liu, Yiming Ren, Jun Tao, Zhixiang Ren
DOI: https://doi.org/10.1016/j.compbiomed.2024.108073
2024-02-06
Abstract: Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information with complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.
Machine Learning, Computation and Language, Biomolecules
What problem does this paper attempt to address?
The paper addresses multimodal data fusion and processing in molecular science, particularly for applications such as drug discovery, molecular property prediction, and molecular generation. The researchers developed GIT-Mol, a multimodal large language model that integrates molecular graph structures, images, and text to improve performance across these tasks. At its core is GIT-Former, a novel architecture that aligns the different modalities (molecular graphs, images, and text) into a unified latent space through self-attention and cross-attention mechanisms. According to the summary, this design improves the processing and integration of multimodal data and addresses the scalability challenges of molecular representation and generation models.

The main contributions of the paper are as follows:
1. GIT-Mol, a multimodal large language model built specifically for molecular science. It covers the field's three main modalities (graphs, images, and text) and performs well on molecular generation, molecular description, molecular image recognition, and molecular property prediction.
2. GIT-Former, a modality mixer with a cross-attention mechanism that fuses the three modalities at the molecular level while remaining flexible and scalable.
3. Experimental results showing that, compared with baseline models, GIT-Mol improves molecule generation validity by 20.2% and molecular property prediction accuracy by 5%-10%.
In summary, GIT-Mol aims to leverage large amounts of unlabeled multimodal data and, through its cross-modal fusion capabilities, provide a tool for a range of applications in molecular science, with an emphasis on speed, accuracy, and scalability.
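
To make the idea of aligning modalities into a unified latent space more concrete, the following is a minimal sketch of cross-attention, not the authors' GIT-Former implementation. It assumes a small set of shared query vectors that attend over each modality's features, so that inputs of different lengths all map to outputs of the same shape. The names `cross_attention`, `random_matrix`, `NUM_QUERIES`, and `DIM`, the feature counts, and the random features are all hypothetical. The learned projection matrices and the self-attention among queries found in a real model are omitted for brevity.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def random_matrix(rows, cols, rng):
    """Stand-in for encoder outputs; a real model would use trained encoders."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8          # assumed shared embedding width
NUM_QUERIES = 4  # assumed number of shared query vectors

queries = random_matrix(NUM_QUERIES, DIM, rng)

# Each modality has a different number of feature vectors (all hypothetical).
modalities = {
    "graph": random_matrix(5, DIM, rng),  # e.g. node embeddings
    "image": random_matrix(9, DIM, rng),  # e.g. image patch embeddings
    "text": random_matrix(6, DIM, rng),   # e.g. token embeddings
}

# The same queries attend over every modality, so each one is summarized
# as a NUM_QUERIES x DIM matrix regardless of its input length.
latents = {
    name: cross_attention(queries, feats, feats)
    for name, feats in modalities.items()
}

for name, latent in latents.items():
    print(name, len(latent), "x", len(latent[0]))
```

Because every modality ends up with the same output shape, the resulting representations can be compared or passed to a shared language model. Making them semantically comparable would additionally require training, which this sketch does not cover.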