ChemDFM-X: Towards Large Multimodal Model for Chemistry

Zihan Zhao, Bo Chen, Jingpiao Li, Lu Chen, Liyang Wen, Pengyu Wang, Zichen Zhu, Danyang Zhang, Ziping Wan, Yansi Li, Zhongyang Dai, Xin Chen, Kai Yu
2024-09-20
Abstract: Rapid developments of AI tools are expected to offer unprecedented assistance to the research of natural science including chemistry. However, neither existing unimodal task-specific specialist models nor emerging general large multimodal models (LMM) can cover the wide range of chemical data modality and task categories. To address the real demands of chemists, a cross-modal Chemical General Intelligence (CGI) system, which serves as a truly practical and useful research assistant utilizing the great potential of LMMs, is in great need. In this work, we introduce the first Cross-modal Dialogue Foundation Model for Chemistry (ChemDFM-X). Diverse multimodal data are generated from an initial modality by approximate calculations and task-specific model predictions. This strategy creates sufficient chemical training corpora, while significantly reducing excessive expense, resulting in an instruction-tuning dataset containing 7.6M data. After instruction finetuning, ChemDFM-X is evaluated on extensive experiments of different chemical tasks with various data modalities. The results demonstrate the capacity of ChemDFM-X for multimodal and inter-modal knowledge comprehension. ChemDFM-X marks a significant milestone toward aligning all modalities in chemistry, a step closer to CGI.
Machine Learning, Computation and Language, Multimedia
What problem does this paper attempt to address?
This paper addresses a gap in AI tools for chemical research: neither existing single-modal, task-specific models nor emerging general-purpose large multimodal models (LMMs) cover the wide range of modalities and task categories found in chemical data. Chemical data spans multiple modalities, from text descriptions and molecular structures to images and spectra, and chemical tasks take various forms, such as property prediction and retrosynthetic analysis. Although single-modal specialist models can achieve state-of-the-art performance on their respective tasks, they cannot handle even slightly different tasks, nor the same task when the input modality changes slightly. This limits their practical utility as assistants in research and manufacturing. To meet the real needs of chemists, the authors argue for a cross-modal Chemical General Intelligence (CGI) system that uses the potential of LMMs to serve as a truly practical and useful research assistant. To this end, they propose ChemDFM-X, which they describe as the first cross-modal dialogue foundation model for chemistry. It aims to understand and interpret data from multiple chemical modalities and to complete multiple downstream tasks with a single set of model weights. According to the authors, experiments across different chemical tasks and data modalities demonstrate ChemDFM-X's capacity for multimodal and inter-modal knowledge comprehension, marking a step toward aligning all modalities in chemistry and toward CGI.