Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey

Qizhi Pei, Lijun Wu, Kaiyuan Gao, Jinhua Zhu, Yue Wang, Zun Wang, Tao Qin, Rui Yan

2024-03-05

Abstract: The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry, and biology. This approach leverages the rich, multifaceted descriptions of biomolecules contained in textual data sources to enhance our fundamental understanding and enable downstream computational tasks such as biomolecule property prediction. Fusing the nuanced narratives expressed in natural language with the structural and functional specifics of biomolecules described via various molecular modeling techniques opens new avenues for comprehensively representing and analyzing biomolecules. By incorporating the contextual language data that surrounds biomolecules into their modeling, BL aims to capture a holistic view that encompasses both the symbolic qualities conveyed through language and quantitative structural characteristics. In this review, we provide an extensive analysis of recent advancements achieved through cross modeling of biomolecules and natural language. (1) We begin by outlining the technical representations of biomolecules employed, including sequences, 2D graphs, and 3D structures. (2) We then examine in depth the rationale and key objectives underlying effective multi-modal integration of language and molecular data sources. (3) We subsequently survey the practical applications enabled to date in this developing research area. (4) We also compile and summarize the available resources and datasets to facilitate future work. (5) Looking ahead, we identify several promising research directions worthy of further exploration and investment to continue advancing the field. The related resources and contents are being updated at \url{

Computation and Language, Artificial Intelligence, Biomolecules

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Integration of Biomolecules and Natural Language**: The paper explores how to combine biomolecules (such as proteins and small molecules) with natural language processing techniques to improve the understanding of biomolecular properties and functions. Integrating the rich descriptions found in textual data with biomolecular structural information allows more comprehensive modeling.
2. **Multimodal Modeling**: The paper investigates how multimodal methods (covering sequences, 2D graphs, 3D structures, etc.) can improve the representation of biomolecules. Specifically, it covers models that combine language models (such as the GPT series) with other machine learning frameworks to handle multiple data sources together.
3. **Practical Applications**: The paper surveys practical applications of this cross-modal modeling, such as predicting biomolecular properties, generating molecular descriptions, and retrieving biomolecular data from text.
4. **Resources and Datasets**: The paper compiles existing resources and datasets to help researchers advance the field.
5. **Future Directions**: The paper points out open challenges and future research directions, such as improving the interpretability and generalization of models.

Through these efforts, the paper aims to give interdisciplinary researchers in biology, chemistry, and artificial intelligence a comprehensive understanding of current technologies and future potential.
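To make the pairing of molecular representations with text more concrete, the following is a minimal sketch of a record that holds a sequence view (a SMILES string), a 2D graph view (atoms and bonds), and a free-text description for one molecule. It is not taken from the paper: the `BiomoleculeRecord` class, its field names, and the hand-written description are all hypothetical and chosen only for illustration, and the graph omits hydrogens and bond orders.

```python
from dataclasses import dataclass


@dataclass
class BiomoleculeRecord:
    """Hypothetical container pairing one molecule with a text description."""

    name: str
    sequence: str                 # sequence view, here a SMILES string
    atoms: list[str]              # 2D graph nodes (heavy atoms only)
    bonds: list[tuple[int, int]]  # 2D graph edges as pairs of atom indices
    description: str              # natural-language view (hand-written here)

    def adjacency(self) -> dict[int, list[int]]:
        """Build an undirected adjacency list from the bond pairs."""
        adj: dict[int, list[int]] = {i: [] for i in range(len(self.atoms))}
        for a, b in self.bonds:
            adj[a].append(b)
            adj[b].append(a)
        return adj


# Ethanol: SMILES "CCO" corresponds to the heavy-atom chain C-C-O.
ethanol = BiomoleculeRecord(
    name="ethanol",
    sequence="CCO",
    atoms=["C", "C", "O"],
    bonds=[(0, 1), (1, 2)],
    description="Ethanol is a primary alcohol: an ethyl group bonded to a hydroxyl group.",
)

if __name__ == "__main__":
    print(ethanol.sequence)
    print(ethanol.adjacency())
```

A multimodal model in the sense of the survey would consume several of these views jointly; the sketch only shows how the views can sit side by side in one record, and a 3D view (atom coordinates) could be added as a further field.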

Bridging Text and Molecule: A Survey on Multimodal Frameworks for Molecule

InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions

Multimodal Large Language Models for Bioimage Analysis

MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations

Interactive Molecular Discovery with Natural Language

A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language

Large Language Models for Biomolecular Analysis: from Methods to Applications

Scientific Language Modeling: A Quantitative Review of Large Language Models in Molecular Science

AI for Biomedicine in the Era of Large Language Models

MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension

MolBind: Multimodal Alignment of Language, Molecules, and Proteins

Towards 3D Molecule-Text Interpretation in Language Models

A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals

BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations

Leveraging LLMs for Automated Analysis of Biomedical Data

Advances in Modeling of Biomolecular Interactions

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization

ChatMol: Interactive Molecular Discovery with Natural Language

A Survey for Large Language Models in Biomedicine