ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

Yijia Xiao, Edward Sun, Yiqiao Jin, Qifan Wang, Wei Wang

2024-08-21

Abstract: Understanding biological processes, drug development, and biotechnological advancements requires detailed analysis of protein structures and sequences, a task that is inherently complex and time-consuming when performed manually. To streamline this process, we introduce ProteinGPT, a state-of-the-art multimodal protein chat system that allows users to upload protein sequences and/or structures for comprehensive protein analysis and responsive inquiries. ProteinGPT integrates protein sequence and structure encoders with linear projection layers for representation adaptation, coupled with a large language model (LLM) that generates accurate and contextually relevant responses. To train ProteinGPT, we construct a large-scale dataset of 132,092 proteins with annotations and optimize the instruction-tuning process using GPT-4o. The system aligns the user-uploaded data with the accompanying prompts, simplifying protein analysis. Experiments show that ProteinGPT can produce promising responses to proteins and their corresponding questions.

Artificial Intelligence; Computational Engineering, Finance, and Science; Machine Learning; Biomolecules

What problem does this paper attempt to address?

The paper addresses the complexity and time cost of analyzing protein structures and sequences manually. It proposes ProteinGPT, a multimodal large language model system for protein property prediction and structure understanding. By integrating protein sequence and structure information, ProteinGPT supports interactive dialogue about a protein, which is intended to help with both protein understanding and protein design. The main objectives are:

1. **Multimodal Integration**: Combining protein sequence and structure data to extract more comprehensive information, including evolutionary information, functional sites, and sequence-structure relationships.
2. **Simplifying the Analysis Process**: Replacing traditional manual analysis workflows with an automated system, reducing reliance on tedious experiments and literature searches.
3. **Improving Accuracy**: Using large-scale datasets and instruction tuning so that the model can accurately understand and answer questions about proteins.
4. **Developing High-Quality Datasets**: Constructing a large-scale dataset, ProteinQA, containing 132,092 protein samples for model training and instruction tuning.

Through these efforts, ProteinGPT aims to provide more efficient and accurate support in fields such as biological research, drug discovery, and medical engineering.
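The abstract describes sequence and structure encoders whose outputs pass through linear projection layers before reaching the LLM. The sketch below illustrates that adaptation step only, under loud assumptions: the encoder outputs are random stand-ins, the dimensions (`SEQ_DIM`, `STRUCT_DIM`, `LLM_DIM`) and the token ordering (sequence, then structure, then prompt) are hypothetical, and the helper names are invented for illustration. It is not the authors' implementation.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a linear layer with small random weights."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def apply_linear(layer, vectors):
    """Apply y = W x + b to every vector in a list of vectors."""
    weights, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in vectors
    ]

def build_llm_input(seq_emb, struct_emb, prompt_emb, seq_proj, struct_proj):
    """Project both modalities to the LLM width and prepend them to the prompt.

    The ordering (sequence, structure, prompt) is an assumption of this sketch.
    """
    seq_tokens = apply_linear(seq_proj, seq_emb)
    struct_tokens = apply_linear(struct_proj, struct_emb)
    return seq_tokens + struct_tokens + prompt_emb

rng = random.Random(0)

# Hypothetical sizes, chosen small so the example runs instantly.
SEQ_DIM, STRUCT_DIM, LLM_DIM = 8, 6, 4
n_residues, n_prompt = 5, 3

# Stand-ins for encoder outputs: one embedding per residue for each modality.
seq_emb = [[rng.random() for _ in range(SEQ_DIM)] for _ in range(n_residues)]
struct_emb = [[rng.random() for _ in range(STRUCT_DIM)] for _ in range(n_residues)]
# Stand-in for the embedded text prompt, already at the LLM width.
prompt_emb = [[rng.random() for _ in range(LLM_DIM)] for _ in range(n_prompt)]

seq_proj = make_linear(SEQ_DIM, LLM_DIM, rng)
struct_proj = make_linear(STRUCT_DIM, LLM_DIM, rng)

llm_input = build_llm_input(seq_emb, struct_emb, prompt_emb, seq_proj, struct_proj)
print(len(llm_input), len(llm_input[0]))  # 13 token vectors, each of width 4
```

The point of the projection is that each modality can keep its own embedding width while the LLM sees a single sequence of vectors at its own width; in this toy case, 5 sequence tokens, 5 structure tokens, and 3 prompt tokens.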

ProtChatGPT: Towards Understanding Proteins with Large Language Models

Global-Context Aware Generative Protein Design

Protein Design with StructureGPT: a Deep Learning Model for Protein Structure-to-Sequence Translation

RNA-GPT: Multimodal Generative System for RNA Sequence Understanding

Multi-Modal Large Language Model Enables Protein Function Prediction

PB-GPT: An innovative GPT-based model for protein backbone generation

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding

CPE-Pro: A Structure-Sensitive Deep Learning Method for Protein Representation and Origin Evaluation

CPE-Pro: A Structure-Sensitive Deep Learning Model for Protein Representation and Origin Evaluation

Protein 3D Graph Structure Learning for Robust Structure-based Protein Property Prediction

Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision

GGN-GO: geometric graph networks for predicting protein function by multi-scale structure features

Unifying Sequences, Structures, and Descriptions for Any-to-Any Protein Generation with the Large Multimodal Model HelixProtX

GPSFun: geometry-aware protein sequence function predictions with language models

PROTGOAT: Improved automated protein function predictions using Protein Language Models

ProtGPT2 is a deep unsupervised language model for protein design

Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding

InstructPLM: Aligning Protein Language Models to Follow Protein Structure Instructions

OntoProtein: Protein Pretraining With Gene Ontology Embedding

Integration of molecular coarse-grained model into geometric representation learning framework for protein-protein complex property prediction