ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding

Yijia Xiao,Edward Sun,Yiqiao Jin,Qifan Wang,Wei Wang
2024-08-21
Abstract:Understanding biological processes, drug development, and biotechnological advancements requires detailed analysis of protein structures and sequences, a task in protein research that is inherently complex and time-consuming when performed manually. To streamline this process, we introduce ProteinGPT, a state-of-the-art multi-modal protein chat system, that allows users to upload protein sequences and/or structures for comprehensive protein analysis and responsive inquiries. ProteinGPT seamlessly integrates protein sequence and structure encoders with linear projection layers for precise representation adaptation, coupled with a large language model (LLM) to generate accurate and contextually relevant responses. To train ProteinGPT, we construct a large-scale dataset of 132,092 proteins with annotations, and optimize the instruction-tuning process using GPT-4o. This innovative system ensures accurate alignment between the user-uploaded data and prompts, simplifying protein analysis. Experiments show that ProteinGPT can produce promising responses to proteins and their corresponding questions.
Artificial Intelligence,Computational Engineering, Finance, and Science,Machine Learning,Biomolecules
What problem does this paper attempt to address?
The paper aims to address the complexity and time-consuming issues in protein structure and sequence analysis. Specifically, the paper proposes ProteinGPT, a multimodal large language model system for protein property prediction and structural understanding. By integrating protein sequence and structure information, ProteinGPT can facilitate interactive dialogues related to proteins, thereby significantly enhancing the understanding and design capabilities of proteins. The main objectives include: 1. **Multimodal Integration**: Combining protein sequence and structure data to extract more comprehensive information, including evolutionary information, functional sites, and sequence-structure relationships. 2. **Simplifying the Analysis Process**: Simplifying traditional manual analysis workflows through an automated system, reducing reliance on tedious experiments and literature searches. 3. **Improving Accuracy**: Utilizing large-scale datasets and advanced training methods to ensure the model can accurately understand and answer questions about proteins. 4. **Developing High-Quality Datasets**: Constructing a large-scale dataset, ProteinQA, containing 132,092 protein samples for model training and instruction tuning. Through these efforts, ProteinGPT can provide more efficient and accurate support in fields such as biological research, drug discovery, and medical engineering.