SeqDance: A Protein Language Model for Representing Protein Dynamic Properties

Chao Hou, Yufeng Shen
DOI: https://doi.org/10.1101/2024.10.11.617911
2024-10-15
Abstract: Proteins perform their functions by folding amino acid sequences into dynamic structural ensembles. Despite the important role of protein dynamics, their complexity and the absence of efficient representation methods have limited their integration into studies on protein function and mutation fitness, especially in deep learning applications. To address this, we present SeqDance, a protein language model designed to learn representations of protein dynamic properties directly from sequence alone. SeqDance is pre-trained on dynamic biophysical properties derived from over 30,400 molecular dynamics trajectories and 28,600 normal mode analyses. Our results show that SeqDance effectively captures local dynamic interactions, co-movement patterns, and global conformational features, even for proteins lacking homologs in the pre-training set. Additionally, we show that SeqDance enhances the prediction of protein fitness landscapes, disorder-to-order transition binding regions, and phase-separating proteins. By learning dynamic properties from sequence, SeqDance complements conventional evolution- and static structure-based methods, offering new insights into protein behavior and function.
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the lack of an efficient representation of protein dynamic properties. Protein function depends largely on dynamic structural ensembles, yet the complexity of dynamics data and the absence of efficient representation methods have limited their use in studies of protein function and mutation fitness, especially in deep learning applications. To overcome this, the authors propose SeqDance, a protein language model designed to learn dynamic properties directly from protein sequences.

### Main Issues:

1. **Representation of Protein Dynamic Properties**: Existing methods rely mainly on static structures or evolutionary information, which do not fully capture dynamic behaviors such as local dynamic interactions, co-movement patterns, and global conformational features.
2. **Computational Efficiency and Data Complexity**: Molecular dynamics (MD) simulations and normal mode analysis (NMA) generate rich dynamic data, but these data are usually high-dimensional and irregular, which makes them difficult to use directly in deep learning models.
3. **Applicability to Proteins without Homologous Sequences**: Many proteins lack known homologous sequences, so methods based on evolutionary information are less effective for them.

### Solution:

- **SeqDance Model**: SeqDance is pre-trained on dynamic properties derived from over 30,400 MD trajectories and 28,600 normal mode analyses. Once pre-trained, it predicts dynamic features directly from sequence, including local dynamic interactions, co-movement patterns, and global conformational features.
- **Dynamic Feature Extraction**: Rich residue-level and pairwise dynamic features are extracted from MD simulations and NMA, and these features are used to pre-train the SeqDance model.
- **Generalization Ability**: SeqDance performs well on proteins with homologs in the pre-training set and remains effective for proteins that lack them.

### Applications:

- **Protein Fitness Landscape Prediction**: SeqDance improves the prediction of protein fitness landscapes, particularly the impact of mutations on protein folding stability.
- **Disorder-to-Order Transition Binding Region Prediction**: SeqDance can identify binding regions that undergo disorder-to-order transitions.
- **Phase Separation Protein Prediction**: SeqDance aids in predicting phase-separating proteins.

In summary, SeqDance addresses the shortcomings of existing methods by learning dynamic properties from protein sequences, providing new tools and perspectives for the study of protein function and mutation fitness.
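
To make the idea of residue-level and pairwise dynamic features more concrete, the sketch below computes two standard quantities from a toy trajectory: per-residue root-mean-square fluctuation (RMSF) and a normalized cross-correlation matrix of residue displacements. This is an illustration only, not the SeqDance pipeline. The summary does not list the exact features the authors extract, so the choice of RMSF and cross-correlation, the function names `residue_rmsf` and `cross_correlation`, and the synthetic trajectory are all assumptions.

```python
import numpy as np

def residue_rmsf(traj):
    """Per-residue RMSF, a residue-level feature.

    traj: array of shape (n_frames, n_residues, 3), assumed to be
    already aligned to a common reference frame.
    """
    disp = traj - traj.mean(axis=0)                 # displacement from mean position
    return np.sqrt((disp ** 2).sum(axis=2).mean(axis=0))

def cross_correlation(traj):
    """Normalized cross-correlation of residue displacements, a pairwise feature.

    Entry (i, j) is <dr_i . dr_j> / sqrt(<dr_i^2> <dr_j^2>), in [-1, 1].
    """
    disp = traj - traj.mean(axis=0)
    cov = np.einsum("fid,fjd->ij", disp, disp) / traj.shape[0]
    norm = np.sqrt(np.diag(cov))
    return cov / np.outer(norm, norm)

# Synthetic stand-in for an MD trajectory (hypothetical data, not from the paper):
# residues 0 and 1 share a large common motion, the rest only jitter.
rng = np.random.default_rng(0)
n_frames, n_res = 200, 5
base = rng.normal(size=(n_res, 3)) * 10.0
traj = base + rng.normal(scale=0.1, size=(n_frames, n_res, 3))
traj[:, 0:2, :] += rng.normal(size=(n_frames, 1, 3))

rmsf = residue_rmsf(traj)        # shape (n_res,)
dccm = cross_correlation(traj)   # shape (n_res, n_res)
```

In this toy setup, the two residues that move together get a higher RMSF and a cross-correlation close to 1, which is the kind of co-movement signal a sequence model could be trained to predict as a regression target.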