MULAN: Multimodal Protein Language Model for Sequence and Structure Encoding

Daria Frolova, Marina Pak, Anna Litvin, Ilya Sharov, Dmitry Ivankov, Ivan Oseledets
DOI: https://doi.org/10.1101/2024.05.30.596565
2024-06-02
Abstract: Most protein language models (PLMs), which are used to produce high-quality protein representations, use only protein sequences during training. However, the known protein structure is crucial in many protein property prediction tasks, so there is a growing interest in incorporating the knowledge about the protein structure into a PLM. In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN has a pre-trained sequence encoder and an introduced Structure Adapter, which are then fused and trained together. According to the evaluation on 7 downstream tasks of various nature, both small and medium-sized MULAN models show consistent improvement in quality compared to both sequence-only ESM-2 and structure-aware SaProt. Importantly, our model offers a cheap increase in the structural awareness of the protein representations due to finetuning of existing PLMs instead of training from scratch. We perform a detailed analysis of the proposed model and demonstrate its awareness of the protein structure. The implementation, training data and model checkpoints are available at https://github.com/DFrolova/MULAN.
Bioinformatics
What problem does this paper attempt to address?
The main objective of this paper is to propose a new multimodal protein language model, MULAN, that improves the learning of protein representations by integrating protein sequence and structural information. Specifically, MULAN combines a sequence encoder with a newly introduced Structure Adapter, which handles angle-based protein structural information. The key contributions of the paper include:

1. **MULAN Model**: A new multimodal protein language model that handles protein sequence and structural data simultaneously. It includes a pre-trained sequence encoder and a Structure Adapter that processes the backbone and side-chain torsion angles of protein structures.
2. **Structure Adapter**: An important component of MULAN, which uses the torsion angles of a protein structure to represent structural information. It works on top of existing protein language models without training the entire model from scratch, and so improves the model's understanding of protein structure at a lower cost.
3. **Performance Evaluation**: The paper evaluates MULAN on various downstream tasks, including protein stability prediction, fluorescence prediction, metal ion binding prediction, human protein-protein interaction prediction, and gene ontology classification. The results show that MULAN outperforms sequence-based models (such as ESM-2) and models that already consider some structural information (such as SaProt) on these tasks.
4. **Detailed Analysis**: The authors also conducted extensive ablation experiments to verify the effectiveness and structure-awareness of the model. For example, they demonstrated the importance of the Structure Adapter by comparing the performance of models under different settings and showed how masking uncertain structural predictions can reduce noise.
5. **Secondary Structure Prediction**: To further validate MULAN's ability to perceive protein structures, the paper also evaluates the model on the protein secondary structure prediction task. The results indicate that MULAN also achieves significant improvements on this task.

In summary, this study enhances the utilization of structural information in protein language models by introducing the Structure Adapter and confirms the effectiveness of this approach on multiple downstream tasks.
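To make the angle-based structure input described in the summary more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). It shows how a torsion angle can be computed from four atom positions, how angles might be encoded as sine/cosine features, how residues with uncertain structure could be masked, and how the result could be added to sequence embeddings. The function names, the sine/cosine encoding, the `min_confidence` threshold of 0.7, and the additive fusion through a linear projection are all assumptions made for illustration.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in radians defined by four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(_cross(b1, b2), n2)
    return math.atan2(y, x)


def angle_features(angles, confidence, min_confidence=0.7):
    """Encode each residue's torsion angles as (sin, cos) pairs.

    Residues whose confidence is below min_confidence (a hypothetical
    threshold) get an all-zero feature vector, i.e. they are masked.
    """
    features = []
    for residue_angles, conf in zip(angles, confidence):
        if conf < min_confidence:
            features.append([0.0] * (2 * len(residue_angles)))
            continue
        row = []
        for angle in residue_angles:
            row.extend((math.sin(angle), math.cos(angle)))
        features.append(row)
    return features


def fuse(seq_embeddings, struct_features, projection):
    """Add linearly projected structure features to sequence embeddings.

    projection has one weight row per embedding dimension; additive
    fusion is an assumption of this sketch, not a claim about MULAN.
    """
    fused = []
    for emb, feat in zip(seq_embeddings, struct_features):
        projected = [_dot(feat, weights) for weights in projection]
        fused.append([e + p for e, p in zip(emb, projected)])
    return fused
```

In this sketch, a masked residue contributes nothing from the structure branch, so its fused representation is just its sequence embedding. This is one simple way to keep uncertain structural predictions from adding noise.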