GPCR-BERT: Interpreting Sequential Design of G Protein Coupled Receptors Using Protein Language Models

Seongwon Kim, Parisa Mollaei, Akshay Antony, Rishikesh Magar, Amir Barati Farimani
2023-10-31
Abstract: With the rise of Transformers and Large Language Models (LLMs) in Chemistry and Biology, new avenues for the design and understanding of therapeutics have opened up to the scientific community. Protein sequences can be modeled as language and can take advantage of recent advances in LLMs, specifically with the abundance of our access to the protein sequence datasets. In this paper, we developed the GPCR-BERT model for understanding the sequential design of G Protein-Coupled Receptors (GPCRs). GPCRs are the target of over one-third of FDA-approved pharmaceuticals. However, there is a lack of comprehensive understanding regarding the relationship between amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, E/DRY). By utilizing the pre-trained protein model (Prot-Bert) and fine-tuning with prediction tasks of variations in the motifs, we were able to shed light on several relationships between residues in the binding pocket and some of the conserved motifs. To achieve this, we took advantage of attention weights, and hidden states of the model that are interpreted to extract the extent of contributions of amino acids in dictating the type of masked ones. The fine-tuned models demonstrated high accuracy in predicting hidden residues within the motifs. In addition, the analysis of embedding was performed over 3D structures to elucidate the higher-order interactions within the conformations of the receptors.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
The paper develops a model named GPCR-BERT to understand the higher-order interactions in the sequence design of G protein-coupled receptors (GPCRs) and to explore how the conserved motifs in these receptors relate to their function. Specifically, it addresses three questions:

1. **Correlation between conserved-motif variations and other amino acids**: How do variations within the conserved motifs of GPCRs (such as NPxxY, CWxP, and E/DRY) correlate with amino acids elsewhere in the sequence?
2. **Predicting the complete sequence from partial sequences**: Can the entire amino acid sequence of a GPCR be predicted from a partially known sequence?
3. **Identification of key amino acids**: Which amino acids contribute most to conformational changes in GPCRs and may therefore play important roles in receptor function?

To answer these questions, the researchers took the pre-trained protein language model Prot-BERT and fine-tuned it on GPCR sequences. By analyzing attention weights and hidden states, they identified which amino acids most influence the model's prediction of the masked residues within the conserved motifs. The paper also compares GPCR-BERT with other machine learning models (such as the original BERT and SVM) and reports that GPCR-BERT performs best on the prediction tasks.

Taken together, the study offers new insights into the sequence design of GPCRs and shows how natural language processing techniques can be used to interpret and predict functional characteristics of biomolecules, which may inform future drug design and protein engineering.
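
To make the masking and attention-ranking steps described above concrete, here is a minimal sketch in plain Python. It is not the authors' code: the regular expressions, the function names `mask_motif` and `top_attended`, the toy sequence, and the toy attention values are all hypothetical, and the space-separated residue format with a `[MASK]` token is assumed from common BERT-style tokenization. A real pipeline would obtain the attention weights from the fine-tuned model rather than from a hand-written list.

```python
import re

# Illustrative patterns for the three motifs named in the text.
# The variable ("x" or E/D) positions are captured so they can be masked.
MOTIF_PATTERNS = {
    "NPxxY": re.compile(r"NP(..)Y"),
    "CWxP": re.compile(r"CW(.)P"),
    "E/DRY": re.compile(r"([ED])RY"),
}


def mask_motif(sequence, motif, mask_token="[MASK]"):
    """Mask the variable residues of the first occurrence of `motif`.

    Returns (tokens, masked_positions, hidden_residues), or None if the
    motif does not occur in the sequence.
    """
    match = MOTIF_PATTERNS[motif].search(sequence)
    if match is None:
        return None
    start, end = match.span(1)          # span of the variable positions
    tokens = list(sequence)             # one token per residue
    hidden = tokens[start:end]          # labels the model must predict
    for i in range(start, end):
        tokens[i] = mask_token
    return tokens, list(range(start, end)), hidden


def top_attended(attention_row, tokens, k=3):
    """Rank positions by the attention one masked position pays to them."""
    ranked = sorted(range(len(attention_row)),
                    key=lambda i: attention_row[i], reverse=True)
    return [(i, tokens[i], attention_row[i]) for i in ranked[:k]]


# Toy sequence (hypothetical, not a real receptor).
toy_sequence = "MKTLLNPVIYAAFDRYCWLP"
tokens, positions, hidden = mask_motif(toy_sequence, "NPxxY")
model_input = " ".join(tokens)          # assumed space-separated input format

# Toy attention row for the first masked position (made-up numbers).
attention_row = [0.01] * len(tokens)
attention_row[13] = 0.30
attention_row[18] = 0.25
attention_row[5] = 0.20

print(model_input)
print(hidden, positions)
print(top_attended(attention_row, tokens))
```

In this sketch the hidden residues serve as the prediction labels, and the ranked list stands in for the kind of per-residue contribution analysis the summary describes. With a real model, the attention row would typically be averaged over heads or layers before ranking, which is a design choice the source does not specify.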