Abstract: Binding sites are the key interfaces that determine a protein's biological activity, and therefore common targets for therapeutic intervention. Techniques that help us detect, compare and contextualise binding sites are hence of immense interest to drug discovery. Here we present an approach that integrates protein language models with a 3D tessellation technique to derive rich and versatile representations of binding sites that combine functional, structural and evolutionary information with unprecedented detail. We demonstrate that the associated similarity metrics induce meaningful pocket clusterings by balancing local structure against global sequence effects. The resulting embeddings are shown to simplify a variety of downstream tasks: they help organise the pocketome in a way that efficiently contextualises new binding sites, construct performant druggability models, and define challenging train-test splits for believable benchmarking of pocket-centric machine-learning models.

What problem does this paper attempt to address?

The paper addresses the problem of identifying protein binding sites and understanding their behavior, particularly in the context of target discovery and structure-based molecular design. Specifically, it proposes a new method called EPoCS (ESM-driven Pocket Cross-Similarity), which generates representations of protein binding sites and compares different sites using multi-scale features.

### Main Issues:

1. **Multi-scale Representation**: There is currently a lack of transferable metrics that balance structural, physicochemical, sequence, and functional information.
2. **Application of Protein Language Models**: Protein language models (such as ESM-2) can capture functional, evolutionary, and structural relationships at a low cost.
3. **Benchmarking Machine Learning Models**: Existing train-test set splits have data leakage issues, leading to inaccurate performance estimates.

### Solutions:

- **EPoCS Method**: By combining 3D structure processing techniques and protein language models (such as ESM-2), a universal and powerful metric for protein binding site similarity is constructed.
- **Multi-scale Representation**: By mapping embedding vectors onto 3D structures, a multi-scale description from local structure to global sequence is achieved.
- **Visual Query and Context Matching**: Embedding vectors generated by EPoCS are used for visual queries and real-time context matching.
- **Unbiased Benchmarking**: Benchmarking of machine learning models is improved through reasonable train-test set splits that avoid data leakage.

### Experimental Validation:

- **Similarity Metric Comparison**: EPoCS performs well in multiple benchmarks, capturing local, regional, and global similarity structures.
- **Clustering Analysis**: Clustering maps generated through hierarchical clustering techniques demonstrate the functional and structural relationships of binding sites.
- **Unbiased Train-Test Set Splits**: Using EPoCS, a series of progressively challenging train-test sets were generated to evaluate the model's generalization ability and robustness.

In summary, the paper proposes a new method that describes and compares protein binding sites at multiple scales, thereby supporting research in related fields.
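
The pocket comparison and leakage-aware splitting summarized above can be illustrated with a minimal sketch. It is not the EPoCS implementation: the 3D tessellation step is replaced here by plain mean-pooling over a given list of pocket residue indices, the per-residue vectors are made-up toy numbers rather than ESM-2 outputs, and the single-linkage threshold clustering is an assumed stand-in for the paper's clustering. All function names (`pool_pocket`, `cosine_similarity`, `similarity_clusters`, `cluster_split`) are hypothetical.

```python
import math


def pool_pocket(residue_embeddings, pocket_indices):
    """Mean-pool per-residue vectors over the residues lining a pocket.

    Assumption: simple averaging stands in for the structure-aware mapping
    described in the paper.
    """
    dim = len(residue_embeddings[0])
    pooled = [0.0] * dim
    for i in pocket_indices:
        for d in range(dim):
            pooled[d] += residue_embeddings[i][d]
    return [x / len(pocket_indices) for x in pooled]


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_clusters(embeddings, threshold):
    """Single-linkage clusters: any pair with similarity >= threshold is merged."""
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def cluster_split(embeddings, threshold, test_fraction=0.25):
    """Assign whole clusters to the test set, smallest first, up to test_fraction.

    Because clusters are never divided, no test pocket has similarity
    >= threshold to any training pocket.
    """
    clusters = sorted(similarity_clusters(embeddings, threshold), key=len)
    n_test_target = test_fraction * len(embeddings)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= n_test_target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Toy per-residue embeddings for one protein (made-up values).
    protein = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
    print(pool_pocket(protein, [0, 1]))

    # Toy pocket embeddings: two pairs of near-duplicates.
    pockets = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
    print(cluster_split(pockets, threshold=0.9, test_fraction=0.5))
```

In this sketch, lowering `threshold` merges more pockets into each cluster and so pushes train and test further apart, which is one simple way to obtain progressively harder splits.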

Mapping the space of protein binding sites with sequence-based protein language models

Genome-scale annotation of protein binding sites via language model and geometric deep learning

Language models can identify enzymatic binding sites in protein sequences

DeepProSite: Structure-aware Protein Binding Site Prediction Using ESMFold and Pretrained Language Model

Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence

Protein embeddings predict binding residues in disordered regions

Protein-Ligand Binding Site Recognition Using Complementary Binding-Specific Substructure Comparison And Sequence Profile Alignment

BAPULM: Binding Affinity Prediction using Language Models

Exploiting Sequence and Structure Homologs to Identify Protein-Protein Binding Sites

When Protein Structure Embedding Meets Large Language Models

Binding Site Prediction for Protein-Protein Interactions and Novel Motif Discovery using Re-occurring Polypeptide Sequences

Leveraging binding-site structure for drug discovery with point-cloud methods

Protein-small molecule binding site prediction based on a pre-trained protein language model with contrastive learning

Identification of Cavities on Protein Surface Using Multiple Computational Approaches for Drug Binding Site Prediction

μMap Photoproximity Labeling Enables Small Molecule Binding Site Mapping

Identification of Protein-Ligand Binding Sites by Sequence Information and Ensemble Classifier.

Site2Vec: a reference frame invariant algorithm for vector embedding of protein–ligand binding sites

Exploring the computational methods for protein-ligand binding site prediction

Identification of Enzymatic Active Sites with Unsupervised Language Modeling