Boosting Convolutional Neural Networks' Protein Binding Site Prediction Capacity Using SE(3)-invariant transformers, Transfer Learning and Homology-based Augmentation

Daeseok Lee, Jeunghyun Byun, Bonggun Shin
2023-04-18
Abstract: Identifying small-molecule binding sites in target proteins, at either pocket or residue resolution, is critical in many virtual and real drug-discovery scenarios. Since it is not always easy to find such binding sites based on domain knowledge or traditional methods, different deep learning methods that predict binding sites from protein structures have been developed in recent years. Here we present a new deep learning algorithm of this kind that significantly outperformed all state-of-the-art baselines at both resolutions – pocket and residue. This good performance was also demonstrated in a case study involving the protein human serum albumin and its binding sites. Our algorithm included new ideas both in the model architecture and in the training method. For the model architecture, it incorporated SE(3)-invariant geometric self-attention layers that operate on top of residue-level CNN outputs. This residue-level processing allowed transfer learning between the two resolutions, which turned out to significantly improve the binding pocket prediction. Moreover, we developed a novel augmentation method based on protein homology, which prevented our model from over-fitting. Overall, we believe that our contribution to the literature is twofold. First, we provided a new computational method for binding site prediction that is relevant to real-world applications, as shown by the good performance on different benchmarks and in a case study. Second, the novel ideas in our method – the model architecture, transfer learning and the homology augmentation – can serve as useful components in future work.
Quantitative Methods, Machine Learning, Neural and Evolutionary Computing, Biomolecules
What problem does this paper attempt to address?
The paper addresses Binding Site Prediction (BSP) for proteins: identifying where small molecules bind on a target protein. The authors propose a deep learning algorithm that improves prediction at both the pocket and residue levels. It includes the following innovations:
1. **Model Architecture**: It employs SE(3)-invariant geometric self-attention layers that operate on the outputs of residue-level Convolutional Neural Networks (CNNs). Because processing happens at the residue level, transfer learning between the two resolutions becomes possible.
2. **Training Method**: It applies transfer learning between the Binding Site Detection (BSD) and Binding Residue Identification (BRI) tasks, initializing some parameters from the other task's model to improve the performance of the BSD module.
3. **Data Augmentation**: A novel data augmentation method based on protein homology is developed to prevent overfitting.
With these innovations, the algorithm performs strongly on multiple benchmarks and in a case study (human serum albumin), with significant performance improvements in the BRI task. Ablation experiments validate the effectiveness of each component of the model and explore their impact on the BSD task. In summary, the paper presents a structured BSP solution with practical application potential and provides components and techniques that future research can reuse.
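To make the term "SE(3)-invariant" concrete, the following is a minimal NumPy sketch, not the authors' architecture. It assumes a single attention head whose logits are biased by pairwise residue distances; the function names, the `length_scale` parameter, and the random toy data are all hypothetical. It also assumes the per-residue features themselves do not change when the protein is moved. Because distances are unchanged by rotations and translations, the output is the same for the original and the moved coordinates.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of residue positions."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def invariant_self_attention(features, coords, length_scale=8.0):
    """Single-head self-attention with a distance-based logit bias.

    Geometry enters only through pairwise distances, so the result does
    not depend on how the structure is rotated or translated.
    `length_scale` is a hypothetical hyperparameter for this sketch.
    """
    dim = features.shape[-1]
    logits = features @ features.T / np.sqrt(dim)
    logits = logits - pairwise_distances(coords) / length_scale
    # Numerically stable softmax over each row.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ features


def random_rigid_motion(rng):
    """Random proper rotation (det = +1) and translation."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis to turn a reflection into a rotation
    return q, rng.normal(size=3)


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 4))      # toy per-residue features
coords = rng.normal(size=(5, 3)) * 10   # toy residue positions

rotation, translation = random_rigid_motion(rng)
moved = coords @ rotation.T + translation

out_original = invariant_self_attention(features, coords)
out_moved = invariant_self_attention(features, moved)
print(np.allclose(out_original, out_moved))
```

Restricting the geometric input to distances is only one simple way to obtain invariance; the paper's layers may encode geometry differently.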