An ensemble approach to predict binding hotspots in protein–RNA interactions based on SMOTE data balancing and Random Grouping feature selection strategies

Tong Zhou,Jie Rong,Yang Liu,Weikang Gong,Chunhua Li
DOI: https://doi.org/10.1093/bioinformatics/btac138
IF: 5.8
2022-03-07
Bioinformatics
Abstract:Abstract Motivation The identification of binding hotspots in protein–RNA interactions is crucial for understanding their potential recognition mechanisms and drug design. The experimental methods have many limitations, since they are usually time-consuming and labor-intensive. Thus, developing an effective and efficient theoretical method is urgently needed. Results Here, we present SREPRHot, a method to predict hotspots, defined as the residues whose mutation to alanine generate a binding free energy change ≥2.0 kcal/mol, while others use a cutoff of 1.0 kcal/mol to obtain balanced datasets. To deal with the dataset imbalance, Synthetic Minority Over-sampling Technique (SMOTE) is utilized to generate minority samples to achieve a dataset balance. Additionally, besides conventional features, we use two types of new features, residue interface propensity previously developed by us, and topological features obtained using node-weighted networks, and propose an effective Random Grouping feature selection strategy combined with a two-step method to determine an optimal feature set. Finally, a stacking ensemble classifier is adopted to build our model. The results show SREPRHot achieves a good performance with SEN, MCC and AUC of 0.900, 0.557 and 0.829 on the independent testing dataset. The comparison study indicates SREPRHot shows a promising performance. Availability and implementation The source code is available at https://github.com/ChunhuaLiLab/SREPRHot. Supplementary information Supplementary data are available at Bioinformatics online.
biochemical research methods,biotechnology & applied microbiology,mathematical & computational biology
What problem does this paper attempt to address?