Abstract: Background: Proteins and nucleic acids are vital biomolecules that contribute significantly to biological life. The precise and efficient identification of hot spots at protein-nucleic acid interfaces is crucial for guiding drug development, advancing protein engineering, and exploring the underlying molecular recognition mechanisms. Because experimental methods such as alanine scanning mutagenesis are time-consuming and expensive, a growing number of machine learning techniques are being employed to predict hot spots. However, existing approaches suffer from a lack of uniform standards, a scarcity of data, and a wide range of features. No comprehensive overview or evaluation of this field currently exists, so a systematic review is needed. Methods: In this study, we present an overview of cutting-edge machine learning approaches for hot spot prediction in protein-nucleic acid complexes. We also outline the feature categories currently in use, derived from relevant biological data sources, and assess conventional feature selection methods on 600 extracted features. In addition, we create two new benchmark datasets, PDHS87 and PRHS48, and develop distinct binary classification models on these datasets to evaluate the advantages and disadvantages of various machine learning techniques. Results: Predicting hot spots in protein-nucleic acid interactions is a challenging task. The study demonstrates that structural neighborhood features play a crucial role in identifying hot spots. Prediction performance can be improved by choosing effective feature selection and machine learning methods. Among the existing prediction methods, XGBPRH performs best. Conclusion: It is crucial to continue studying hot spot theories, discover new and effective features, add accurate experimental data, and utilize DNA/RNA information. Semi-supervised learning, transfer learning, and ensemble learning can optimize predictive ability. Combining computational docking with machine learning methods can potentially further improve predictive performance.

Identification of DNA adduct formation of small molecules by molecular descriptors and machine learning methods

Predicting the Androgenicity of Structurally Diverse Compounds from Molecular Structure Using Different Classifiers

Distance-based support vector machine to predict DNA N6-methyladenine modification

Machine learning based predictive analysis of DNA cleavage induced by diverse nanomaterials

Analyzing DNA Hybridization via machine learning

Prediction of nucleic acid-binding proteins using support vector machines

Identification of D Modification Sites Using a Random Forest Model Based on Nucleotide Chemical Properties

In Silico Identification of Human Pregnane X Receptor Activators from Molecular Descriptors by Machine Learning Approaches

Predicting DNA Reactions with a Quantum Chemistry‐Based Deep Learning Model

Prediction of Genotoxicity of Chemical Compounds by Statistical Learning Methods

A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites

Leveraging the attention mechanism to improve the identification of DNA N6-methyladenine sites

A Machine Learning Approach to Calculate Electronic Couplings Between Quasi-Diabatic Molecular Orbitals: the Case of DNA

Efficient Prediction of DNA-Binding Proteins Using Machine Learning

Predicting DNA structure using a deep learning method

Weighted Fuzzy System for Identifying DNA N4-Methylcytosine Sites With Kernel Entropy Component Analysis

Computational Methods for Predicting DNA Binding Proteins

Prediction of m5C Modifications in RNA Sequences by Combining Multiple Sequence Features

Thorough Assessment of Machine Learning Techniques for Predicting Protein-Nucleic Acid Binding Hot Spots

PSATF-6mA: an integrated learning fusion feature-encoded DNA-6 mA methylcytosine modification site recognition model based on attentional mechanisms