Binary classification of proteins by a Machine Learning approach

Damiano Perri, Marco Simonetti, Andrea Lombardi, Noelia Faginas-Lago, Osvaldo Gervasi
DOI: https://doi.org/10.1007/978-3-030-58820-5_41
2021-11-03
Abstract: In this work we present a system based on a Deep Learning approach, by using a Convolutional Neural Network, capable of classifying protein chains of amino acids based on the protein description contained in the Protein Data Bank. Each protein is fully described in its chemical-physical-geometric properties in a file in XML format. The aim of the work is to design a prototypical Deep Learning machinery for the collection and management of a huge amount of data and to validate it through its application to the classification of sequences of amino acids. We envisage applying the described approach to more general classification problems in biomolecules, related to structural properties and similarities.
Machine Learning, Biomolecules
What problem does this paper attempt to address?
This paper addresses the binary classification of protein chains with machine-learning methods, specifically convolutional neural networks (CNNs): the task is to determine whether a given amino acid sequence represents a "real" protein. The researchers obtained their data from the Protein Data Bank (PDB), where each protein is stored as an XML file that describes its chemical-physical-geometric properties in detail. The main goal of the study was to design a prototype deep-learning system for collecting and managing large amounts of data and to validate it by applying it to amino acid sequence classification. The researchers also plan to apply the method to a broader range of biomolecular classification problems involving structural properties and similarities.
The specific challenges mentioned in the paper include the size of the data set (more than 160,000 records, 57.6 GB compressed and more than 2 TB after decompression), the variation in protein sequence lengths (ranging from 1 to 1,000,000 amino acids), and the need to pre-process the data to meet the input requirements of the neural network. The researchers chose a CNN-based method, an unconventional choice given that recurrent neural networks (RNNs) are the more intuitive fit for sequence data. They found, however, that CNNs give satisfactory results, have lower hardware requirements, and are easier to implement on multiple platforms.
Through this research, the authors hope to explore the potential of deep learning for modeling the relationship between protein structure and function, especially in the analysis of multi-scale dynamic characteristics, which has long been a difficult problem for traditional molecular dynamics simulations.
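The pre-processing challenge described above (sequences of very different lengths versus a network that expects fixed-size input) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the 20-letter amino acid alphabet, the integer codes, the function name `encode_sequence`, and the `max_len` parameter are all assumptions chosen for illustration.

```python
# Minimal sketch, not the paper's code: map one-letter amino acid codes to
# integers, then truncate or zero-pad every sequence to a fixed length.
# The alphabet, the code values, and the helper name are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
PAD = 0                                # reserved code for padding positions
UNKNOWN = len(AMINO_ACIDS) + 1         # code for any non-standard letter
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode_sequence(sequence, max_len):
    """Return a list of exactly max_len integer codes for one sequence."""
    # Truncate long sequences, then look up each residue (case-insensitive).
    codes = [CODE.get(residue, UNKNOWN) for residue in sequence.upper()[:max_len]]
    # Right-pad short sequences with the reserved padding code.
    return codes + [PAD] * (max_len - len(codes))


if __name__ == "__main__":
    print(encode_sequence("MKTAYIAK", 12))
```

Reserving code 0 for padding keeps padded positions distinguishable from real residues, which matters when judging how padding affects the classifier.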
Future work will focus on further optimizing the model, for example by tuning hyperparameters, improving the data preparation and management process, and reducing the number of "true negative" results caused by padding of the training data. The researchers are also considering extending the CNN model to analyze more subtle characteristics of protein amino acid sequences and to classify their molecular structures and energies.
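Since padding is named as a source of unwanted results, one way to reason about it is to measure how much of a padded data set is padding for a candidate fixed length, and how many sequences that length would truncate. The sketch below is an illustration under assumptions, not part of the paper: the function names and the example lengths are made up.

```python
# Minimal sketch, not from the paper: quantify the trade-off between padding
# and truncation for a candidate fixed input length. Names are hypothetical.
def padding_fraction(lengths, max_len):
    """Share of positions that hold padding after padding/truncating to max_len."""
    if not lengths or max_len <= 0:
        raise ValueError("need at least one sequence and a positive max_len")
    real_positions = sum(min(n, max_len) for n in lengths)
    return 1.0 - real_positions / (len(lengths) * max_len)


def truncated_fraction(lengths, max_len):
    """Share of sequences that are longer than max_len and would be cut."""
    if not lengths:
        raise ValueError("need at least one sequence")
    return sum(1 for n in lengths if n > max_len) / len(lengths)


if __name__ == "__main__":
    lengths = [5, 10, 20]  # made-up sequence lengths, for illustration only
    for max_len in (10, 20):
        print(max_len,
              round(padding_fraction(lengths, max_len), 3),
              round(truncated_fraction(lengths, max_len), 3))
```

A larger fixed length truncates fewer sequences but fills more of the input with padding, so both numbers are worth checking before a length is chosen.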