Abstract: In this work we present a system based on a Deep Learning approach, using a Convolutional Neural Network, capable of classifying protein chains of amino acids based on the protein description contained in the Protein Data Bank. Each protein is fully described in its chemical-physical-geometric properties in a file in XML format. The aim of the work is to design a prototypical Deep Learning machinery for the collection and management of a huge amount of data and to validate it through its application to the classification of sequences of amino acids. We envisage applying the described approach to more general classification problems in biomolecules, related to structural properties and similarities.

What problem does this paper attempt to address?

This paper addresses the binary classification of protein chains with machine learning, specifically convolutional neural networks (CNNs): deciding whether a given amino acid sequence represents a "real" protein. The data come from the Protein Data Bank (PDB), where each protein is stored as an XML file that describes its chemical-physical-geometric properties in detail. The main goal of the study is to design a prototype deep learning system for collecting and managing large amounts of data, and to validate it by applying it to amino acid sequence classification. The researchers also plan to apply the method to a broader range of biomolecular classification problems involving structural properties and similarities.

The specific challenges mentioned in the paper include the size of the data set (more than 160,000 records; 57.6 GB compressed and more than 2 TB after decompression), the variation in protein sequence lengths (ranging from 1 to 1,000,000 amino acids), and the need to pre-process the data to meet the input requirements of the neural network.

The researchers chose a CNN-based method, which is an unconventional choice because recurrent neural networks (RNNs) may be the more intuitive option for sequence data. However, they found that CNNs provide satisfactory results, have lower hardware requirements, and are easier to implement on multiple platforms. Through this research, the authors hope to explore the potential of deep learning for modeling the relationship between protein structure and function, especially in the analysis of multi-scale dynamic characteristics, which has long been a difficult problem for traditional molecular dynamics simulations.
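
Because the PDB descriptions are large XML files, one way to keep memory use modest is to stream each file and keep only the sequence text. The following is a minimal sketch of that idea, not the authors' pipeline. The helper names `local_name` and `iter_sequences`, the default tag name `pdbx_seq_one_letter_code`, and the tiny sample document are all assumptions made for illustration; the real element names should be checked against the actual files.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace of the form '{uri}name' from a tag."""
    return tag.rsplit("}", 1)[-1]


def iter_sequences(source, seq_tag="pdbx_seq_one_letter_code"):
    """Yield cleaned sequence strings from one XML document.

    `source` is a file name or a binary file object. `seq_tag` is an
    ASSUMED element name, not taken from the paper.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == seq_tag and elem.text:
            # Sequences may be wrapped over several lines; drop whitespace.
            yield "".join(elem.text.split())
        # Discard the element's content once seen to limit memory growth.
        elem.clear()


# Hypothetical miniature document standing in for a real PDB XML file.
sample = b"""<?xml version="1.0"?>
<datablock xmlns="urn:example">
  <entity_poly entity_id="1">
    <pdbx_seq_one_letter_code>MKTAYIAK
QRQISFVK</pdbx_seq_one_letter_code>
  </entity_poly>
</datablock>"""

sequences = list(iter_sequences(io.BytesIO(sample)))
print(sequences)
```

For compressed files, the same function can be given a file object opened with `gzip.open(path, "rb")`, so the archive never needs to be fully decompressed on disk.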
Future work will focus on further optimizing the model, for example by adjusting hyperparameters, improving the data preparation and management process, and reducing the number of "true negative" results caused by training data padding. The researchers are also considering extending the CNN model to analyze subtler characteristics of protein amino acid sequences and to classify their molecular structures and energies.
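
The variable sequence lengths and the padding mentioned above can be illustrated with a short sketch. It turns a sequence into a fixed-size one-hot matrix by truncating long chains and zero-padding short ones. This is a common preparation step, not necessarily the one used in the paper; the alphabet, the extra "unknown" channel, the function name `encode`, and the lengths in the usage lines are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN = len(AMINO_ACIDS)                  # assumed extra channel, e.g. for "X"
CHANNELS = len(AMINO_ACIDS) + 1


def encode(sequence, max_len):
    """Return a max_len x CHANNELS one-hot matrix as a list of lists.

    Sequences longer than max_len are truncated; shorter ones are
    padded with all-zero rows.
    """
    rows = []
    for residue in sequence[:max_len].upper():
        row = [0] * CHANNELS
        row[INDEX.get(residue, UNKNOWN)] = 1
        rows.append(row)
    # Zero rows mark padding positions after the end of the chain.
    rows.extend([0] * CHANNELS for _ in range(max_len - len(rows)))
    return rows


short = encode("ACX", 5)       # padded: 3 residue rows + 2 zero rows
long_ = encode("ACDEFG", 4)    # truncated to the first 4 residues
print(len(short), len(long_))
```

Every input then has the same shape, which is what a convolutional network needs. The cost is that very short chains consist mostly of padding rows, which is one reason padding can affect the classification results.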

Binary classification of proteins by a Machine Learning approach

Prediction of Functional Class of Proteins and Peptides Irrespective of Sequence Homology by Support Vector Machines.

Deep Learning Methods for Protein Family Classification on PDB Sequencing Data

Macromolecule Classification Based on the Amino-acid Sequence

Hybrid Transformer and Neural Network Configuration for Protein Classification Using Amino Acids

An Efficient Deep Learning Approach for DNA-Binding Proteins Classification from Primary Sequences

Surgical management of aortopulmonary window associated with interrupted aortic arch: A Congenital Heart Surgeons Society study

Classification of Macromolecule Type Based on Sequences of Amino Acids Using Deep Learning

Deep learning-based proteomics enables accurate classification of bulk and single-cell samples

Protein sequence classification using natural language processing techniques

Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?

Predicting human protein function with multi-task deep neural networks

Protein Structure Prediction: Conventional and Deep Learning Perspectives

ProtienCNN‐BLSTM: An efficient deep neural network with amino acid embedding‐based model of protein sequence classification and biological analysis

A Graph Neural Network Approach

Multi-task Deep Neural Networks in Automated Protein Function Prediction

SBSM-Pro: Support Bio-sequence Machine for Proteins

DeepProtein: Deep Learning Library and Benchmark for Protein Sequence Learning

Computational Protein Design with Deep Learning Neural Networks

A Review of Deep Learning Techniques for Protein Function Prediction

PRED-CLASS: cascading neural networks for generalized protein classification and genome-wide applications