Abstract:

Background: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which makes them highly significant to modern proteomics research. Despite advances in experimental technology, determining protein–protein interactions (PPIs) is still expensive, laborious, and time-consuming, and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Given the large amount of PPI data, a high-performance processor can be used to enhance the capability of deep learning methods and predict interactions directly from protein sequences.

Results: We propose the Sequence-Statistics-Content (SSC) protein sequence encoding format, which extracts information from the original sequence to further improve the performance of the convolutional neural network. The original protein sequences are encoded in a three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which enriches the sequence features available to the deep learning model and enhances its performance. On protein–protein interaction prediction tasks, the 2D convolutional neural network (2D CNN) with SSC encoding gives better results than the 1D CNN with one-hot encoding. Independent validation on new interactions from the HIPPIE database (version 2.1, published on July 18, 2017) and validation of directly predicted results with a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model.

Conclusion: The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods.
Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/ .
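
To make the three-channel idea in the abstract more concrete, the following is a minimal sketch of how a sequence might be turned into a sequence channel, a statistics channel, and a bigram channel. The abstract does not define the exact contents of each channel, so everything here is an assumption for illustration: the function name `encode_three_channel`, the use of residue frequencies for the second channel, the use of bigram frequencies for the third, and the zero padding to `max_len`. It is not the authors' implementation, which is available at the repository linked above.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_three_channel(seq, max_len):
    """Illustrative three-channel encoding (hypothetical, not the published SSC format).

    Returns a nested list of shape [3][max_len][20]:
      channel 0: one-hot encoding of each residue
      channel 1: relative frequency of that residue in the sequence (assumed statistic)
      channel 2: relative frequency of the bigram starting at that position (assumed)
    Positions beyond the sequence, and unknown residues, stay all-zero.
    """
    seq = seq.upper()[:max_len]  # truncate sequences longer than max_len
    width = len(AMINO_ACIDS)
    channels = [[[0.0] * width for _ in range(max_len)] for _ in range(3)]

    residue_counts = Counter(seq)
    bigrams = [seq[i:i + 2] for i in range(len(seq) - 1)]
    bigram_counts = Counter(bigrams)

    for pos, aa in enumerate(seq):
        col = AA_INDEX.get(aa)
        if col is None:
            continue  # non-standard residue: leave this row as zeros
        channels[0][pos][col] = 1.0
        channels[1][pos][col] = residue_counts[aa] / len(seq)
        if pos < len(seq) - 1:  # the last residue starts no bigram
            channels[2][pos][col] = bigram_counts[seq[pos:pos + 2]] / len(bigrams)
    return channels


if __name__ == "__main__":
    enc = encode_three_channel("ACAC", max_len=6)
    print(len(enc), len(enc[0]), len(enc[0][0]))
```

Stacking the channels this way gives each protein an image-like input, which is what allows a 2D CNN to be applied where plain one-hot encoding is typically paired with a 1D CNN.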

Lite-SeqCNN: A Light-Weight Deep CNN Architecture for Protein Function Prediction

Bi-SeqCNN: A Novel Light-weight Bi-directional CNN Architecture for Protein Function Prediction

ProtienCNN‐BLSTM: An efficient deep neural network with amino acid embedding‐based model of protein sequence classification and biological analysis

Leveraging Sequence Embedding and Convolutional Neural Network for Protein Function Prediction

MUST-CNN: A Multilayer Shift-and-Stitch Deep Convolutional Architecture for Sequence-based Protein Structure Prediction

DEEPGONET: Multi-label Prediction of GO Annotation for Protein from Sequence Using Cascaded Convolutional and Recurrent Network

DeepGOPlus: improved protein function prediction from sequence

TAWFN: A Deep Learning Framework for Protein Function Prediction

Performance improvement for a 2D convolutional neural network by using SSC encoding on protein–protein interaction tasks

Protein Secondary Structure Prediction Using Deep Multi-scale Convolutional Neural Networks and Next-Step Conditioning

PepCNN deep learning tool for predicting peptide binding residues in proteins using sequence, structural, and language model features

An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences

Sequence-based Protein-Protein Interaction Prediction Using Multi-kernel Deep Convolutional Neural Networks with Protein Language Model

DeepLA: A deep learning-based model for predicting protein function from protein sequence and evolutionary information.

A Fusion-Driven Approach of Attention-Based CNN-BiLSTM for Protein Family Classification -- ProFamNet

A Protein Structure Prediction Approach Leveraging Transformer and CNN Integration

Convolutions are competitive with transformers for protein sequence pretraining

ILMCNet: A Deep Neural Network Model That Uses PLM to Process Features and Employs CRF to Predict Protein Secondary Structure

Deep Learning Methods for Protein Family Classification on PDB Sequencing Data

ProtTrans and Multi-Window Scanning Convolutional Neural Networks for the Prediction of Protein-Peptide Interaction Sites

Protein secondary structure prediction using deep convolutional neural fields