Abstract: Advanced automated AI techniques allow us to classify protein sequences and discern their biological families and functions. Conventional approaches for classifying these protein families often focus on extracting N-Gram features from the sequences while overlooking crucial motif information and the interplay between motifs and neighboring amino acids. Recently, convolutional neural networks have been applied to amino acid and motif data, even with a limited dataset of well-characterized proteins, resulting in improved performance. This study presents a model for classifying protein families using the fusion of 1D-CNN, BiLSTM, and an attention mechanism, which combines spatial feature extraction, long-term dependencies, and context-aware representations. The proposed model (ProFamNet) achieved superior model efficiency with 450,953 parameters and a compact size of 1.72 MB, outperforming the state-of-the-art model with 4,578,911 parameters and a size of 17.47 MB. Further, we achieved a higher F1 score (98.30% vs. 97.67%) with more instances (271,160 vs. 55,077) in fewer training epochs (25 vs. 30).

What problem does this paper attempt to address?

This paper addresses a limitation of traditional protein sequence classification: conventional methods often extract only N-gram features from the sequence and ignore important motif information and the interaction between motifs and neighboring amino acids. This limits both the accuracy and the efficiency of protein family classification. To address it, the authors propose a deep learning model (ProFamNet) that fuses a 1D-CNN, a BiLSTM, and an attention mechanism, combining spatial feature extraction, long-range dependencies, and context-aware representations to improve protein family classification.

### Main contributions of the paper:

1. **Introduction of a new model**: A new model named ProFamNet is proposed for protein family classification.
2. **Fusion of multiple techniques**: 1D-CNN, BiLSTM, and the attention mechanism are combined to exploit the strengths of each.
3. **Improvement of model efficiency**: Compared with the existing state-of-the-art model, ProFamNet has fewer parameters (450,953 vs. 4,578,911) and a smaller model size (1.72 MB vs. 17.47 MB).
4. **Reduction of training time and resource consumption**: By reducing the number of layers and training epochs (25 vs. 30), the training time and resource consumption are significantly reduced.
5. **Achievement of higher F1 scores on multiple labels**: Higher F1 scores (98.30% vs. 97.67%) are achieved on more instances (271,160 vs. 55,077), demonstrating the effectiveness of the model in bioinformatics.

### Model architecture:

1. **Encoding module**: Each amino acid is mapped to a numerical value, so a sequence becomes an integer array. Amino acids are represented as 24 different numbers.
2. **Embedding module**: Each encoded amino acid value is converted into a continuous vector for subsequent processing.
3. **1D-CNN module**: Convolution operations extract non-linear features from the protein sequence, discover motifs, and strengthen high-level associations.
4. **BiLSTM module**: Long-term dependencies and context information in the sequence are captured by processing the input in both directions.
5. **Attention mechanism**: The attention mechanism enhances the model's focus on important features and improves classification performance.

### Mathematical formulas:

- **Convolution operation**: \[ f[i]=\langle C[*, i:i + k - 1], H\rangle \] where \(C[*, i:i + k - 1]\) represents the \(i\)-th column to the \((i + k - 1)\)-th column of the input matrix \(C\), \(H\) is the convolution filter, and \(\langle\cdot,\cdot\rangle\) represents the inner product operation.
- **Activation function**: \[ y = \max_i\text{ReLU}(f[i]+b) \] where \(b\) is a bias term and the maximum is taken over all positions \(i\).
- **Convolution calculation**: \[ y(j)=\sum_{x = 1}^{k}h(x)\cdot c(j\cdot s - x + k - s + 1) \] where \(k - s + 1\) is an offset constant.
- **Feature map size**: \[ M=\frac{L - k}{s}+1 \] where \(L\) is the total length of the protein sequence, \(k\) is the kernel size, and \(s\) is the stride.

### Summary:

ProFamNet addresses the deficiencies of traditional methods in protein family classification by fusing 1D-CNN, BiLSTM, and the attention mechanism, improving both the efficiency and the classification performance of the model. This research provides a new and efficient method for protein classification in the field of bioinformatics.
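The convolution operation, ReLU with max pooling, and feature-map size formulas above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`feature_map_size`, `conv1d`, `relu_max_pool`), the toy matrices, and the use of floor division when \(L - k\) is not a multiple of \(s\) are all assumptions made for illustration. The sketch follows the inner-product form \(f[i]=\langle C[*, i:i + k - 1], H\rangle\), sliding the filter without flipping it.

```python
from typing import List

Matrix = List[List[float]]  # d rows (embedding dims) x L columns (sequence positions)


def feature_map_size(L: int, k: int, s: int = 1) -> int:
    """M = (L - k) / s + 1; floor division is an assumption for non-divisible cases."""
    if k > L or s < 1:
        raise ValueError("need k <= L and s >= 1")
    return (L - k) // s + 1


def conv1d(C: Matrix, H: Matrix, s: int = 1) -> List[float]:
    """Slide a d x k filter H over a d x L matrix C with stride s.

    Each output is the inner product of H with a window of k columns of C.
    """
    d, L, k = len(C), len(C[0]), len(H[0])
    M = feature_map_size(L, k, s)
    out = []
    for j in range(M):
        start = j * s  # 0-indexed first column of the window
        out.append(
            sum(C[r][start + x] * H[r][x] for r in range(d) for x in range(k))
        )
    return out


def relu_max_pool(f: List[float], b: float = 0.0) -> float:
    """y = max_i ReLU(f[i] + b)."""
    return max(max(0.0, v + b) for v in f)


# Toy example (hypothetical values): 2-dimensional embedding, sequence length 5.
C = [[1, 2, 3, 4, 5],
     [0, 1, 0, 1, 0]]
H = [[1, 0, -1],
     [1, 1, 1]]

f = conv1d(C, H, s=1)        # three windows, since (5 - 3) / 1 + 1 = 3
y = relu_max_pool(f, b=0.5)  # one scalar feature per filter
```

With stride 1 the feature map has three entries; with stride 2 it has two, matching \(M=\frac{L - k}{s}+1\). In a full model, one such scalar per filter would be passed on to the later layers (BiLSTM and attention in the architecture described above).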

A Fusion-Driven Approach of Attention-Based CNN-BiLSTM for Protein Family Classification -- ProFamNet
