A Fusion-Driven Approach of Attention-Based CNN-BiLSTM for Protein Family Classification -- ProFamNet

Bahar Ali, Anwar Shah, Malik Niaz, Musadaq Mansoord, Sami Ullah, Muhammad Adnan
2024-10-22
Abstract: Advanced automated AI techniques allow us to classify protein sequences and discern their biological families and functions. Conventional approaches for classifying these protein families often focus on extracting N-Gram features from the sequences while overlooking crucial motif information and the interplay between motifs and neighboring amino acids. Recently, convolutional neural networks have been applied to amino acid and motif data, even with a limited dataset of well-characterized proteins, resulting in improved performance. This study presents a model for classifying protein families using the fusion of 1D-CNN, BiLSTM, and an attention mechanism, which combines spatial feature extraction, long-term dependencies, and context-aware representations. The proposed model (ProFamNet) achieved superior model efficiency with 450,953 parameters and a compact size of 1.72 MB, outperforming the state-of-the-art model with 4,578,911 parameters and a size of 17.47 MB. Further, we achieved a higher F1 score (98.30% vs. 97.67%) with more instances (271,160 vs. 55,077) in fewer training epochs (25 vs. 30).
Quantitative Methods, Machine Learning
What problem does this paper attempt to address?
This paper addresses a weakness in protein sequence classification: traditional methods mostly extract N-gram features from the sequence and ignore important motif information and the interaction between motifs and neighboring amino acids. This limits both the accuracy and the efficiency of protein family classification. To address this, the authors propose ProFamNet, a deep learning model that fuses a 1D-CNN, a BiLSTM, and an attention mechanism, combining spatial feature extraction, long-range dependencies, and context-aware representations to improve classification performance.

### Main contributions of the paper:

1. **Introduction of a new model**: A new model named ProFamNet is proposed for protein family classification.
2. **Fusion of multiple techniques**: A 1D-CNN, a BiLSTM, and an attention mechanism are combined to exploit the strengths of each.
3. **Improvement of model efficiency**: Compared with the existing state-of-the-art model, ProFamNet has fewer parameters (450,953 vs. 4,578,911) and a smaller model size (1.72 MB vs. 17.47 MB).
4. **Reduction of training time and resource consumption**: Using fewer layers and fewer training epochs (25 vs. 30) reduces training time and resource consumption.
5. **Achievement of higher F1 scores on multiple labels**: A higher F1 score (98.30% vs. 97.67%) is achieved on more instances (271,160 vs. 55,077), demonstrating the effectiveness of the model in bioinformatics.

### Model architecture:

1. **Encoding module**: Each amino acid is mapped to a numerical value, so a sequence becomes an integer array. Amino acids are represented by 24 distinct numbers.
2. **Embedding module**: Each encoded amino acid value is converted into a continuous vector for subsequent processing.
3. **1D-CNN module**: Convolution operations extract non-linear features from the protein sequence, discover motifs, and strengthen high-level associations.
4. **BiLSTM module**: The input is processed in both directions to capture long-term dependencies and context information in the sequence.
5. **Attention mechanism**: The attention mechanism sharpens the model's focus on important features and improves classification performance.

### Mathematical formulas:

- **Convolution operation**: \[ f[i] = \langle C[*, i:i + k - 1], H \rangle \] where \(C[*, i:i + k - 1]\) denotes columns \(i\) through \(i + k - 1\) of the input matrix \(C\), \(H\) is the filter applied to that window, and \(\langle \cdot, \cdot \rangle\) denotes the inner product.
- **Activation function**: \[ y = \max_i \text{ReLU}(f[i] + b) \]
- **Convolution calculation**: \[ y(j) = \sum_{x = 1}^{k} h(x) \cdot c(j \cdot s - x + k - s + 1) \] where \(k - s + 1\) is an offset constant.
- **Feature map size**: \[ M = \frac{L - k}{s} + 1 \] where \(L\) is the total length of the protein sequence, \(k\) is the kernel size, and \(s\) is the stride.

### Summary:

ProFamNet addresses the shortcomings of traditional methods in protein family classification by fusing a 1D-CNN, a BiLSTM, and an attention mechanism, improving both model efficiency and classification performance. This research provides a new and efficient method for protein classification in bioinformatics.
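
### Worked example of the convolution formulas:

The formulas in the mathematical section can be checked on a toy input. The sketch below is not the authors' implementation: it is a minimal pure-Python illustration of the strided convolution \(y(j)\), the feature map size \(M\), and the ReLU-then-max step, applied to a single-channel sequence. The function names, the toy values, and the use of floor division when \(s\) does not divide \(L - k\) evenly are all assumptions made for illustration.

```python
def feature_map_size(L, k, s):
    """M = (L - k) / s + 1; floor division is an assumption for uneven cases."""
    return (L - k) // s + 1


def conv1d(c, h, s):
    """y(j) = sum_{x=1..k} h(x) * c(j*s - x + k - s + 1), with 1-based j and x."""
    L, k = len(c), len(h)
    M = feature_map_size(L, k, s)
    y = []
    for j in range(1, M + 1):
        total = 0.0
        for x in range(1, k + 1):
            idx = j * s - x + k - s + 1  # 1-based position in c
            total += h[x - 1] * c[idx - 1]  # shift to Python's 0-based lists
        y.append(total)
    return y


def relu_max(f, b):
    """y = max_i ReLU(f[i] + b): add the bias, clip at zero, keep the maximum."""
    return max(max(0.0, v + b) for v in f)


# Toy values (hypothetical, chosen only to make the arithmetic easy to follow).
c = [1, 2, 3, 4, 5, 6, 7]   # input of length L = 7
h = [1, 0, -1]              # kernel of size k = 3
s = 2                       # stride

feature_map = conv1d(c, h, s)
print(feature_map_size(len(c), len(h), s))  # 3
print(feature_map)                          # [2.0, 2.0, 2.0]
print(relu_max(feature_map, b=-0.5))        # 1.5
```

For \(j = 1\) the index \(j \cdot s - x + k - s + 1\) runs from 3 down to 1 as \(x\) goes from 1 to 3, so the kernel is applied in reversed order over the first window. Each later \(j\) shifts that window by \(s\) positions, and the last window ends at position \(L\) only when \(s\) divides \(L - k\).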