BERTE: High-precision hierarchical classification of transposable elements by a transfer learning method with BERT pre-trained model and convolutional neural network

Yiqi Chen, Yang Qi, Yingfu Wu, Fuhao Zhang, Xingyu Liao, Xuequn Shang
DOI: https://doi.org/10.1101/2024.01.28.577612
2024-01-31
Abstract: Transposable Elements (TEs) are abundant repeat sequences found in living organisms. They play a pivotal role in biological evolution and gene regulation and are intimately linked to human diseases. Existing TE classification tools can classify classes, orders, and superfamilies concurrently, but they often struggle to effectively extract sequence features. This limitation frequently results in subpar classification results, especially in hierarchical classification. To tackle this problem, we introduced BERTE, a tool for TE hierarchical classification. BERTE encoded TE sequences into distinctive features that consisted of both attentional and cumulative frequency information. By leveraging the multi-head self-attention mechanism of the pre-trained BERT model, BERTE transformed sequences into attentional features. Additionally, we calculated multiple frequency vectors and concatenated them to form cumulative features. Following feature extraction, a parallel Convolutional Neural Network (CNN) model was employed as an efficient sequence classifier, capitalizing on its capability for high-dimensional feature transformation. We evaluated BERTE’s performance on filtered datasets collected from 12 eukaryotic databases. Experimental results demonstrated that BERTE could improve the F1-score at different levels by up to 21% compared to current state-of-the-art methods. Furthermore, the results indicated not only that BERT could better characterize TE sequences in feature extraction, but also that CNN was more efficient than other popular deep learning classifiers. In general, BERTE classifies TE sequences with greater precision. BERTE is available at .
Bioinformatics
What problem does this paper attempt to address?
The paper addresses weak feature extraction, and the resulting loss of accuracy, in the classification of Transposable Elements (TEs). Existing TE classification tools can assign class, order, and superfamily simultaneously, but they extract sequence features poorly, which often degrades hierarchical classification. To tackle this problem, the authors developed BERTE, a hierarchical TE classifier built on a pre-trained BERT model and a Convolutional Neural Network (CNN). BERTE encodes each TE sequence into features that combine attention information with cumulative k-mer frequency information. The attention features come from the multi-head self-attention mechanism of the pre-trained BERT model. The cumulative features are formed by computing multiple k-mer frequency vectors and concatenating them. After feature extraction, a parallel CNN model serves as the sequence classifier, drawing on its capability for high-dimensional feature transformation. BERTE was evaluated on filtered datasets collected from 12 eukaryotic databases, where it improved the F1-score at different hierarchical levels by up to 21% compared to current state-of-the-art methods. The results also indicated that BERT characterizes TE sequences better during feature extraction and that the CNN is more efficient than other popular deep learning classifiers. Overall, BERTE classifies TE sequences with greater precision.
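To illustrate the cumulative k-mer frequency features described above, the following is a minimal Python sketch and not the authors' implementation. The function names (`kmer_frequencies`, `cumulative_kmer_features`), the choice of k = 1, 2, 3, the normalization by the number of valid k-mers, and the skipping of k-mers that contain ambiguous bases are all assumptions made for illustration; the text above only states that multiple k-mer frequency vectors are computed and concatenated.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; ambiguous bases such as N are skipped


def kmer_frequencies(seq, k):
    """Return the normalized frequencies of all 4**k k-mers in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    seq = seq.upper()
    # Slide a window of width k over the sequence and count each valid k-mer.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # ignore windows containing non-ACGT characters
            counts[position] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [count / total for count in counts]


def cumulative_kmer_features(seq, ks=(1, 2, 3)):
    """Concatenate the k-mer frequency vectors for every k in ks (hypothetical default)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features


if __name__ == "__main__":
    vector = cumulative_kmer_features("ACGTACGTTTGA")
    print(len(vector))  # 4 + 16 + 64 entries for k = 1, 2, 3
```

With the assumed default of k = 1, 2, 3, each sequence maps to a fixed-length vector of 4 + 16 + 64 = 84 values regardless of sequence length, which is what allows such a vector to be combined with other features before a downstream classifier.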