BERTE: High-precision hierarchical classification of transposable elements by a transfer learning method with BERT pre-trained model and convolutional neural network

Yiqi Chen,Yang Qi,Yingfu Wu,Fuhao Zhang,Xingyu Liao,Xuequn Shang

DOI: https://doi.org/10.1101/2024.01.28.577612

2024-01-31

Abstract: Transposable Elements (TEs) are abundant repeat sequences found in living organisms. They play a pivotal role in biological evolution and gene regulation and are intimately linked to human diseases. Existing TE classification tools can classify classes, orders, and superfamilies concurrently, but they often struggle to effectively extract sequence features. This limitation frequently results in subpar classification results, especially in hierarchical classification. To tackle this problem, we introduced BERTE, a tool for TE hierarchical classification. BERTE encoded TE sequences into distinctive features that consisted of both attentional and cumulative frequency information. By leveraging the multi-head self-attention mechanism of the pre-trained BERT model, BERTE transformed sequences into attentional features. Additionally, we calculated multiple frequency vectors and concatenated them to form cumulative features. Following feature extraction, a parallel Convolutional Neural Network (CNN) model was employed as an efficient sequence classifier, capitalizing on its capability for high-dimensional feature transformation. We evaluated BERTE’s performance on filtered datasets collected from 12 eukaryotic databases. Experimental results demonstrated that BERTE could improve the F1-score at different levels by up to 21% compared to current state-of-the-art methods. Furthermore, the results indicated that not only could BERT better characterize TE sequences in feature extraction, but also that CNN was more efficient than other popular deep learning classifiers. In general, BERTE classifies TE sequences with greater precision. BERTE is available at .

Bioinformatics

What problem does this paper attempt to address?

The paper addresses weak feature extraction, and the resulting loss of accuracy, in the classification of Transposable Elements (TEs). Existing TE classification tools can assign class, order, and superfamily simultaneously, but they extract sequence features poorly, which often leads to poor performance in hierarchical classification. To tackle this problem, the authors developed BERTE, a tool for hierarchical TE classification built on the pre-trained BERT model and a Convolutional Neural Network (CNN). BERTE encodes TE sequences into features that combine attention information with cumulative k-mer frequency information. The multi-head self-attention mechanism of the pre-trained BERT model transforms sequences into attention features. In addition, the authors calculate multiple k-mer frequency vectors and concatenate them to form cumulative features. After feature extraction, a parallel CNN model serves as an efficient sequence classifier, drawing on its capability for high-dimensional feature transformation. BERTE was evaluated on filtered datasets collected from 12 eukaryotic databases, and the experimental results showed that it improved the F1-score at different hierarchical levels by up to 21% compared to current state-of-the-art methods. The results also indicated that BERT represents TE sequences better in feature extraction and that the CNN is more efficient than other popular deep learning classifiers. Overall, BERTE classifies TE sequences with greater precision.
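The cumulative k-mer feature described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k = 1, 2, 3, the per-vector normalization, and the handling of ambiguous bases are all assumptions made for illustration. The sketch counts overlapping k-mers for each k, normalizes each count vector to frequencies, and concatenates the vectors into one feature vector.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Normalized frequencies of all 4**k DNA k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    seq = seq.upper()
    # Slide a window of width k over the sequence (overlapping k-mers).
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # assumption: skip windows with N or other ambiguous bases
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def cumulative_kmer_features(seq, ks=(1, 2, 3)):
    """Concatenate the frequency vectors for each k (hypothetical default ks)."""
    features = []
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features


if __name__ == "__main__":
    vec = cumulative_kmer_features("ACGTACGT")
    print(len(vec))  # 4 + 16 + 64 = 84 entries
```

With these assumed settings the feature length is fixed at 4 + 16 + 64 = 84 regardless of sequence length, which is what lets a downstream classifier such as a CNN take TE sequences of varying length as fixed-size input.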

Comprehensive Hierarchical Classification of Transposable Elements based on Deep Learning

Retrotransposons in Plant Genomes: Structure, Identification, and Classification through Bioinformatics and Machine Learning

NeuralTE: an accurate approach for Transposable Element superfamily classification with multi-feature fusion

DeepTE: a Computational Method for De Novo Classification of Transposons with Convolutional Neural Network

Machine Learning based Prediction of Hierarchical Classification of Transposable Elements

Computational Approaches for Identification and Classification of Transposable Elements in Eukaryotic Genomes

TEfinder: A Bioinformatics Pipeline for Detecting New Transposable Element Insertion Events in Next-Generation Sequencing Data

HiTE: a Fast and Accurate Dynamic Boundary Adjustment Approach for Full-Length Transposable Element Detection and Annotation

BarcodeBERT: Transformers for Biodiversity Analysis

HBert: A Long Text Processing Method Based on BERT and Hierarchical Attention Mechanisms

Classification of LTR Retrotransposons via Interaction Prediction

BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning

Application of BERT to Enable Gene Classification Based on Clinical Evidence

Orthoptera-TElib: a library of Orthoptera transposable elements for TE annotation

Accelerating RepeatClassifier Based on Spark and Greedy Algorithm with Dynamic Upper Boundary

TEtrimmer: a novel tool to automate the manual curation of transposable elements

RepeatModeler2: automated genomic discovery of transposable element families

RelocaTE2: a High Resolution Transposable Element Insertion Site Mapping Tool for Population Resequencing

BiRNA-BERT Allows Efficient RNA Language Modeling with Adaptive Tokenization

Particular sequence characteristics induce bias in the detection of polymorphic transposable element insertions