Abstract:
Background: Long terminal repeats (LTRs) are important parts of LTR retrotransposons and retroviruses, which are found in high copy numbers in the majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems. To enhance our understanding of these key components, we focused on the contrasts between LTRs of various retrotransposon families and other genomic regions. Furthermore, this approach can be utilized for the classification and prediction of LTRs.
Results: We used machine learning methods suitable for DNA sequence classification and applied them to a large dataset of plant LTR retrotransposon sequences. We trained three types of machine learning models: (i) a traditional ensemble model (gradient boosting classifier, GBC), (ii) hybrid CNN-LSTM models, and (iii) a pre-trained transformer-based model (DNABERT) using k-mer sequence representation. All three approaches were successful in classifying and isolating LTRs in this data, and they provided valuable insights into LTR sequence composition. The best F1 score achieved for LTR detection was 0.85, obtained with the CNN-LSTM hybrid network model. The most accurate classification task was superfamily classification (F1=0.89), while the least accurate was family classification (F1=0.74). The trained models were subjected to explainability analysis. SHAP positional analysis identified a mixture of features, many of which had a preferred absolute position within the LTR and/or were biologically relevant, such as a centrally positioned TATA-box and TG..CA patterns around both LTR edges.
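As an illustration of the k-mer sequence representation and the F1 score mentioned in the Results, the following is a minimal sketch and not the authors' pipeline. The function names, the choice of k=6, and the stride of 1 are assumptions made for this example; the abstract does not state which k was used.

```python
def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    k=6 is an assumed value for illustration only.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy input, not taken from the dataset described above.
tokens = kmer_tokens("TGTTAGCA", k=6)
```

A sequence of length n yields n - k + 1 overlapping tokens, so neighbouring tokens share k - 1 bases.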
Conclusions: Our results show that the models used here recognized biologically relevant motifs, such as core promoter elements in the LTR detection task, and a development- and stress-related subclass of transcription factor binding sites in the family classification task. Explainability analysis also highlighted the importance of the 5' and 3' edges for LTR identity and revealed the need to analyze more than just the dinucleotides at these ends. Our work shows the applicability of machine learning models to regulatory sequence analysis and classification, and demonstrates the important role of the identified motifs in LTR detection.
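The TG..CA edge pattern and the point about looking beyond the terminal dinucleotides can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names and the default window width of 10 are hypothetical, and the check is a plain string test, not the SHAP-based analysis described above.

```python
def has_tg_ca_edges(ltr: str) -> bool:
    """Return True if the sequence starts with TG and ends with CA."""
    ltr = ltr.upper()
    return ltr.startswith("TG") and ltr.endswith("CA")


def edge_windows(ltr: str, width: int = 10) -> tuple[str, str]:
    """Return the first and last `width` bases of the sequence.

    Inspecting a window wider than two bases reflects the observation
    that more than the terminal dinucleotides should be analyzed.
    The default width of 10 is an assumption for illustration.
    """
    ltr = ltr.upper()
    return ltr[:width], ltr[-width:]


# Hypothetical toy sequence, not a real LTR.
example = "TGTTAGGGCCCTAACA"
five_prime, three_prime = edge_windows(example, width=4)
```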

Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning.