Abstract:

Background: Repetitive elements make up a large part of eukaryotic genomes. For example, about 40 to 50% of the human, mouse and rat genomes are repetitive. Identifying and classifying repeats is therefore an important step in genome annotation. This annotation step is traditionally performed using alignment-based methods, either in a de novo approach or by aligning the genome sequence to a species-specific set of repetitive sequences. Recently, Li (Bioinformatics 35:4408-4410, 2019) developed a novel software tool, dna-brnn, to annotate repetitive sequences using a recurrent neural network trained on sample annotations of repetitive elements.

Results: We have developed the methods of dna-brnn further and engineered a new software tool, DeepGRP. It combines the basic concepts of Li (Bioinformatics 35:4408-4410, 2019) with the attention mechanism, a technique developed for neural machine translation, for the task of nucleotide-level annotation of repetitive elements. An evaluation on the human genome shows a 20% improvement of the Matthews correlation coefficient for the predictions delivered by DeepGRP when compared to dna-brnn. DeepGRP predicts two additional classes of repeats (compared to dna-brnn) and is able to transfer repeat annotations, learned from RepeatMasker-based training data, to a different species (mouse). Additionally, we could show that DeepGRP predicts repeats that are annotated in the Dfam database but not by RepeatMasker. DeepGRP is highly scalable due to its implementation in the TensorFlow framework. For example, the GPU-accelerated version of DeepGRP is approx. 1.8 times faster than dna-brnn, approx. 8.6 times faster than RepeatMasker and over 100 times faster than HMMER searching for models of the Dfam database.

Conclusions: By incorporating methods from neural machine translation, DeepGRP achieves a consistent improvement in the quality of the predictions compared to dna-brnn. Improved running times are obtained by employing TensorFlow as the implementation framework and by using GPUs. By incorporating two additional classes of repeats, DeepGRP provides more complete annotations, which were evaluated against three state-of-the-art tools for repeat annotation.
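The Matthews correlation coefficient used in the evaluation can be illustrated with a minimal sketch. It assumes a simplified binary setting in which every nucleotide is labeled 1 (repeat) or 0 (non-repeat); the abstract does not describe the exact evaluation setup, and DeepGRP itself predicts several repeat classes. The function name and the example labels below are hypothetical and are not taken from the DeepGRP code base.

```python
import math


def matthews_corrcoef(truth, pred):
    """Binary MCC for two equal-length sequences of 0/1 labels.

    Assumption for this sketch: 1 marks a nucleotide annotated as
    repeat, 0 marks a non-repeat nucleotide.
    """
    if len(truth) != len(pred):
        raise ValueError("label sequences must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when a row or column of the matrix is empty.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical per-nucleotide labels for an 8 bp window.
truth = [0, 0, 1, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0, 1]
print(matthews_corrcoef(truth, pred))  # 0.5
```

Unlike plain accuracy, the coefficient uses all four confusion-matrix cells, which makes it a reasonable choice when repeat and non-repeat positions are unevenly distributed.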

Sequence Repetitiveness Quantification and De Novo Repeat Detection by Weighted K-Mer Coverage.

A New Statistic for Efficient Detection of Repetitive Sequences

Diffreps: Detecting Differential Chromatin Modification Sites from ChIP-seq Data with Biological Replicates.

Accurate Detection of Tandem Repeats from Error-Prone Sequences with EquiRep

A Fuzzy sequencer for rapid DNA fragment counting and genotyping

Exploiting protein language model sequence representations for repeat detection

DnaReSM: A Multi-Supports-based DNA Repetitive Sequences Mining Algorithm

Identification of repeats in DNA sequences using nucleotide distribution uniformity

Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences

DNA Origami-Enabled Gene Localization of Repetitive Sequences

ULTRA-Effective Labeling of Repetitive Genomic Sequence

De novo identification of replication-timing domains in the human genome by deep learning.

Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor

Efficient Search of Circular Repeats and MicroDNA Reintegration in DNA Sequences.

RecombineX: A generalized computational framework for automatic high-throughput gamete genotyping and tetrad-based recombination analysis

A Fast Exact Repeats Search Algorithm for Genome Analysis.

RepeatFiller newly identifies megabases of aligning repetitive sequences and improves annotations of conserved non-exonic elements

DeepGRP: engineering a software tool for predicting genomic repetitive elements using Recurrent Neural Networks with attention

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Assembly of repetitive regions using next-generation sequencing data

Kmer2SNP: Reference-Free Heterozygous SNP Calling Using k-mer Frequency Distributions