De Novo Repeat Detection Based on the Third Generation Sequencing Reads

Xingyu Liao, Xiankai Zhang, Fang-Xiang Wu, Jianxin Wang
DOI: https://doi.org/10.1109/bibm47256.2019.8982959
2019-01-01
Abstract: Repetitive sequences are fragments that appear at more than one location in a genome. Numerous studies have shown that repetitive sequences play indispensable roles in the evolution, inheritance, variation, gene expression, transcriptional regulation, chromosome construction, and physiological metabolism of organisms. In many sequence and genome analyses, such as read alignment, de novo assembly, and genome annotation, repetitive sequences can pose major challenges. Detecting and classifying repeats is one of the main steps of genome sequence analysis in bioinformatics. However, most existing de novo detection methods struggle to mark repetitive regions satisfactorily in terms of both size and accuracy, because next-generation sequencing (NGS) reads are too short to identify long repeats and raw single-molecule sequencing (SMS) long reads have high error rates. In this study, we present a new de novo repeat detection method called DLR (Detection of Long Repeats), based on PacBio long reads. DLR first converts all long reads into unique k-mers of a certain length and screens out the high-frequency k-mers. These high-frequency k-mers are then aligned to the long reads using multiple sequence alignment, and the regions of the long reads covered by high-frequency k-mers are recorded as high-frequency regions. Finally, recorded high-frequency regions with inclusion relations are merged to obtain the final repetitive sequences. The experimental results show that DLR achieves the best results in terms of effective size and accuracy compared with other existing algorithms.
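The three steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes error-free reads and substitutes exact k-mer matching for the multiple sequence alignment step, and the function names and the parameters `k`, `min_count`, and `min_length` are hypothetical choices made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Keep the k-mers seen at least min_count times (hypothetical threshold)."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


def covered_regions(read, frequent, k):
    """Return half-open intervals (start, end) of the read covered by frequent k-mers.

    Exact matching stands in for alignment here; overlapping or adjacent
    hits are fused into a single interval.
    """
    regions = []
    for i in range(len(read) - k + 1):
        if read[i:i + k] in frequent:
            start, end = i, i + k
            if regions and start <= regions[-1][1]:
                regions[-1][1] = max(regions[-1][1], end)
            else:
                regions.append([start, end])
    return [(s, e) for s, e in regions]


def detect_repeats(reads, k=4, min_count=3, min_length=8):
    """Toy pipeline: count k-mers, mark covered regions, merge by inclusion."""
    frequent = high_frequency_kmers(count_kmers(reads, k), min_count)
    candidates = set()
    for read in reads:
        for s, e in covered_regions(read, frequent, k):
            if e - s >= min_length:
                candidates.add(read[s:e])
    # Inclusion merge: drop any candidate contained in a longer kept one.
    kept = []
    for seq in sorted(candidates, key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


reads = [
    "TTTACGTTGCAGGG",
    "CCCACGTTGCAAAA",
    "GAGACGTTGCATCT",
]
print(detect_repeats(reads))  # ['ACGTTGCA']
```

In this toy input the fragment `ACGTTGCA` occurs in all three reads with different flanking bases, so only its k-mers pass the frequency threshold. Real PacBio reads contain many insertion and deletion errors, which is why an alignment-based step is needed in practice rather than exact lookup.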