A frame-based representation of genomic sequences for removing errors and rare variant detection in NGS data

Raunaq Malhotra,Manjari Mukhopadhyay,Mary Poss,Raj Acharya
DOI: https://doi.org/10.48550/arXiv.1604.04803
2016-04-17
Abstract:We propose a frame-based representation of k-mers for detecting sequencing errors and rare variants in next generation sequencing data obtained from populations of closely related genomes. Frames are sets of non-orthogonal basis functions, traditionally used in signal processing for noise removal. We define a frame for genomes and sequenced reads to consist of discrete spatial signals of every k-mer of a given size. We show that each k-mer in the sequenced data can be projected onto multiple frames and these projections are maximized for spatial signals corresponding to the k-mer's substrings. Our proposed classifier, MultiRes, is trained on the projections of k-mers as features used for marking k-mers as erroneous or true variations in the genome. We evaluate MultiRes on simulated and real viral population datasets and compare it to other error correction methods known in the literature. MultiRes has 4 to 500 times less false positives k-mer predictions compared to other methods, essential for accurate estimation of viral population diversity and their de-novo assembly. It has high recall of the true k-mers, comparable to other error correction methods. MultiRes also has greater than 95% recall for detecting single nucleotide polymorphisms (SNPs), fewer false positive SNPs, while detecting higher number of rare variants compared to other variant calling methods for viral populations. The software is freely available from the GitHub link (<a class="link-external link-https" href="https://github.com/raunaq-m/MultiRes" rel="external noopener nofollow">this https URL</a>).
Genomics
What problem does this paper attempt to address?