Fleximer: Accurate Quantification of RNA-Seq via Variable-Length k-mers

Chelsea J.-T. Ju, Ruirui Li, Zhengliang Wu, Jyun-Yu Jiang, Zhao Yang, Wei Wang
DOI: https://doi.org/10.1145/3107411.3107444
2017-01-01
Abstract: The advent of RNA-Seq has made it possible to quantify the expression of many transcripts simultaneously and at large scale. This technology generates small fragments of each transcript sequence, known as sequencing reads. As the first step of data analysis towards expression quantification, most existing methods align these reads to a reference genome or transcriptome to establish their origins. However, read alignment is computationally costly. Recently, a series of methods have been proposed to perform lightweight quantification in an alignment-free manner. These methods utilize the notion of k-mers, which are short consecutive sequences representing the signatures of each transcript, to estimate relative abundance from RNA-Seq reads. Current k-mer-based approaches use a set of fixed-size k-mers; however, the true signatures of a transcript may not all share a single fixed length. In this paper, we demonstrate the importance of k-mer selection in transcript abundance estimation. We propose a novel method, Fleximer, to efficiently discover and select an optimal set of k-mers with flexible lengths. Using both simulated and real datasets, we show that, with fewer k-mers, Fleximer is able to cover a similar number of reads as Sailfish and Kallisto. The selected k-mers are more distinguishing and thus substantially reduce the errors in transcript abundance estimation.
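To make the notion of a k-mer "signature" concrete, the following minimal Python sketch finds, for a given k, the k-mers that occur in exactly one transcript. It is an illustration only and is not Fleximer's selection algorithm; the function names (`kmers`, `unique_kmers`) and the toy sequences are assumptions made up for this example.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(transcripts, k):
    """Map each transcript name to the k-mers that occur in no other transcript."""
    owners = defaultdict(set)  # k-mer -> names of transcripts containing it
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)

    signatures = {name: set() for name in transcripts}
    for kmer, names in owners.items():
        if len(names) == 1:  # present in exactly one transcript
            signatures[next(iter(names))].add(kmer)
    return signatures


# Hypothetical toy transcripts, chosen so that one fixed k is not enough.
transcripts = {"t1": "AACC", "t2": "AACG", "t3": "TACC"}

for k in (3, 4):
    print(k, unique_kmers(transcripts, k))
```

In this toy input, every 3-mer of `t1` also appears in `t2` or `t3`, so `t1` has no distinguishing 3-mer, while the 4-mer `AACC` identifies it uniquely. This is the kind of situation in which a single fixed k-mer size can miss a transcript's signature.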