Effect of k-tuple length on sample-comparison with high-throughput sequencing data

Ying Wang, Xiaoye Lei, Shun Wang, Zicheng Wang, Nianfeng Song, Feng Zeng, Ting Chen
DOI: https://doi.org/10.1016/j.bbrc.2015.11.094
IF: 3.1
2016-01-01
Biochemical and Biophysical Research Communications
Abstract: High-throughput metagenomic sequencing offers a powerful technique for comparing microbial communities. Alignment-free models based on short k-tuples (k = 2–10 bp) require no reference sequences and have yielded promising results. Short k-tuples describe the overall statistical distribution of a sample, but they struggle to capture the specific characteristics of an individual microbial community. Longer k-tuples carry more information. However, because the frequency vector of long k-tuples (k ≥ 30 bp) is sparse, the statistical measures designed for short k-tuples are not applicable. In our study, we treated each tuple as a meaningful word and each sequencing dataset as a document composed of those words. The comparison between two sequencing datasets is therefore handled as “topic analysis of documents” in text mining. We designed a pipeline that uses long k-tuple features and algorithms from text mining and pattern recognition to compare metagenomic samples. The pipeline is available at http://culotuple.codeplex.com/.
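To make the "tuples as words, samples as documents" idea concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function names `ktuple_counts` and `cosine_similarity` are hypothetical, the reads are toy strings, and plain cosine similarity stands in for the topic-analysis step, which the abstract does not specify. It only illustrates how a sample can be stored as a sparse bag of k-tuples, which is what keeps long tuples (k ≥ 30) tractable.

```python
from collections import Counter
from math import sqrt


def ktuple_counts(reads, k):
    """Count every length-k substring (k-tuple) across a sample's reads.

    The Counter stores only tuples that actually occur, so the
    representation stays sparse even when 4**k is astronomically large.
    """
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters).

    Assumption: used here as a simple stand-in for the document
    comparison step; the paper itself uses topic analysis.
    """
    # Iterate over the smaller vector to keep the dot product cheap.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy usage with made-up reads and a small k for readability.
sample_a = ktuple_counts(["ACGTACGT", "TTGACA"], k=4)
sample_b = ktuple_counts(["ACGTACGA", "GGGGCC"], k=4)
similarity = cosine_similarity(sample_a, sample_b)
```

In a realistic setting k would be 30 or more and the reads would come from FASTQ files; the dictionary-based representation is the same, and only the comparison step would be replaced by a topic model.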