Kmer2SNP: Reference-Free Heterozygous SNP Calling Using k-mer Frequency Distributions

Yanbo Li, Hardip Patel, Yu Lin
DOI: https://doi.org/10.1007/978-1-0716-2293-3_16
Abstract: DNA sequencing technologies enable the rapid generation of genetic profiles from many individuals. Identifying single-nucleotide polymorphisms (SNPs) between biological samples is fundamental in genetics, with applications such as disease diagnosis and association, and ancestry and relationship inference. Most methods align raw sequencing reads to a species-specific reference genome for accurate SNP calling. However, high-quality reference genomes may not be available for all species. Therefore, we developed a reference-free algorithm, Kmer2SNP, to identify heterozygous SNPs from raw sequencing reads and so facilitate genetic studies in species without a reference genome. Kmer2SNP first calculates the k-mer frequency distribution from reads to determine which k-mers contain heterozygous SNPs. Next, these k-mers are rapidly matched with each other to identify pairs of exact heterozygous k-mers, each belonging to one of the two possible haplotypes in a diploid organism. Finally, using overlapping neighboring k-mers, weights are assigned to SNP assignments; higher weights increase SNP discovery confidence.
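The first two steps described in the abstract can be illustrated with a minimal Python sketch. This is not the Kmer2SNP implementation: the function names, the fixed `low`/`high` count band used to select heterozygous k-mers, and the choice to pair k-mers that differ only at the middle base are all assumptions made for illustration. The sketch also ignores reverse complements and sequencing errors, and it omits the final weighting step.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def heterozygous_candidates(counts, low, high):
    """Keep k-mers whose count lies in the assumed heterozygous band.

    In a diploid sample, k-mers covering a heterozygous site are expected
    at roughly half the coverage of homozygous k-mers; `low` and `high`
    are hypothetical thresholds standing in for a fitted distribution.
    """
    return {kmer for kmer, c in counts.items() if low <= c <= high}


def pair_by_middle_base(candidates, k):
    """Pair candidate k-mers that are identical except at the middle base.

    Assumption: the SNP is placed at position k // 2 of the k-mer.
    """
    mid = k // 2
    groups = defaultdict(list)
    for kmer in candidates:
        # Mask the middle base so that both alleles share the same key.
        groups[kmer[:mid] + kmer[mid + 1:]].append(kmer)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(sorted(members), 2))
    return sorted(pairs)
```

For example, with two toy haplotypes `TTGACCAGTCGA` and `TTGACTAGTCGA`, each sequenced twice, `k = 5`, and a count band of 1 to 3, the only pair reported is `("ACCAG", "ACTAG")`, which carries the C/T difference at its middle position.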