KmerGO: A Tool to Identify Group-Specific Sequences With k-mers

Ying Wang, Qi Chen, Chao Deng, Yiluan Zheng, Fengzhu Sun
DOI: https://doi.org/10.3389/fmicb.2020.02067
IF: 5.2
2020-08-25
Frontiers in Microbiology
Abstract: Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identification of single nucleotide variants (SNVs), gene families, microbial species, or other elements associated with each group. In our study, a sequence that is present or abundant in one group but absent or scarce in the other is considered a "group-specific" sequence. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower computing resource requirements and much shorter running time. For a 1.05 TB dataset (.fasta), KmerGO takes about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation with no more than 1 GB of memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. KmerGO uses multi-process parallel computing and provides both a graphical user interface and a command-line interface on Linux and Windows, without requiring any pre-installed supporting environments, packages, or complex configuration. The group-specific k-mers or sequences output by KmerGO can serve as inputs to other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at https://github.com/ChnMasterOG/KmerGO.
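To make the definition of a "group-specific" sequence concrete, the following is a minimal Python sketch, not KmerGO's actual implementation or statistical procedure. It treats a k-mer as group-specific when it occurs in at least a fraction `min_in` of the samples in one group and in at most a fraction `max_out` of the samples in the other. The function names and both thresholds are hypothetical choices for illustration, and reverse complements are not handled.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_fraction(samples, k):
    """Map each k-mer to the fraction of samples that contain it."""
    counts = defaultdict(int)
    for seq in samples:
        for kmer in kmers(seq, k):
            counts[kmer] += 1
    return {kmer: n / len(samples) for kmer, n in counts.items()}


def group_specific_kmers(group_a, group_b, k, min_in=0.8, max_out=0.2):
    """Return k-mers common in group_a and rare in group_b.

    min_in and max_out are illustrative thresholds (assumptions),
    not parameters taken from KmerGO.
    """
    frac_a = presence_fraction(group_a, k)
    frac_b = presence_fraction(group_b, k)
    return {
        kmer
        for kmer, fa in frac_a.items()
        if fa >= min_in and frac_b.get(kmer, 0.0) <= max_out
    }


if __name__ == "__main__":
    # Toy sequences invented for this example.
    group_a = ["ACGTACGT", "TTACGTAA"]
    group_b = ["GGGGCCCC", "CCCCGGGG"]
    print(sorted(group_specific_kmers(group_a, group_b, 4, min_in=1.0, max_out=0.0)))
```

A practical tool would count k-mers with a dedicated disk-based counter and apply a statistical test across samples rather than fixed presence thresholds; the sketch only illustrates the "present in one group, absent in the other" criterion.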
microbiology