A Survey of K-mer Methods and Applications in Bioinformatics
Camille Moeckel, Manvita Mareboina, Maxwell A. Konnaris, Candace S.Y. Chan, Ioannis Mouratidis, Austin Montgomery, Nikol Chantzi, Georgios A. Pavlopoulos, Ilias Georgakopoulos-Soares
DOI: https://doi.org/10.1016/j.csbj.2024.05.025
IF: 6.155
2024-05-22
Computational and Structural Biotechnology Journal
Abstract: The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences including nullomers and nullpeptides in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Fig. 1: Introduction to k-mers. A. All possible 2-mers, or k-mers with two nucleotides, are listed. In a specific DNA sequence, all 2-mers are recorded for frequency analysis. B. Nullomers, or possible 2-mers not in the genome, are counted by subtracting the observed 2-mers from all possible 2-mers. Nullpeptides are k-mers missing from proteomes. C. In a mutated sequence, neomers, or nullomers that resurface due to somatic mutations, can occur. AA is a neomer in this mutated sequence. D. When analyzing multiple genomes or sequences, primes, k-mers not present in any of the sequences, can be identified. There is one prime (CC) in these three sequences. Quasi-primes, or k-mers that only occur in one sequence (AA), can be identified.
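The definitions in the Fig. 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `DNA` alphabet constant, and the toy sequences are assumptions made for illustration, and the sketch treats a neomer simply as a k-mer that is absent from a reference sequence but present in a mutated one. It also assumes uppercase input and skips any k-mer containing a letter outside the alphabet (for example `N`).

```python
from collections import Counter
from itertools import product

DNA = "ACGT"  # assumed alphabet; k-mers with other letters (e.g. N) are skipped


def count_kmers(seq, k, alphabet=DNA):
    """Count every overlapping k-mer in seq that uses only alphabet letters."""
    allowed = set(alphabet)
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= allowed:
            counts[kmer] += 1
    return counts


def all_kmers(k, alphabet=DNA):
    """Enumerate all len(alphabet)**k possible k-mers."""
    return {"".join(p) for p in product(alphabet, repeat=k)}


def nullomers(seq, k, alphabet=DNA):
    """Possible k-mers minus observed k-mers (panel B)."""
    return all_kmers(k, alphabet) - set(count_kmers(seq, k, alphabet))


def neomers(reference, mutated, k, alphabet=DNA):
    """Nullomers of the reference that appear in the mutated sequence (panel C)."""
    return nullomers(reference, k, alphabet) & set(count_kmers(mutated, k, alphabet))


def primes(seqs, k, alphabet=DNA):
    """K-mers absent from every sequence in seqs (panel D)."""
    observed = set()
    for s in seqs:
        observed |= set(count_kmers(s, k, alphabet))
    return all_kmers(k, alphabet) - observed


def quasi_primes(seqs, k, alphabet=DNA):
    """Map each k-mer found in exactly one sequence to that sequence's index."""
    owners = {}
    for idx, s in enumerate(seqs):
        for kmer in count_kmers(s, k, alphabet):
            owners.setdefault(kmer, []).append(idx)
    return {kmer: idxs[0] for kmer, idxs in owners.items() if len(idxs) == 1}


# Toy sequences (hypothetical, not the ones drawn in Fig. 1).
print(sorted(neomers("ACGTACGT", "ACGTAAGT", 2)))  # ['AA', 'AG']
print(quasi_primes(["ACGT", "CGTA", "GTAA"], 2))
```

Enumerating all possible k-mers grows as 4^k for DNA, so this set-subtraction approach is only practical for small k; it is meant to make the definitions concrete rather than to scale to whole genomes.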
biochemistry & molecular biology