Abstract: The comparison of DNA sequences is of great significance in genomic analysis. Although traditional multiple sequence alignment (MSA) is widely used for evolutionary analysis, optimally aligning k sequences becomes computationally intractable as k increases, owing to the intrinsic computational complexity of MSA. Numerous k-mer alignment-free methods have been proposed, but existing ones may not truly capture the contextual structure of the sequences. In this study, we present a novel contextual k-mer alignment-free method, called kmer2vec, in which sequence k-mers are semantically embedded as word2vec vectors, an essential technique in natural language processing. The method thereby converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares sequences in that space. Because the word2vec vectors are trained on the contextual relationships of k-mers in the genomes, the method may extract valuable structural information from the sequences and properly reflect the relationships among them. The proposed method is optimized over the word2vec training parameters and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method for phylogenetic tree construction and species clustering. The method runs much faster than MSA, and the phylogenetic relationships it constructs are more accurate than those from the conventional k-mer alignment-free method. This approach can therefore provide new perspectives on phylogeny and evolution and makes it possible to analyze large genomes. In addition, we discuss special parameterization in the construction of the k-mer word2vec embedding. Combining kmer2vec with clustering methods also yields an effective tool for rapid SARS-CoV-2 typing.
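
The pipeline described in the abstract (split each sequence into k-mers, map each k-mer to a vector, place the sequence as a point in that vector space, compare points) can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the published method: the abstract does not say how k-mer vectors are combined into a sequence point, so a plain average is used here; `toy_embedding` is a hypothetical hash-based placeholder standing in for vectors that would actually be trained with word2vec on the genomes; and cosine distance, `k=3`, and `dim=8` are illustrative choices only.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence, skipping windows with non-ACGT symbols."""
    seq = seq.upper()
    alphabet = set("ACGT")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= alphabet]


def toy_embedding(kmer, dim=8):
    """Hypothetical placeholder vector for a k-mer (NOT a trained word2vec vector).

    A real pipeline would look the k-mer up in a trained word2vec model instead.
    Uses the first `dim` bytes of a SHA-256 digest, so dim must be at most 32.
    """
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    return [b / 255.0 - 0.5 for b in digest[:dim]]


def sequence_vector(seq, k=3, dim=8, embed=toy_embedding):
    """Average the k-mer vectors of a sequence (assumed aggregation) into one point."""
    words = kmers(seq, k)
    if not words:
        raise ValueError("sequence has no valid k-mers for this k")
    total = [0.0] * dim
    for word in words:
        for j, x in enumerate(embed(word, dim)):
            total[j] += x
    return [t / len(words) for t in total]


def cosine_distance(u, v):
    """Cosine distance between two sequence points (an illustrative choice of metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    a = sequence_vector("ACGTACGTTAGC")
    b = sequence_vector("ACGTACGATAGC")
    print(cosine_distance(a, b))
```

With trained embeddings substituted for `toy_embedding`, the pairwise distances between sequence points could be passed to a tree-building or clustering routine, which is the use the abstract describes.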

Similarity analysis of DNA sequences based on k-word

Kmer2vec: A Novel Method for Comparing DNA Sequences by Word2vec Embedding

Similarity Evaluation of DNA Sequences Based on Frequent Patterns and Entropy

Structure Matrix Based Similarity Model for DNA Sequences

A Novel Model for DNA Sequence Similarity Analysis Based on Graph Theory

Clustering DNA Sequences by Feature Vectors

DNA sequence comparison by a novel probabilistic method

A Measure of DNA Sequence Similarity by Fourier Transform with Applications on Hierarchical Clustering

A New Distribution Vector and Its Application in Genome Clustering.

K-mer Natural Vector and Its Application to the Phylogenetic Analysis of Genetic Sequences.

Nucleotide Amino Acid K-Mer Vector: an Alignment-Free Method for Comparing Genomic Sequences

A novel fast vector method for genetic sequence comparison

A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications.

Positional Correlation Natural Vector: A Novel Method for Genome Comparison.

Two Dimensional Yau-Hausdorff Distance with Applications on Comparison of DNA and Protein Sequences

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method.

An advanced approach for DNA sequencing and similarities analysis on the basis of groupings of nucleotide bases

A Novel Method for Comparative Analysis of DNA Sequences by Ramanujan-Fourier Transform

An efficient numerical representation of genome sequence: natural vector with covariance component

A New Efficient Method for Analyzing Fungi Species Using Correlations Between Nucleotides

A New Method to Cluster Genomes Based on Cumulative Fourier Power Spectrum.