Abstract: Multiple sequence alignment (MSA) is a prominent method for classifying DNA sequences, yet it is hampered by its inherent computational complexity. Alignment-free methods have been developed over the past decade to compare and classify DNA sequences more efficiently than MSA. However, most alignment-free methods may lose structural and functional information about DNA sequences because they are based on feature extraction. Therefore, they may not fully reflect the actual differences among DNA sequences. Alignment-free methods that conserve information are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT). In this method, we map each DNA sequence into four binary indicator sequences and apply the DFT to the indicator sequences to transform them into the frequency domain. The Euclidean distance between the full DFT power spectra of the DNA sequences is used as the similarity distance metric. To compare the DFT power spectra of DNA sequences of different lengths, we propose an even scaling method that extends the shorter DFT power spectra to the length of the longest sequence compared. After the DFT power spectra are evenly scaled, the DNA sequences are compared in a DFT frequency space of the same dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences. The results demonstrate that the DFT-based method is an effective and accurate measure of DNA sequence similarity.
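
The pipeline described in the abstract (indicator sequences, DFT, power spectrum, even scaling, Euclidean distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical, the power spectrum is assumed to be the sum of squared DFT magnitudes over the four indicator sequences, and the even scaling step is assumed to be linear interpolation between neighboring frequencies, since the abstract does not give the exact rule.

```python
import cmath
import math


def indicator_sequences(seq):
    """Map a DNA string to four 0/1 indicator sequences, one per nucleotide."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}


def dft(x):
    """Naive O(n^2) discrete Fourier transform; adequate for short examples."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(seq):
    """Assumed definition: squared DFT magnitudes summed over the four indicators."""
    ps = [0.0] * len(seq)
    for x in indicator_sequences(seq).values():
        for k, coeff in enumerate(dft(x)):
            ps[k] += abs(coeff) ** 2
    return ps


def even_scale(ps, m):
    """Stretch a spectrum of length n to length m >= n.

    Assumption: linear interpolation between adjacent frequencies.
    """
    n = len(ps)
    if m == n:
        return list(ps)
    scaled = []
    for k in range(m):
        q = k * n / m          # fractional position in the original spectrum
        r = int(q)             # index of the left neighbor
        if r + 1 < n:
            scaled.append(ps[r] + (q - r) * (ps[r + 1] - ps[r]))
        else:
            scaled.append(ps[r])  # no right neighbor: reuse the last value
    return scaled


def spectral_distance(a, b):
    """Euclidean distance between evenly scaled power spectra of two sequences."""
    pa, pb = power_spectrum(a), power_spectrum(b)
    m = max(len(pa), len(pb))
    pa, pb = even_scale(pa, m), even_scale(pb, m)
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(pa, pb)))
```

For example, `spectral_distance("ACGTTGCA", "ACGTAGC")` compares an 8-mer with a 7-mer after stretching the shorter spectrum to length 8. For real genomes the naive `dft` would be replaced by an FFT routine; the quadratic version is used here only to keep the sketch dependency-free.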

A measure of DNA sequence dissimilarity based on free energy of nearest-neighbor interaction

A new measure for similarity searching in DNA sequences

Three distances for rapid similarity analysis of DNA sequences

A Novel Method for Similarity/dissimilarity Analysis of Protein Sequences

Similarity Evaluation of DNA Sequences Based on Frequent Patterns and Entropy

A Measure of DNA Sequence Similarity by Fourier Transform with Applications on Hierarchical Clustering

An Efficient Binomial Model-Based Measure for Sequence Comparison and Its Application

A Method for Measuring Protein Structure Similarity Based on the Molecular Inner Spatial Density Distribution

A Novel Measurement of Sequence Dissimilarity and Its Application to Phylogeny

Similarity/dissimilarity calculation methods of DNA sequences: A survey

Analysis of Similarity/dissimilarity of DNA Sequences Based on a Class of 2D Graphical Representation

Comparisons of DNA sequences based on dinucleotide

Similarity analysis of DNA sequences based on the relative entropy

A Novel Technique for Analyzing the Similarity and Dissimilarity of DNA Sequences

Analysis of Similarity of DNA Sequences Based on a Measure of Information Discrepancy

A Similarity Computing Algorithm for Proteins

A Measure For Sequence Similarity Based On Dual Nucleotides And Information Discrepancy

Analysis of Similarity/dissimilarity of DNA Sequences Based on a 3-D Graphical Representation

Numerical Characteristics of Word Frequencies and Their Application to Dissimilarity Measure for Sequence Comparison

Similarity/dissimilarity studies of protein sequences based on a new 2D graphical representation

Similarity Analysis of DNA Sequences Based on Average Mutual Information