Similarity analysis of DNA sequences through local distribution of nucleotides in strategic neighborhood

Probir Mondal, Pratyay Banerjee, Debranjan Pal, Krishnendu Basuli
2024-09-19
Abstract: We propose a new alignment-free algorithm that constructs a compact vector representation in $\mathbb{R}^{24}$ of a DNA sequence of arbitrary length. Each component of this vector is obtained from a representative sequence, the elements of which are the values realized by a function $\Gamma$. This function $\Gamma$ acts on neighborhoods of arbitrary radius that are located at strategic positions within the DNA sequence, and it carries complete information about the local distribution of nucleotide frequencies as a consequence of the uniqueness of prime factorization of integers. The algorithm exhibits linear time complexity and consumes significantly less memory. The two natural parameters characterizing the radius and location of the neighborhoods are fixed by comparing the resulting phylogenetic tree with the benchmark for full genome sequences of fish mtDNA datasets. Using these fitted parameters, the method is applied to analyze a number of genome sequences from benchmark and other standard datasets. Our algorithm proves to be computationally efficient compared to other well-known algorithms when applied to a simulated dataset.
Data Structures and Algorithms
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper proposes a new alignment-free algorithm that constructs a compact vector representation of DNA sequences of arbitrary length. This representation allows sequences of different lengths to be compared efficiently, with linear time complexity and low memory consumption. Specifically, the main objectives of the algorithm include:

1. **Efficient Comparison of DNA Sequences of Different Lengths**:
   - Existing alignment-based algorithms typically have quadratic time complexity on long sequences, making them inefficient. The alignment-free algorithm proposed in the paper aims to reduce runtime by comparing compact, fixed-length representations instead.
2. **Representation of Local Nucleotide Distribution**:
   - By recording the local distribution of nucleotide frequencies in neighborhoods at strategic positions in the DNA sequence, the algorithm captures local information that is then used for similarity analysis.
3. **Linear Time and Low Memory Consumption**:
   - The algorithm has linear time complexity and low memory consumption, making it suitable for large-scale datasets.
4. **Validation of Algorithm Effectiveness**:
   - The two parameters that set the radius and location of the neighborhoods are fixed by comparing the resulting phylogenetic tree with the benchmark for fish mtDNA full genome sequences. The method is then applied to other benchmark and standard datasets, and its computational efficiency is compared with other well-known algorithms on a simulated dataset.

Together, these aims target faster DNA sequence comparison that still works when the sequences differ in length.
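
The abstract attributes the completeness of the local frequency information to the uniqueness of prime factorization. The idea can be illustrated with a minimal Python sketch. It is not the paper's definition of $\Gamma$: the prime assigned to each nucleotide, the symmetric window, and the function names `encode_window` and `decode_counts` are all assumptions made for illustration, and the choice of strategic positions and the construction of the 24-dimensional vector are not reproduced here.

```python
from collections import Counter

# Assumed mapping of one distinct prime per nucleotide (illustrative only,
# not taken from the paper).
PRIMES = {"A": 2, "C": 3, "G": 5, "T": 7}


def encode_window(seq: str, center: int, radius: int) -> int:
    """Pack the nucleotide counts of a neighborhood into a single integer.

    The neighborhood is the slice of `seq` from center - radius to
    center + radius (clipped at the sequence ends). Characters other
    than A, C, G, T are ignored.
    """
    start = max(0, center - radius)
    window = seq[start:center + radius + 1]
    counts = Counter(window)
    code = 1
    for base, prime in PRIMES.items():
        code *= prime ** counts[base]
    return code


def decode_counts(code: int) -> dict:
    """Recover the per-nucleotide counts by dividing out each prime."""
    counts = {}
    for base, prime in PRIMES.items():
        n = 0
        while code % prime == 0:
            code //= prime
            n += 1
        counts[base] = n
    return counts


code = encode_window("ACGTACGGTA", center=4, radius=2)  # window is "GTACG"
print(code)                 # 1050 = 2 * 3 * 5**2 * 7
print(decode_counts(code))  # {'A': 1, 'C': 1, 'G': 2, 'T': 1}
```

Because every positive integer factors into primes in exactly one way, two neighborhoods receive the same code only if their nucleotide counts agree, so no frequency information is lost in the encoding. Each window is processed in time proportional to its length, which is consistent with, though not a proof of, the linear overall cost the abstract reports.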