Abstract: Bioinformatics is a rapidly developing field that enables scientific experiments through computer models and simulations. Biological databases have grown extraordinarily in recent years, so effective methods and algorithms for fast and accurate processing of biological data are extremely important. Sequence comparison is the best way to investigate and understand the biological functions of genes and the evolutionary relationships between them; it aligns two or more DNA sequences so as to maximize their level of identity and degree of similarity. This paper presents a new version of the pairwise DNA sequence alignment algorithm based on a new method called CAT. The new version takes into account a dependency on the previous match and the closest neighbor, which increases the uniqueness of the CAT profile and reduces possible collisions, i.e., two or more sequences with the same CAT profile. This makes the proposed algorithm suitable for quickly finding the exact match of a specific DNA sequence in a large set of DNA data. So that the profiles can be used as sequence metadata, CAT profiles are generated once, before the data are uploaded to the database. The proposed algorithm consists of two main stages: calculation of the CAT profile, which depends on the chosen benchmark sequences, and sequence comparison using the calculated CAT profiles. Improvements in the generation of CAT profiles are described in detail in this paper. Block diagrams, pseudocode tables, and figures have been updated to reflect the new version and the experimental results. Experiments were carried out with the new version of the CAT method for DNA sequence alignment on different datasets. New experimental results on collisions, speed, and efficiency of the new implementation are presented.
Experiments comparing performance with the Needleman–Wunsch algorithm were re-executed with the new version of the algorithm to confirm that performance is preserved. The performance of the proposed CAT-based algorithm was also analyzed against the Knuth–Morris–Pratt algorithm, which has a complexity of O(n) and is widely used for searching biological data. The impact of prior matching dependencies on the uniqueness of the generated CAT profiles is investigated. The sequence alignment experiments show that the proposed CAT-based algorithm exhibits only a minimal deviation, which is negligible wherever such deviation is acceptable in exchange for better performance. The execution time of the CAT algorithm remains stable and is unaffected by the length of the analyzed sequences. Hence, the primary benefit of the proposed approach is rapid processing in large-scale sequence alignment, a task for which traditional exact algorithms would require significantly more time.
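For orientation, the Knuth–Morris–Pratt baseline mentioned above can be sketched as follows. This is a generic textbook version, not the implementation used in the experiments; the function names `build_failure_table` and `kmp_search` and the example sequences are illustrative assumptions. It shows why the search is linear in the text length: the text index never moves backwards, and mismatches are resolved through a precomputed table over the pattern.

```python
# Minimal sketch of Knuth-Morris-Pratt exact matching (textbook form).
# Function names and example data are hypothetical, not taken from the paper.

def build_failure_table(pattern: str) -> list[int]:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    failure = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse the partial match, never rescan text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]  # continue, allowing overlapping matches
    return matches


if __name__ == "__main__":
    print(kmp_search("ACGTACGTAC", "ACGT"))
```

Each text character is examined once, and the total number of fallbacks is bounded by the number of successful character matches, so the scan over a sequence of length n takes O(n) time after an O(m) preprocessing pass over a pattern of length m.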
