Abstract:

BACKGROUND: Most existing methods for phylogenetic analysis involve developing an evolutionary model and then using some type of computational algorithm to perform multiple sequence alignment. There are two problems with this approach: (1) different evolutionary models can lead to different results, and (2) the computation time required for multiple alignments makes it impossible to analyse the phylogeny of a whole genome. This motivates us to create a new approach to characterize genetic sequences.

METHODOLOGY: To each DNA sequence, we associate a natural vector based on the distributions of nucleotides. This produces a one-to-one correspondence between the DNA sequence and its natural vector. We define the distance between two DNA sequences to be the distance between their associated natural vectors. This creates a genome space with a biological distance, which makes global comparison of genomes with the same topology possible. We use our proposed method to analyze the genomes of the new influenza A (H1N1) virus, human rhinoviruses (HRV) and mammalian mitochondria. The results show that a triple-reassortant swine virus circulating in North America and the Eurasian swine virus belong to the lineage of the influenza A (H1N1) virus. For the HRV and mammalian mitochondrial genomes, the results coincide with biologists' analyses.

CONCLUSIONS: Our approach provides a powerful new tool for analyzing and annotating genomes and their phylogenetic relationships. Whole or partial genomes can be handled more easily and more quickly than with multiple alignment methods. Once a genome space has been constructed, it can be stored in a database. There is no need to reconstruct the genome space for subsequent applications, whereas in multiple alignment methods, realignment is needed to add new sequences. Furthermore, one can make a global comparison of all genomes simultaneously, which no other existing method can achieve.
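The mapping from a sequence to a vector, and the distance between two sequences, can be illustrated with a minimal Python sketch. The abstract does not list the components of the natural vector, so the choice below is an assumption: for each nucleotide, the vector holds its count, its mean position (1-based), and a second central moment of its positions scaled by the count and the sequence length. The function names `natural_vector` and `distance` are hypothetical, and Euclidean distance is likewise assumed. This truncated 12-dimensional form is not guaranteed to be one-to-one; it only shows the idea of comparing sequences without alignment.

```python
import math

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return 12 numbers: (count, mean position, scaled variance) for A, C, G, T.

    Assumed form for illustration; the abstract does not specify the components.
    """
    seq = seq.upper()
    n = len(seq)
    vec = []
    for base in NUCLEOTIDES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Nucleotide absent: contribute zeros (an assumption of this sketch)
            vec.extend([0.0, 0.0, 0.0])
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, scaled by count and sequence length
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        vec.extend([float(n_k), mu_k, d2_k])
    return vec


def distance(seq_a, seq_b):
    """Euclidean distance between the vectors of two sequences (assumed metric)."""
    return math.dist(natural_vector(seq_a), natural_vector(seq_b))


if __name__ == "__main__":
    print(natural_vector("ACGTAC"))
    print(distance("ACGTAC", "ACGTTC"))
```

Because each vector is computed from its own sequence alone, adding a new genome to a stored collection requires only one new vector, which matches the point made in the conclusions about not needing realignment.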

Efficient computation of shortest absent words in a genomic sequence

Gene Sequence Alignment on a Public Computing Platform

The Bulk and The Tail of Minimal Absent Words in Genome Sequences

An algorithm for rapid noncoding RNA sequence-structure alignment

Efficient privacy-preserving variable-length substring match for genome sequence

Reference-based genome compression using the longest matched substrings with parallelization consideration

Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

DNA sequences alignment method using sparse index on pan-genome graph

Detecting Differentially Expressed Genes by Smoothing Effect of Gene Length on Variance Estimation

A DNA Sequence Alignment Tool Based on BWA and Data Mining

A Fast Exact Repeats Search Algorithm for Genome Analysis.

keeSeek: searching distant non-existing words in genomes for PCR-based applications

Perm: Efficient Mapping of Short Sequencing Reads with Periodic Full Sensitive Spaced Seeds

Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.

CGAP-align: a high performance DNA short read alignment tool.

Computational Genome Analysis

Pfp-fm: an accelerated FM-index

A high-throughput gene sequence alignment strategy using parallel computing

A Novel Method of Characterizing Genetic Sequences: Genome Space with Biological Distance and Applications.

Quantifying and Mitigating Computational Inefficiency of Genomics Data Analysis

A Parallel Implementation for Determining Genomic Distances under Deletion and Insertion.