Abstract: Background: For decades, researchers have worked to speed up genome sequencing and reduce its cost, and they have made great strides in both areas, making genome data easier to study and analyze. However, efficiently storing and transmitting the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers. Research on genome data compression algorithms that represent genome data efficiently has therefore attracted growing attention. Meanwhile, because current computing devices have multiple cores, making full use of them to improve parallel processing efficiency is also an important direction in the design of genome compression algorithms. Results: We propose a reference-based algorithm (LMSRGC) that uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMSs) for compressing genome data in FASTA format. The algorithm uses the properties of the SA and the LCP array to select all appropriate LMSs between the target genome sequence and the reference genome sequence, and then uses these LMSs to compress the target sequence. To speed up the algorithm, we use GPUs to parallelize the construction of the SA and multiple threads to parallelize the creation of the LCP array and the filtering of LMSs. Conclusions: Experimental results demonstrate that our algorithm is competitive with current state-of-the-art algorithms in compression ratio and compression time.
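
The idea of finding longest matched substrings with an SA and an LCP array can be illustrated with a small single-threaded sketch. It is not the LMSRGC implementation: the function names, the `#` separator, the `min_len` threshold, and the greedy token format are all assumptions made for illustration, and a naive sort stands in for the GPU-parallel SA construction described in the abstract. The sketch builds the SA and LCP array over `reference + "#" + target`, finds for every target position the longest match against the nearest reference suffixes in SA order, and then encodes the target as `(reference_position, length)` pairs or literal characters.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai's algorithm: lcp[r] is the length of the common prefix of the
    # suffixes at ranks r-1 and r (lcp[0] is 0).
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_matches(reference, target, sep="#"):
    # Assumption: `sep` occurs in neither sequence, so matches cannot cross it.
    s = reference + sep + target
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    n_ref, offset = len(reference), len(reference) + 1
    best = [(0, -1)] * len(target)  # (match length, reference position)

    # Forward scan: nearest reference suffix above each target suffix.
    cur_len, cur_pos = 0, -1
    for r, i in enumerate(sa):
        cur_len = min(cur_len, lcp[r])
        if i < n_ref:
            cur_len, cur_pos = n_ref - i, i
        elif i >= offset and cur_len > best[i - offset][0]:
            best[i - offset] = (cur_len, cur_pos)

    # Backward scan: nearest reference suffix below each target suffix.
    cur_len, cur_pos = 0, -1
    for r in range(len(sa) - 1, -1, -1):
        i = sa[r]
        if i < n_ref:
            cur_len, cur_pos = n_ref - i, i
        elif i >= offset and cur_len > best[i - offset][0]:
            best[i - offset] = (cur_len, cur_pos)
        cur_len = min(cur_len, lcp[r])
    return best


def compress(reference, target, min_len=4):
    # Greedy encoding: emit a (position, length) pair for long enough matches,
    # otherwise a literal character. `min_len` is a hypothetical threshold.
    best = longest_matches(reference, target)
    tokens, i = [], 0
    while i < len(target):
        length, pos = best[i]
        if length >= min_len:
            tokens.append((pos, length))
            i += length
        else:
            tokens.append(target[i])
            i += 1
    return tokens


def decompress(reference, tokens):
    parts = []
    for t in tokens:
        if isinstance(t, tuple):
            pos, length = t
            parts.append(reference[pos:pos + length])
        else:
            parts.append(t)
    return "".join(parts)
```

For example, `decompress(ref, compress(ref, tgt))` returns `tgt` for any two strings that do not contain `#`. A practical compressor would replace the naive sort with a linear-time or parallel SA construction and would entropy-code the resulting tokens.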

A Fast Longest Common Subsequence Algorithm for Biosequences Alignment

Efficient Algorithms for Finding a Longest Common Increasing Subsequence

A Fast Exact Pattern Matching Algorithm for Biological Sequences

A Fast Improved Pattern Matching Algorithm for Biological Sequences

Gene Sequence Alignment on a Public Computing Platform

On the Complexity of Constrained Sequences Alignment Problems

Constrained Pairwise and Center-Star Sequences Alignment Problems

Efficient algorithms for the longest common subsequence in $k$-length substrings

Parallel linear space algorithm for large-scale sequence alignment

An average-case efficient two-stage algorithm for enumerating all longest common substrings of minimum length k between genome pairs

An algorithm for rapid noncoding RNA sequence-structure alignment

A novel fast multiple nucleotide sequence alignment method based on FM-index

Algorithms For Loosely Constrained Multiple Sequence Alignment

hLCS. A Hybrid GPGPU Approach for Solving Multiple Short and Unbalanced LCS Problems

The colored longest common prefix array computed via sequential scans

Algorithms for the Uniqueness of the Longest Common Subsequence

Reference-based genome compression using the longest matched substrings with parallelization consideration

Efficient Parallel Algorithm for Optimal Three-Sequences Alignment

An efficient parallel algorithm for multiple sequence similarities calculation using a low complexity method

fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

Parallel Three-sequence Alignment with Space-efficient