Efficient computation of shortest absent words in a genomic sequence

Zong-Da Wu, Tao Jiang, Wu-Jie Su
DOI: https://doi.org/10.1016/j.ipl.2010.05.008
IF: 0.851
2010-01-01
Information Processing Letters
Abstract: Analyzing sequence composition is a basic task in genomic research. To compute the shortest absent words in a genomic sequence efficiently, we present a linear-time algorithm that first estimates the length of the shortest absent words using a probabilistic method and then, based on that estimate, finds all shortest absent words in the sequence. The algorithm scans the genomic sequence only once and avoids the space requirements of index structures such as suffix trees and suffix arrays. Experimental results show that the algorithm computes the shortest absent words of the human genome in only 1.5 minutes and is therefore more efficient than any existing algorithm.
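To make the "estimate a length, then scan without an index" idea concrete, the following is a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the probabilistic estimate, so this sketch substitutes a simple pigeonhole bound (a word of length k must be absent whenever 4^k exceeds the number of length-k windows) and then walks the length downward, which costs several scans rather than one. The function names (`present_kmers`, `pigeonhole_bound`, `shortest_absent_words`) and the choice to reset the window on non-ACGT characters are assumptions made for illustration.

```python
# Illustrative sketch only; NOT the paper's method. The pigeonhole bound
# below stands in for the paper's probabilistic length estimate.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def present_kmers(seq, k):
    """Flag which of the 4**k words of length k occur in seq (one scan)."""
    seen = bytearray(4 ** k)
    mask = 4 ** k - 1          # keeps only the last k two-bit codes
    value = 0                  # rolling 2-bit encoding of the current window
    valid = 0                  # consecutive ACGT characters seen so far
    for ch in seq.upper():
        code = CODE.get(ch)
        if code is None:       # assumption: characters such as N break the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            seen[value] = 1
    return seen


def decode(value, k):
    """Turn a 2-bit encoded word back into its ACGT string."""
    letters = []
    for _ in range(k):
        letters.append("ACGT"[value & 3])
        value >>= 2
    return "".join(reversed(letters))


def pigeonhole_bound(n):
    """Smallest k with 4**k > n - k + 1, so some length-k word must be absent."""
    k = 1
    while 4 ** k <= max(n - k + 1, 0):
        k += 1
    return k


def shortest_absent_words(seq):
    """Return all shortest absent words of seq in lexicographic order."""
    k = pigeonhole_bound(len(seq))
    seen = present_kmers(seq, k)
    # Every extension of an absent word is absent, so step down while the
    # shorter length still has at least one absent word.
    while k > 1:
        shorter = present_kmers(seq, k - 1)
        if all(shorter):
            break
        k, seen = k - 1, shorter
    return [decode(v, k) for v, flag in enumerate(seen) if not flag]
```

For example, `shortest_absent_words("AAAA")` reports the single-letter words other than `A`, while `"ACGT"` contains every single letter and therefore yields absent words of length 2. The flag array needs 4^k bytes, which stays small because the bound grows only logarithmically with the sequence length.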