Reference Sequence Selection for Motif Searches

Qiang Yu, Hongwei Huo, Ruixing Zhao, Dazheng Feng, Jeffrey Scott Vitter, Jun Huan
DOI: https://doi.org/10.1109/bibm.2015.7359745
2015-01-01
Abstract: The planted (l, d) motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without any deliberate selection, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets. In this paper, we formulate the reference sequence selection problem and propose a method named RefSelect that solves it quickly by evaluating the number of candidate motifs for the reference sequences. RefSelect yields a practical running-time improvement for state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily and efficiently, (2) in particular, gives them a speedup of up to about 100× on protein data, and (3) is also suitable for large data sets that contain hundreds of sequences or more.
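To make the role of the reference sequence concrete, the following is a minimal brute-force sketch of a pattern-driven search with a single reference sequence (k = 1). It assumes the standard definition of an (l, d) motif: a string of length l that occurs in every input sequence with at most d mismatches. This is not the RefSelect method and not any of the tested algorithms; all function names are hypothetical, and the candidate count is obtained by exhaustive enumeration rather than by the evaluation procedure the paper proposes.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """Set of all length-l substrings of seq."""
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def neighbors(lmer, d, alphabet):
    """All strings within Hamming distance <= d of lmer."""
    out = set()
    for r in range(d + 1):
        for positions in combinations(range(len(lmer)), r):
            for letters in product(alphabet, repeat=r):
                cand = list(lmer)
                for p, c in zip(positions, letters):
                    cand[p] = c
                out.add("".join(cand))
    return out


def candidates(ref, l, d, alphabet):
    """Candidate motifs generated from one reference sequence."""
    out = set()
    for x in lmers(ref, l):
        out |= neighbors(x, d, alphabet)
    return out


def occurs(motif, seq, d):
    """True if seq contains an l-mer within distance d of motif."""
    return any(hamming(motif, x) <= d for x in lmers(seq, len(motif)))


def find_motifs(seqs, l, d, alphabet, ref_index=0):
    """Return (set of (l, d) motifs, number of candidates verified)."""
    cands = candidates(seqs[ref_index], l, d, alphabet)
    others = [s for i, s in enumerate(seqs) if i != ref_index]
    motifs = {m for m in cands if all(occurs(m, s, d) for s in others)}
    return motifs, len(cands)


if __name__ == "__main__":
    seqs = ["ACGTACGT", "TTACGAAA", "GGACCTGG"]  # toy input, not from the paper
    for i in range(len(seqs)):
        motifs, n = find_motifs(seqs, 4, 1, "ACGT", ref_index=i)
        print(f"reference {i}: {n} candidates, {len(motifs)} motifs")
```

Every choice of reference yields the same motif set, because any motif must lie within distance d of some l-mer of each sequence. What changes is the number of candidates that have to be verified, which is the quantity a reference selection step would try to keep small.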