CuMen: Clustering Sequences Based on Maximal Frequent Sequential Pattern and its Application in Genome Sequence Assembly

HUANG Dong,TANG Jun,WANG Wei,SHI Bai-Le
2005-01-01
Computer Science
Abstract: Sequencing genomes is a fundamental aspect of biological research. A variety of assembly programs have been proposed and implemented. Because of their high computational complexity and the increasingly large datasets they must handle, they incur great time and space overhead. In realistic applications, the sequencing process may become unacceptably slow because of insufficient memory, even on a mainframe with a huge amount of RAM. This paper offers a clustering algorithm based on maximal frequent sequential patterns, aiming to divide the whole dataset into several parts that can be processed independently and efficiently in limited memory. Several techniques are applied to optimize the mining and clustering procedure. The approach is also introduced into a grid environment, exploiting parallelism and distribution to further improve scalability.
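The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of partitioning reads by shared frequent patterns, not the CuMen algorithm itself. It swaps in a simpler technique: fixed-length contiguous substrings stand in for maximal frequent sequential patterns, and reads sharing any frequent substring are merged with a union-find structure. The function names and the parameters `k` and `min_support` are hypothetical.

```python
from collections import defaultdict


def frequent_substrings(reads, k, min_support):
    """Map each length-k substring to the set of read indices containing it,
    keeping only substrings found in at least `min_support` distinct reads.
    (Assumption: fixed-length substrings approximate the paper's patterns.)"""
    occurrences = defaultdict(set)
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            occurrences[read[i:i + k]].add(idx)
    return {p: ids for p, ids in occurrences.items() if len(ids) >= min_support}


def cluster_by_shared_patterns(reads, k=4, min_support=2):
    """Group read indices so that reads sharing a frequent substring end up
    in the same cluster; each cluster could then be assembled independently."""
    parent = list(range(len(reads)))

    def find(x):
        # Path-halving lookup of the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    for ids in frequent_substrings(reads, k, min_support).values():
        first, *rest = sorted(ids)
        for other in rest:
            union(first, other)

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


reads = ["ACGTAC", "GTACGG", "TTTTAA", "CTTTTA", "GGGCCC"]
print(cluster_by_shared_patterns(reads))  # [[0, 1], [2, 3], [4]]
```

In this toy input, reads 0 and 1 share `GTAC`, reads 2 and 3 share `TTTT` and `TTTA`, and read 4 shares nothing, so three independent parts result. A faithful implementation would mine variable-length maximal patterns and apply the optimizations the abstract mentions, which are not described here.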