EST Clustering in Large Dataset with MapReduce

Chunyu Wang, Maozu Guo, Yang Liu
DOI: https://doi.org/10.1109/PCSPA.2010.239
2010-01-01
Abstract: Analysis of EST data usually starts with EST clustering, the process of grouping fragments according to their original consensus long sequence. Similarity between ESTs always means that parts of their sequences match each other in some way. Accurate clustering takes time quadratic in the average EST length and in the number of ESTs, and the number of ESTs in public EST databases is increasing exponentially. With the help of cloud computing, we provide a k-mer based MapReduce algorithm for EST clustering in large datasets on commodity computers, and implement the algorithm in the mrClust package. The results show that it is scalable and efficient for large EST datasets.
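The abstract does not spell out the algorithm, so the following is only a minimal sketch of how k-mer based clustering could be phrased as map and reduce steps. It is not the mrClust implementation. The function names (`map_kmers`, `reduce_kmer`, `cluster_ests`), the value of `k`, and the rule that a single shared k-mer is enough to link two ESTs are all assumptions made for illustration. The sketch runs in a single process and simulates the shuffle with a dictionary; it ignores reverse complements and similarity thresholds.

```python
from collections import defaultdict


def map_kmers(est_id, seq, k):
    """Map step (hypothetical): emit one (k-mer, est_id) pair per distinct k-mer."""
    for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
        yield kmer, est_id


def reduce_kmer(kmer, est_ids):
    """Reduce step (hypothetical): link every EST containing this k-mer to the first one."""
    ids = sorted(set(est_ids))
    for other in ids[1:]:
        yield ids[0], other


def cluster_ests(ests, k=8):
    """Cluster ESTs that share at least one k-mer (assumed linking rule)."""
    # Simulated shuffle: group the mapper output by k-mer.
    groups = defaultdict(list)
    for est_id, seq in ests.items():
        for kmer, eid in map_kmers(est_id, seq, k):
            groups[kmer].append(eid)

    # Union-find merges the pairwise links emitted by the reducers.
    parent = {est_id: est_id for est_id in ests}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for kmer, ids in groups.items():
        for a, b in reduce_kmer(kmer, ids):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    clusters = defaultdict(set)
    for est_id in ests:
        clusters[find(est_id)].add(est_id)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    toy = {"a": "ACGTACGTAA", "b": "TTACGTACGG", "c": "GGGGGGGGGG"}
    print(cluster_ests(toy, k=6))
```

Grouping by k-mer avoids comparing every pair of ESTs directly: only ESTs that land in the same k-mer group are ever linked, which is the kind of saving that motivates moving away from the quadratic all-pairs comparison.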