DGCF: A Distributed Greedy Clustering Framework for Large-scale Genomic Sequences.

Zekun Yin, Xiaoming Xu, Kaichao Fan, Ruilin Li, Weizhong Li, Weiguo Liu, Beifang Niu
DOI: https://doi.org/10.1109/bibm47256.2019.8983385
2019-01-01
Abstract: Clustering is a fundamental but time-consuming operation in biological sequence analysis. New sequencing technologies such as NGS and 3GS have dramatically increased both dataset size and the length of individual reads. However, existing tools do not scale to large datasets or to long sequences. A feasible solution to this problem is to use parallel and distributed systems. Deploying such systems efficiently, however, requires a high degree of parallelism in both the software implementation and the algorithmic optimizations. In this paper, we propose DGCF, a Distributed Greedy Clustering Framework that is capable of handling large-scale datasets and long sequences. Our framework adopts a greedy clustering strategy that overlaps communication with computation across many distributed computing nodes. We also design and implement a sparse suffix array (SSA)-based alignment algorithm that supports long sequences. Experiments show that our framework achieves near-linear speedups on a distributed-memory cluster.
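To make the greedy clustering strategy mentioned in the abstract concrete, the following is a minimal single-process sketch of generic greedy incremental clustering. It is not the DGCF implementation: the abstract does not give the algorithm's details, so the function names (`kmer_set`, `similarity`, `greedy_cluster`), the longest-first ordering, the 0.9 threshold, and the shared k-mer similarity are all assumptions made for illustration. In particular, the k-mer measure is only a stand-in for the SSA-based alignment, and the distribution of work across nodes is omitted entirely.

```python
def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Fraction of shared k-mers relative to the smaller k-mer set.

    Assumption: a cheap placeholder for a real alignment score.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def greedy_cluster(seqs, threshold=0.9, k=3):
    """Greedy incremental clustering.

    Sequences are visited longest first. Each one joins the first
    representative it is similar enough to; otherwise it becomes a
    new representative. Returns {representative index: member indices}.
    """
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    representatives = []
    clusters = {}
    for i in order:
        for r in representatives:
            if similarity(seqs[r], seqs[i], k) >= threshold:
                clusters[r].append(i)
                break
        else:
            # No existing representative matched: start a new cluster.
            representatives.append(i)
            clusters[i] = [i]
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTA", "TTTTGGGGCC"]
    print(greedy_cluster(reads))
```

In a distributed setting, the inner comparison loop is the natural unit to spread across nodes, which is where overlapping communication with computation would come into play.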