PolyCluster: Minimum Fragment Disagreement Clustering for Polyploid Phasing

Sepideh Mazrouee, Wei Wang
DOI: https://doi.org/10.1109/tcbb.2018.2858803
2018-01-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: Phasing is an emerging area in computational biology with important applications in clinical decision making and biomedical sciences. While machine learning techniques have shown tremendous potential in many biomedical applications, their utility in phasing is not yet fully understood. In this paper, we investigate the development of clustering-based techniques for phasing in polyploid organisms, where more than two copies of each chromosome exist in the cells of the organism under study. We develop a novel framework, called PolyCluster, based on correlation clustering followed by an effective cluster-merging mechanism that minimizes the disagreement among the short reads residing in each cluster. We first introduce a graph model to quantify the similarity between each pair of DNA reads. We then present a combination of linear programming, rounding, region growing, and cluster merging to group similar reads and reconstruct haplotypes. Our extensive analysis demonstrates the effectiveness of PolyCluster in accurate and scalable phasing. In particular, we show that PolyCluster reduces the switching error of H-PoP, HapColor, and HapTree by 44.4, 51.2, and 48.3 percent, respectively. The running time of PolyCluster is also several orders of magnitude lower than that of HapTree and comparable to that of the other algorithms.
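To make the read-similarity graph and the notion of disagreement concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the similarity formula, so the edge weight used here (agreements minus disagreements over shared SNP positions) is an assumed stand-in, and the function names `read_similarity`, `build_similarity_graph`, and `disagreement` are hypothetical. The disagreement score follows the standard correlation clustering objective: positive edges cut between clusters plus negative edges kept inside a cluster. The linear programming, rounding, region-growing, and merging steps are not reproduced.

```python
from itertools import combinations

def read_similarity(a, b):
    """Assumed weight: agreements minus disagreements on shared SNP positions.

    Each read is a dict mapping SNP index -> allele. Returns 0 if the reads
    do not overlap.
    """
    shared = a.keys() & b.keys()
    agree = sum(1 for pos in shared if a[pos] == b[pos])
    return agree - (len(shared) - agree)

def build_similarity_graph(reads):
    """Return {(i, j): weight} for every pair of reads with nonzero weight."""
    edges = {}
    for i, j in combinations(range(len(reads)), 2):
        w = read_similarity(reads[i], reads[j])
        if w != 0:
            edges[(i, j)] = w
    return edges

def disagreement(edges, labels):
    """Weighted correlation-clustering disagreement of a cluster assignment.

    Counts |w| for each positive edge whose endpoints are in different
    clusters and for each negative edge whose endpoints share a cluster.
    """
    total = 0
    for (i, j), w in edges.items():
        same = labels[i] == labels[j]
        if (w > 0 and not same) or (w < 0 and same):
            total += abs(w)
    return total

# Toy reads (hypothetical data): reads 0 and 1 are consistent with one
# haplotype, reads 2 and 3 with another.
reads = [
    {0: 0, 1: 0, 2: 1},
    {1: 0, 2: 1, 3: 1},
    {0: 1, 1: 1, 2: 0},
    {2: 0, 3: 0},
]
edges = build_similarity_graph(reads)
good = disagreement(edges, [0, 0, 1, 1])  # groups consistent reads together
bad = disagreement(edges, [0, 1, 0, 1])   # splits consistent reads apart
```

In this toy case the grouping that keeps consistent reads together has zero disagreement, while the mixed grouping is penalized on every edge it violates. A clustering method in this setting searches for the assignment with the smallest such score.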