A Study of Parallel Betweenness Centrality Algorithm on a Manycore Architecture

Guangming Tan, G. Gao
2007-01-01
Abstract: Large-scale graph analysis algorithms, such as those in the SSCA2 benchmarks studied in this paper, play an increasingly important role in high-performance computing applications. Unlike most traditional scientific computing applications, graph algorithms often show dynamic and irregular computing behavior. It is difficult to attain good performance on large-scale conventional parallel architectures because these programs exhibit (i) little locality and data reuse, (ii) dynamically non-contiguous memory access patterns that are less amenable to static analysis, and (iii) fine-grain parallelism requiring lock synchronization. With the rapid advance of multi-core/many-core chip technology, some new architectural features are emerging: the traditional data cache is being replaced with fast memories (sometimes called scratch-pad memories) local to the cores in an explicit (user-visible) memory hierarchy, and a large number of processing cores (sometimes up to hundreds) are becoming available on a single chip. This presents both challenges and opportunities for mapping the graph algorithms studied in this paper.
In this paper, a scalable parallel algorithm for computing betweenness centrality in scale-free sparse graphs is proposed, and its performance and scalability are investigated. In particular, our algorithm addresses the parallelization challenges in the following ways:
1. We restructure the parallel algorithm to address the locality challenges by overlapping the latency of prefetching off-chip data into on-chip memory (via the explicit memory hierarchy of the underlying many-core architecture) with computation in a pipelined fashion;
2. We "gather" the dynamically non-contiguous off-chip memory accesses and convert them into contiguous on-chip memory accesses, i.e., "create" on-chip spatial locality just in time;
3. The fine-grain synchronization overhead due to locking is reduced by taking advantage of a specific fine-grain lock mechanism on a many-core architecture and a novel lock-free algorithm that exploits additional parallelism;
4. Our solution takes full advantage of the ample hardware thread unit resources to assist the parallel computation in managing data movement through the memory hierarchy as well as fine-grain data synchronization.
We have implemented our algorithm on the 160-core IBM Cyclops-64 chip architecture. Our experimental results confirm the effectiveness of our methods in addressing the performance and scalability challenges of the studied graph problems.
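For readers unfamiliar with the kernel being parallelized, the following is a minimal sequential sketch of betweenness centrality on an unweighted graph. It assumes Brandes' two-phase formulation (a breadth-first search that counts shortest paths, followed by a backward dependency accumulation); the abstract does not name the base algorithm, so this choice, the function name `betweenness_centrality`, and the adjacency-dictionary input are illustrative assumptions, not the authors' implementation. The comments mark the shared updates where a parallel version would plausibly need the fine-grain synchronization the abstract refers to.

```python
from collections import deque


def betweenness_centrality(adj):
    """Sequential baseline (assumed Brandes-style), unweighted graph.

    adj: dict mapping every vertex to a list of its neighbors.
    Returns a dict of scores counted over ordered (source, target) pairs.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s] = 1
        dist[s] = 0
        order = []  # vertices in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:  # irregular, data-dependent neighbor accesses
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    # Shared update: several v may reach the same w.
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                # Shared update: several w may share a predecessor v.
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Path graph 0 - 1 - 2, stored with both edge directions.
path = {0: [1], 1: [0, 2], 2: [1]}
scores = betweenness_centrality(path)
```

Because every ordered source/target pair is visited, an undirected graph stored with both edge directions yields scores that are twice the usual undirected values; halve them if the conventional definition is wanted.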
Computer Science