Abstract: Large-scale graph analysis algorithms, such as those in the SCCA 2 benchmarks studied in this paper, play an increasingly important role in high-performance computing applications. Unlike most traditional scientific computing applications, graph algorithms often show dynamic and irregular computing behavior. It is difficult to attain good performance on large-scale conventional parallel architectures because these programs exhibit (i) little locality and data reuse, (ii) dynamically non-contiguous memory access patterns that are less amenable to static analysis, and (iii) fine-grain parallelism requiring lock synchronization. With the rapid advance of multi-core/many-core chip technology, some new architectural features are emerging: the traditional data cache is being replaced with fast memories (sometimes called scratch-pad memories) local to the cores in an explicit (user-visible) memory hierarchy, and a large number of processing cores (sometimes up to hundreds) are becoming available on a single chip. This presents both challenges and opportunities for mapping the graph algorithms studied in this paper.

In this paper, a scalable parallel algorithm for computing betweenness centrality in scale-free sparse graphs is proposed, and its performance and scalability are investigated. In particular, our algorithm addresses the parallelization challenges in the following ways:

1. We restructure the parallel algorithm to address the locality challenges by overlapping the latency of prefetching off-chip data into on-chip memory (via the explicit memory hierarchy of the underlying many-core architecture) with computation in a pipelined fashion;
2. We “gather” the dynamically non-contiguous off-chip memory accesses and convert them into contiguous on-chip memory accesses, i.e., “create” on-chip spatial locality just in time;
3. The fine-grain synchronization overhead due to locking is reduced by taking advantage of a specific fine-grain lock mechanism on the many-core architecture and a novel lock-free algorithm that exploits additional parallelism;
4. Our solution takes full advantage of the ample hardware thread unit resources to assist the parallel computation in managing data movement through the memory hierarchy as well as fine-grain data synchronization.

We have implemented our algorithm on the 160-core IBM Cyclops-64 chip architecture. Our experimental results confirm the effectiveness of our methods in addressing the performance and scalability challenges of the studied graph problems.
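For readers unfamiliar with the underlying computation, the following is a minimal sequential sketch of betweenness centrality for an unweighted graph, written in the style of Brandes' well-known algorithm. It is not the parallel algorithm proposed in the paper, and the abstract does not state which sequential formulation the authors start from; the function name `betweenness_centrality` and the adjacency-dictionary input format are assumptions made purely for illustration. The sketch counts ordered source-target pairs, so on an undirected graph each unordered pair contributes twice.

```python
from collections import deque


def betweenness_centrality(adj):
    """Illustrative sequential sketch (hypothetical helper, not the paper's code).

    adj maps each vertex to a list of its neighbours (unweighted graph).
    Returns a dict of unnormalised scores over ordered (source, target) pairs.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and
        # recording the shortest-path predecessors of every vertex.
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s] = 1
        dist[s] = 0
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:  # irregular, data-dependent neighbour access
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]  # shared update in a parallel version
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


if __name__ == "__main__":
    # Path graph 0 - 1 - 2: only the middle vertex lies between other vertices.
    path = {0: [1], 1: [0, 2], 2: [1]}
    print(betweenness_centrality(path))
```

The neighbour loop illustrates the data-dependent memory accesses described above, and the updates to `sigma[w]` and `delta[v]` are the kind of shared writes that would need fine-grain synchronization if several threads processed the same frontier concurrently.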

A Study of Parallel Betweenness Centrality Algorithm on a Manycore Architecture

Analysis and Performance Results of Computing Betweenness Centrality on IBM Cyclops64

Characterizing Betweenness Centrality Algorithm On Multi-Core Architectures

Fine-Grained Parallel Betweenness Centrality Algorithm Without Lock Synchronization

Parallelizing Clique and Quasi-Clique Detection over Graph Data

A Parallel Algorithm for Computing Betweenness Centrality

Efficient parallel algorithms for dynamic closeness‐ and betweenness centrality

Scalable Parallel Distributed Coprocessor System for Graph Searching Problems with Massive Data

Parallelizing Maximal Clique Enumeration Over Graph Data

Vertex-centric Parallel Algorithms for Identifying Key Vertices in Large-Scale Graphs

Parallel Strong Connectivity Based on Faster Reachability

Parallelizing Maximal Clique and K-Plex Enumeration over Graph Data

Tuning the granularity of parallelism for distributed graph processing

An Efficient Parallel Algorithm of N-hop Neighborhoods on Graphs in Distributed Environment

Efficient Maximal Clique Enumeration Over Graph Data

A Stack-Centric Processing Model for Iterative Processing

Asynchronous Parallel Dijkstra's Algorithm on Intel Xeon Phi Processor - How to Accelerate Irregular Memory Access Algorithm

Understanding Parallelism in Graph Traversal on Multi-Core Clusters

Frequent Graph Mining on Multi-Core Processor

Fast Uncovering of Graph Communities on a Chip: Toward Scalable Community Detection on Multicore and Manycore Platforms

A Fine-Grained Hybrid CPU-GPU Algorithm for Betweenness Centrality Computations