Abstract: GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in mapping irregular applications (applications with unpredictable, data-dependent behaviors) to GPUs. While most of the work in this space has focused on ad hoc implementations of specific algorithms, recent work has looked at generic techniques for mapping a large class of tree traversal algorithms to GPUs by carefully restructuring the traversals to make them behave more regularly. Unfortunately, even this general approach to GPU execution of tree traversal algorithms relies on ad hoc, handwritten, algorithm-specific scheduling (i.e., assignment of threads to warps) to achieve high performance. The key challenge of scheduling is that it is a highly irregular process that requires inspecting thread behavior and then carefully sorting the threads into warps. In this paper, we present a novel scheduling and execution technique for tree traversal algorithms that is both general and automatic. The key novelty is a hybrid approach: the GPU partially executes tasks to inspect thread behavior and transmits that information back to the CPU, which uses it to perform the scheduling itself; the remaining, carefully scheduled portion of the traversals then executes on the GPU. We applied this framework to five tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases our hybrid approach delivers better performance even than GPU code that uses hand-tuned, application-specific scheduling.
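The three-phase idea in the abstract (partial execution to inspect behavior, CPU-side sorting of threads into warps, then scheduled execution) can be illustrated with a minimal sketch. This is not the paper's implementation: everything runs on the CPU in plain Python, the names `profile_signature` and `schedule_into_warps` are hypothetical, and using the first few left/right decisions of a binary descent as the sort key is an assumption made purely for illustration.

```python
from typing import List, Sequence, Tuple

# Assumption: a tiny warp size for readability (NVIDIA warps have 32 threads).
WARP_SIZE = 4


def profile_signature(query: float, lo: float, hi: float, depth: int) -> Tuple[int, ...]:
    """Stand-in for the partial GPU run: record the left (0) / right (1)
    decisions of the first `depth` steps of a binary-space descent."""
    bits = []
    for _ in range(depth):
        mid = (lo + hi) / 2.0
        if query < mid:
            bits.append(0)
            hi = mid
        else:
            bits.append(1)
            lo = mid
    return tuple(bits)


def schedule_into_warps(queries: Sequence[float], lo: float, hi: float,
                        depth: int, warp_size: int = WARP_SIZE) -> List[List[int]]:
    """Stand-in for the CPU scheduling step: sort query indices by their
    profiled signature, then cut the sorted order into warp-sized groups."""
    sigs = [profile_signature(q, lo, hi, depth) for q in queries]
    # sorted() is stable, so queries with equal signatures keep input order.
    order = sorted(range(len(queries)), key=lambda i: sigs[i])
    return [order[i:i + warp_size] for i in range(0, len(order), warp_size)]


if __name__ == "__main__":
    queries = [0.9, 0.1, 0.8, 0.2, 0.15, 0.95, 0.05, 0.85]
    print(schedule_into_warps(queries, 0.0, 1.0, depth=2))
    # prints [[1, 3, 4, 6], [0, 2, 5, 7]]
```

In this toy input, queries that take the same early path end up in the same group, which is the property a warp scheduler wants: threads in one warp are likely to visit the same tree nodes in the remaining traversal and so diverge less.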

Hybrid CPU-GPU scheduling and execution of tree traversals
