Abstract: Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Little prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications, even though GPGPUs employ lightweight hardware-supported barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. Warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or from shared resource contention; we also show that it may result from warp scheduling itself. To mitigate barrier-induced stall cycles, we propose barrier-aware warp scheduling (BAWS), which combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate BAWS, we consider 13 barrier-intensive GPGPU applications and report that BAWS improves performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against the recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS' hardware cost is limited to 6 bytes per streaming multiprocessor (SM).
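
The most-waiting-first (MWF) rule described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the `Warp` record, the `mwf_pick` function, and the tie-break by lowest warp id are all hypothetical names and choices introduced here for illustration. The sketch counts, per thread block, how many warps are waiting at a barrier, then picks a ready warp from the block with the most waiting warps.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Warp:
    """Hypothetical warp record used only for this illustration."""
    warp_id: int
    block_id: int
    at_barrier: bool  # True if the warp is stalled waiting at a barrier


def mwf_pick(warps: Sequence[Warp]) -> Optional[Warp]:
    """Pick the next warp to issue using a most-waiting-first style rule.

    Ready warps whose thread block has more warps waiting at a barrier
    get higher priority. Ties are broken by lowest warp id, which is an
    assumption of this sketch and not something the abstract specifies.
    """
    # Number of warps currently waiting at a barrier, per thread block.
    waiting = Counter(w.block_id for w in warps if w.at_barrier)

    # Only warps that are not stalled at a barrier can be issued.
    ready = [w for w in warps if not w.at_barrier]
    if not ready:
        return None

    # Highest waiting count first, then lowest warp id.
    return min(ready, key=lambda w: (-waiting[w.block_id], w.warp_id))
```

For example, if block 0 has one warp at a barrier and block 1 has two, a ready warp from block 1 is selected first, since finishing it releases more stalled warps. A critical-fetch-first (CFF) stage would then, per the abstract, fetch instructions for whichever warp this rule is about to issue in the next cycle.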

Improving branch divergence performance on GPGPU with a new PDOM stack and multi-level warp scheduling.

Improve GPGPU Latency Hiding with a Hybrid Recovery Stack and a Window Based Warp Scheduling Policy.

An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization.

DAW-DMR: Divergence-Aware Warped DMR with Full Error Detection for GPGPU S

Dynamic-II Pipeline: Compiling Loops with Irregular Branches on Static-Scheduling CGRA

A GPU-Accelerated Framework for Path-Based Timing Analysis

DARM: Control-Flow Melding for SIMT Thread Divergence Reduction -- Extended Version

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications.

Stack-based Parallel Recursion on Graphics Processors.

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

Hybrid CPU-GPU scheduling and execution of tree traversals

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

POSTER: High Performance GPU Concurrent B plus tree

Barrier-Aware Warp Scheduling for Throughput Processors.

Adaptive Data Path Selection for Durable Transaction in GPU Persistent Memory

New software pipelining branch-intensive loops

Improving Performance of Dynamic Programming Via Parallelism and Locality on Multicore Architectures

WAP: the Warp Feature Aware Prefetching Method for LLC on CPU-GPU Heterogeneous Architecture