Abstract: GPUs have recently emerged as an important parallel platform for general-purpose applications in both HPC and cloud environments. Because of their special execution model, developing programs for GPUs remains difficult even with high-level languages such as CUDA and OpenCL. To ease the programming effort, prior research has proposed generating parallel GPU code automatically with complex compile-time techniques. However, this approach can only parallelize loops that are entirely free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism that cannot be proven by static analysis, we propose GPU-TLS, a runtime system that speculatively parallelizes possibly-parallel loops in sequential programs on GPUs.

GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel under the speculation that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations that did not mis-speculate are copied to the master memory, while iterations that did mis-speculate are re-executed.

GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The higher mis-speculation rate caused by the larger number of threads is reduced by three approaches: the loop-chopping parallelization approach, the deferred memory update scheme, and the intra-warp value forwarding method. (2) The higher overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order while still guaranteeing sequential semantics.

Extensive evaluations using both microbenchmarks and real-life applications on two recent NVIDIA GPU cards show that speculative loop parallelization with GPU-TLS achieves speedups ranging from 5x to 160x for sequential programs with possibly-parallel loops.
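The speculate, check, commit, and re-execute cycle described in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the GPU-TLS implementation: it runs serially in Python, commits in iteration order rather than with the parallel commit scheme, and omits intra-warp value forwarding and the hybrid eager/lazy checking. The names `run_speculative`, `run_sequential`, and `example_body` are hypothetical and introduced here for illustration.

```python
from typing import Callable, Dict, List, Set

Read = Callable[[int], int]
# A loop body receives the iteration index and a read function,
# and returns the writes it wants to perform as {address: value}.
Body = Callable[[int, Read], Dict[int, int]]


def run_sequential(memory: List[int], n_iters: int, body: Body) -> None:
    """Reference execution: iterations run one after another."""
    for i in range(n_iters):
        for addr, value in body(i, lambda a: memory[a]).items():
            memory[addr] = value


def run_speculative(memory: List[int], n_iters: int, body: Body, chunk: int) -> int:
    """Simulate chopped speculative execution; return the number of re-executions."""
    reexecuted = 0
    for start in range(0, n_iters, chunk):
        end = min(start + chunk, n_iters)

        # Speculative phase: each iteration of the sub-loop reads the master
        # memory as it was at sub-loop start and only buffers its writes
        # (deferred memory update). On a GPU these would run in parallel.
        logs = []
        for i in range(start, end):
            reads: Set[int] = set()

            def read(addr: int, reads: Set[int] = reads) -> int:
                reads.add(addr)
                return memory[addr]

            logs.append((i, reads, body(i, read)))

        # Check and commit phase (serial here, for simplicity).
        written_earlier: Set[int] = set()
        for i, reads, writes in logs:
            if reads & written_earlier:
                # The iteration read an address that an earlier iteration of
                # the same sub-loop wrote: mis-speculation, so re-execute it
                # against the now up-to-date master memory.
                writes = body(i, lambda a: memory[a])
                reexecuted += 1
            for addr, value in writes.items():
                memory[addr] = value
            written_earlier.update(writes)
    return reexecuted


def example_body(i: int, read: Read) -> Dict[int, int]:
    """Hypothetical loop body: every fourth iteration depends on its predecessor."""
    if i > 0 and i % 4 == 0:
        return {i: read(i - 1) + 1}
    return {i: i * 10}
```

With `example_body` over 16 iterations, a sub-loop size of 4 places each dependent iteration at the start of its sub-loop, so its predecessor's write is already committed and nothing is re-executed. Sub-loop sizes of 8 and 16 trigger 2 and 3 re-executions respectively. In every case the final memory matches `run_sequential`, which is the sequential-semantics guarantee the abstract refers to.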

A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs

Barrier-Aware Warp Scheduling for Throughput Processors.

CWLP: Coordinated Warp Scheduling and Locality-Protected Cache Allocation on GPUs.

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

An Optimized GP-GPU Warp Scheduling Algorithm for Sparse Matrix-Vector Multiplication

A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU

WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

Improve GPGPU Latency Hiding with a Hybrid Recovery Stack and a Window Based Warp Scheduling Policy.

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

WSMP: a Warp Scheduling Strategy Based on MFQ and PPF

LFWS: Long-Operation First Warp Scheduling Algorithm to Effectively Hide the Latency for GPUs

Improving SIMD Utilization with Thread-Lane Shuffled Compaction in GPGPU

A Survey of GPGPU Parallel Processing Architecture Performance Optimization

An Approximate Optimal Solution to GPU Workload Scheduling

Locality based warp scheduling in GPGPUs.

Adaptive Cache and Concurrency Allocation on GPGPUs

GPGPU-Based Parallel Algorithms for Scheduling Against Due Date.

Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

GPU-TLS: An Efficient Runtime for Speculative Loop Parallelization on GPUs

Warp-Aware Adaptive Energy Efficiency Calibration for Multi-GPU Systems