Abstract: With the advent of multicores, multithreaded programming has acquired increased importance. To obtain good performance, the synchronization constructs in multithreaded programs need to be carefully implemented. These implementations can be broadly classified into two categories: busy–wait and schedule‐based. For shared memory architectures, busy–wait synchronizations are preferred over schedule‐based synchronizations because they can achieve lower wakeup latency, especially when the expected wait time is much shorter than the scheduling time. While busy–wait synchronizations can improve the performance of multithreaded programs running on multicore machines, they create a challenge in program debugging, especially in detecting and identifying the causes of data races. Although significant research has been done on data race detection, prior works rely on one important assumption: that the debuggers are aware of all the synchronization operations performed during a program run. This assumption is a significant limitation, as multithreaded programs, including the popular SPLASH‐2 benchmark, have busy–wait synchronizations such as barriers and flag synchronizations implemented in the user code. We show that the lack of knowledge of these synchronization operations leads to unnecessary reporting of numerous races. To tackle this problem, we propose a dynamic technique for identifying user‐defined synchronizations that are performed during a program run. Both software and hardware implementations are presented. Furthermore, our technique can be easily exploited by a record/replay system to significantly speed up the replay. It can also be leveraged by a transactional memory system to effectively resolve a livelock situation. Our evaluation confirms that our synchronization detector is highly accurate, with no false negatives and very few false positives. We further observe that the knowledge of synchronization operations results in a 23% reduction in replay time. Finally, we show that, using synchronization knowledge, livelocks can be efficiently avoided during runtime monitoring of programs. Copyright © 2009 John Wiley & Sons, Ltd.
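
To make the notion of a user-defined flag synchronization concrete, the following minimal sketch shows the kind of busy–wait pattern the abstract refers to. It is an illustration only, not the detection technique proposed in the paper: the function name `run_flag_sync` and the `shared` dictionary are hypothetical. It relies on CPython, where individual dictionary reads and writes are atomic; an equivalent in C or C++ would need atomic operations or memory fences.

```python
import threading


def run_flag_sync(payload=42):
    """Producer publishes data through a flag; consumer busy-waits on it.

    Hypothetical illustration of a user-defined flag synchronization.
    """
    shared = {"data": None, "flag": False}
    seen = []

    def producer():
        shared["data"] = payload  # ordinary write to shared data
        shared["flag"] = True     # user-level "release": signal via the flag

    def consumer():
        # Busy-wait: spin until the producer sets the flag. No library
        # synchronization call is made, so a tool that only tracks such
        # calls cannot see that this loop orders the two threads.
        while not shared["flag"]:
            pass
        # This read is ordered after the producer's write only by the flag.
        seen.append(shared["data"])

    c = threading.Thread(target=consumer)
    p = threading.Thread(target=producer)
    c.start()
    p.start()
    c.join()
    p.join()
    return seen[0]
```

A race detector that does not recognize the spin loop as synchronization would treat the accesses to `shared["data"]` in the two threads as unordered and report them as a race, even though the flag orders them.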

Compile-Time Automatic Synchronization Insertion and Redundant Synchronization Elimination for GPU Kernels

A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

Efficient Synchronization Primitives for GPUs

Improving the Scalability of GPU Synchronization Primitives

ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

A Polyhedral Modeling Based Source-to-Source Code Optimization Framework for GPGPU

A Compiler-assisted Locality Aware CTA Mapping Scheme

ACS: Concurrent Kernel Execution on Irregular, Input-Dependent Computational Graphs

Making GPU Warp Scheduler and Memory Scheduler Synchronization-Aware

Efficient Kernel Management on GPUs

Protecting Synchronization Mechanisms of Parallel Big Data Kernels via Logging

ICCAD: U: Optimizing GPU Shared Memory Allocation in Automated C-to-CUDA Compilation

Automatic Horizontal Fusion for GPU Kernels

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

Register and Thread Structure Optimization for GPUs

Automated Dynamic Detection of Busy–wait Synchronizations

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

An Efficient Compiler Framework for Cache Bypassing on GPUs

Reducing overheads of dynamic scheduling on heterogeneous chips

Global Optimizations & Lightweight Dynamic Logic for Concurrency

Synchronization Coherence: A Transparent Hardware Mechanism For Cache Coherence And Fine-Grained Synchronization