Abstract: Application programs that exhibit strong locality of reference incur fewer cache misses and perform better across architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in the on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper we target task-based programs (which cover a large set of available multithreaded programs) and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at sub-task granularity. Second, based on the intensity of temporal and spatial data reuse among sub-tasks, we generate new tasks, each of which includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly generated tasks to cores in an architecture-aware fashion, using knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data are executed in close proximity in time. Detailed experiments show that, when targeting a state-of-the-art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it outperforms two state-of-the-art data access optimizations on all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to the maximum savings that could be achieved with perfect profiling information.
Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
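
The first two steps (building a sub-task reuse graph and regrouping sub-tasks into new tasks) can be illustrated with a minimal sketch. It assumes each sub-task is described only by the set of data block IDs it touches, and it uses a simple greedy merge with a size cap. The function names, the `max_size` parameter, and the greedy heuristic are illustrative assumptions, not the algorithm used in the paper.

```python
from itertools import combinations


def build_reuse_graph(accesses):
    """Return {(a, b): weight}, where weight is the number of data
    blocks that sub-tasks a and b both touch (hypothetical reuse metric)."""
    graph = {}
    for a, b in combinations(sorted(accesses), 2):
        shared = len(accesses[a] & accesses[b])
        if shared > 0:
            graph[(a, b)] = shared
    return graph


def group_subtasks(accesses, max_size):
    """Greedily merge sub-tasks into new tasks, visiting reuse edges from
    heaviest to lightest and capping each new task at max_size sub-tasks."""
    graph = build_reuse_graph(accesses)
    group_of = {s: {s} for s in accesses}  # each sub-task starts alone
    edges = sorted(graph.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), _weight in edges:
        ga, gb = group_of[a], group_of[b]
        if ga is not gb and len(ga) + len(gb) <= max_size:
            merged = ga | gb
            for s in merged:
                group_of[s] = merged
    # Collect each distinct group once, in a deterministic order.
    seen, groups = set(), []
    for s in sorted(accesses):
        g = group_of[s]
        if id(g) not in seen:
            seen.add(id(g))
            groups.append(sorted(g))
    return groups


# Hypothetical input: two original tasks (t0, t1), each split into two
# sub-tasks; the sets hold the data block IDs each sub-task accesses.
accesses = {
    "t0.a": {0, 1},
    "t0.b": {4, 5},
    "t1.a": {1, 2},
    "t1.b": {5, 6},
}
new_tasks = group_subtasks(accesses, max_size=2)
print(new_tasks)
```

In this toy input, the sub-tasks that share a block come from different original tasks, so the regrouping pairs them across task boundaries. Core assignment and intra-task re-scheduling (the third and fourth steps) are not modeled here.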

A Compiler-assisted Locality Aware CTA Mapping Scheme

TAEM 2.0: A Faster Transfer-Aware Effective Loop Mapping for Heterogeneous Resources on CGRA

A Polyhedral Modeling Based Source-to-Source Code Optimization Framework for GPGPU

ICCAD: U: Optimizing GPU Shared Memory Allocation in Automated C-to-CUDA Compilation

Providing Source Code Level Portability Between CPU and GPU with MapCG

An Efficient Compiler Framework for Cache Bypassing on GPUs

Optimizing Spatial Mapping of Nested Loop for Coarse-Grained Reconfigurable Architectures

SPGPU: Spatially Programmed GPU

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures

Mapping Multi-Level Loop Nests Onto CGRAs Using Polyhedral Optimizations

Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs

MapCG: Writing Parallel Program Portable Between CPU and GPU

Polyhedral Model Based Mapping Optimization of Loop Nests for CGRAs

Mix and Match: Reorganizing Tasks for Enhancing Data Locality

A-MapCG: An Adaptive MapReduce Framework for GPUs

Combining Memory Partitioning and Subtask Generation for Parallel Data Access on CGRAs

Last Level Cache Layout Remapping for Heterogeneous Systems

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

Conflict-aware compiler for hierarchical register file on GPUs

Compile-Time Automatic Synchronization Insertion and Redundant Synchronization Elimination for GPU Kernels