Abstract: Applications running concurrently in CMP systems interfere with each other at DRAM memory, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness. However, it cannot resolve the interference issue effectively. To reduce interference, memory partitioning divides memory resources among threads. Memory channel partitioning (MCP) maps the data of threads that are likely to interfere severely with each other to different channels. However, it allocates memory resources unfairly and physically exacerbates memory contention among intensive threads, which ultimately increases the slowdown of these threads and leads to high system unfairness. Bank partitioning divides memory banks among cores and eliminates interference. However, previous equal bank partitioning restricts the number of banks available to each individual thread and reduces bank-level parallelism. In this paper, we first propose Dynamic Bank Partitioning (DBP), which partitions memory banks according to each thread's requirement for banks. DBP compensates for the reduced bank-level parallelism caused by equal bank partitioning. The key principle is to profile threads' memory characteristics at run-time, estimate their demand for banks, and then use the estimate to direct bank partitioning. Second, we observe that bank partitioning and memory scheduling are orthogonal, in the sense that each benefits when the two are applied together. Therefore, we present a comprehensive approach that integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling (DBP-TCM; TCM is one of the best memory scheduling algorithms) to further improve system performance. Experimental results show that the proposed DBP improves system performance by 4.3% and system fairness by 16% over equal bank partitioning. Compared to TCM, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. When compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our methods are effective in improving both system throughput and fairness.
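
The partitioning step described in the abstract (estimate each thread's demand for banks, then divide the banks accordingly) can be illustrated with a minimal sketch. The abstract does not give the estimation formula or the allocation rule, so everything here is an assumption: the `demand` scores are a hypothetical stand-in for whatever the run-time profiler produces, and the proportional largest-remainder split is one simple allocation choice, not necessarily the one DBP uses.

```python
def partition_banks(demand, total_banks):
    """Split total_banks among threads in proportion to demand scores.

    demand: dict mapping a thread id to a non-negative score
            (hypothetical; stands in for a profiled bank-demand estimate).
    Returns a dict mapping each thread id to its number of banks.
    """
    if total_banks < len(demand):
        raise ValueError("need at least one bank per thread")

    # Every thread gets one bank so it can always make progress.
    alloc = {t: 1 for t in demand}
    spare = total_banks - len(demand)

    total = sum(demand.values())
    if total == 0:
        # No demand information: fall back to an equal split.
        shares = {t: spare / len(demand) for t in demand}
    else:
        shares = {t: spare * d / total for t, d in demand.items()}

    # Hand out the integer part of each proportional share.
    for t, s in shares.items():
        alloc[t] += int(s)

    # Give the remaining banks to the largest fractional remainders.
    leftover = total_banks - sum(alloc.values())
    by_remainder = sorted(
        demand, key=lambda t: shares[t] - int(shares[t]), reverse=True
    )
    for t in by_remainder[:leftover]:
        alloc[t] += 1
    return alloc
```

With equal scores this reduces to equal bank partitioning; a thread with a higher score receives more banks, which is the compensation for lost bank-level parallelism that the abstract describes.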

Improve GPGPU Latency Hiding with a Hybrid Recovery Stack and a Window Based Warp Scheduling Policy.

Improving branch divergence performance on GPGPU with a new PDOM stack and multi-level warp scheduling.

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

DAW-DMR: Divergence-Aware Warped DMR with Full Error Detection for GPGPUs

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications.

Barrier-Aware Warp Scheduling for Throughput Processors.

WAP: the Warp Feature Aware Prefetching Method for LLC on CPU-GPU Heterogeneous Architecture

WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

Warp-Aware Adaptive Energy Efficiency Calibration for Multi-GPU Systems

Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt)

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Dynamic-II Pipeline: Compiling Loops with Irregular Branches on Static-Scheduling CGRA

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs.

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs.

POSTER: BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads

LockillerTM: Enhancing Performance Lower Bounds in Best-Effort Hardware Transactional Memory

Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning

Speculative Parallelization Using State Separation and Multiple Value Prediction.