Abstract: The performance of graphics processing unit (GPU) systems is improving rapidly to accommodate the increasing demands of graphics and high-performance computing applications. With this performance improvement, however, the power consumption of GPU systems has increased dramatically. Up to 30% of the total power of a GPU system is consumed by the graphics memory itself. Therefore, reducing graphics memory power consumption is critical to mitigating the power challenge. In this article, we propose an energy-efficient reconfigurable 3D die-stacking graphics memory design that integrates wide-interface graphics DRAMs side-by-side with a GPU processor on a silicon interposer. The proposed architecture is a “3D+2.5D” system, where the DRAM itself is 3D-stacked memory with through-silicon vias (TSVs), whereas the DRAM and the GPU processor are integrated through the interposer (2.5D). Since GPU computing units, memory controllers, and memory are all integrated in the same package, the number of memory I/Os is no longer constrained by the package’s pin count. We can therefore reduce memory power consumption by scaling down the supply voltage and frequency of the memory interface while maintaining the same or even higher peak memory bandwidth. In addition, we design a reconfigurable memory interface that can dynamically adapt to the requirements of various applications. We propose two reconfiguration mechanisms, one to optimize GPU system energy efficiency and one to optimize throughput, and thus benefit both memory-intensive and compute-intensive applications. The experimental results show that the proposed GPU memory architecture can improve GPU system energy efficiency by 21% without reconfiguration. The reconfigurable memory interface can further improve system energy efficiency by 26%, and system throughput by 31%, under a capped system power budget of 240 W.
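
The trade-off between interface width, frequency, and supply voltage described in the abstract can be illustrated with a small back-of-the-envelope sketch. It assumes the textbook CMOS dynamic-power relation (switching power proportional to the number of pins times V² times f, with equal capacitance per pin); this is not the power model used in the article, and every number below is a hypothetical placeholder rather than a value from the paper.

```python
def peak_bandwidth_gbytes(bus_width_bits, data_rate_gtps):
    """Peak bandwidth in GB/s: bits per transfer times transfers per second, divided by 8."""
    return bus_width_bits * data_rate_gtps / 8.0


def relative_io_power(bus_width_bits, voltage_v, data_rate_gtps):
    """Unitless proxy for dynamic I/O power: pins * V^2 * f.

    Assumes the same switched capacitance per pin in both configurations,
    which is a simplification made only for this illustration.
    """
    return bus_width_bits * voltage_v ** 2 * data_rate_gtps


# Hypothetical configurations (illustrative values, not taken from the article).
narrow = dict(bus_width_bits=256, voltage_v=1.5, data_rate_gtps=4.0)   # pin-limited, fast interface
wide = dict(bus_width_bits=1024, voltage_v=1.2, data_rate_gtps=1.0)    # wide, slow, lower-voltage interface

bw_narrow = peak_bandwidth_gbytes(narrow["bus_width_bits"], narrow["data_rate_gtps"])
bw_wide = peak_bandwidth_gbytes(wide["bus_width_bits"], wide["data_rate_gtps"])
power_ratio = relative_io_power(**wide) / relative_io_power(**narrow)

print("peak bandwidth, narrow (GB/s):", bw_narrow)
print("peak bandwidth, wide (GB/s):", bw_wide)
print("wide / narrow dynamic I/O power proxy:", round(power_ratio, 2))
```

Under these assumptions the two configurations have the same peak bandwidth, because the product of width and data rate is unchanged, while the power proxy falls with the square of the voltage ratio. Real savings would also depend on factors this sketch ignores, such as wire capacitance on the interposer, termination, and static power.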

Dynamic Per-Warp Reconvergence Stack for Efficient Control Flow Handling in GPUs

Efficient Data Management for Incoherent Ray Tracing.

Improve GPGPU Latency Hiding with a Hybrid Recovery Stack and a Window Based Warp Scheduling Policy.

Improving branch divergence performance on GPGPU with a new PDOM stack and multi-level warp scheduling.

An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization.

DAW-DMR: Divergence-Aware Warped DMR with Full Error Detection for GPGPU S

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

Parallel Transient Stability-Constrained Optimal Power Flow Using GPU as Coprocessor.

Dynamic Stencil: Effective Exploitation of Run-Time Resources in Reconfigurable Clusters

A Stack-Centric Processing Model for Iterative Processing

High Performance Hardware Stack for Seamless Context Switching

Stack-based Parallel Recursion on Graphics Processors.

Warp-Aware Adaptive Energy Efficiency Calibration for Multi-GPU Systems

Re-Cache: Mitigating Cache Contention by Exploiting Locality Characteristics with Reconfigurable Memory Hierarchy for GPGPUs.

Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt)

LWSDP: Locality-Aware Warp Scheduling and Dynamic Data Prefetching Co-design in the Per-SM Private Cache of GPGPUs

Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface

Graph-oriented Code Transformation Approach for Register-Limited Stencils on GPUs

Coordinated Static and Dynamic Cache Bypassing for GPUs

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection