Abstract: The key to high performance on GPU architectures lies in their massive threading capability, which drives a large number of cores and overlaps execution among threads. In practice, however, the number of threads that can execute simultaneously is often limited by the size of the GPU register file. The traditional SRAM-based register file occupies so much chip area that it cannot scale to meet the growing demands of GPU applications. Racetrack memory (RM) is a promising technology for building large-capacity GPU register files because of its high storage density. Without careful deployment, however, the lengthy shift operations of RM can hurt performance. In this paper, we explore RM for designing a high-performance register file for GPU architectures. The high storage density of RM helps to improve thread-level parallelism (TLP), but if the bits of a register are not aligned with the access ports, shift operations are required to move those bits to the ports before they can be accessed, which delays read and write operations. We develop an optimization framework for RM-based register files on GPUs that applies three optimization techniques, one each at the application, compilation, and architecture levels. Specifically, we optimize the TLP at the application level, design a register mapping algorithm at the compilation level, and design a preshifting mechanism at the architecture level. Together, these optimizations help to determine a TLP that does not cause cache or register file resource contention, and they reduce the shift operation overhead. Experimental results on a variety of representative workloads demonstrate that our optimization framework improves performance by up to 29% (21% on average).
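
The shift overhead described in the abstract can be illustrated with a toy model. The sketch below is not the paper's register mapping algorithm. It assumes a single racetrack with one access port, where reaching a register costs as many shifts as its distance from the register currently aligned with the port. The function name `total_shifts`, the register names, and both mappings are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def total_shifts(accesses: List[str], position: Dict[str, int], start: int = 0) -> int:
    """Count shift steps on a single-port track (toy model, an assumption).

    The port starts aligned with track position `start`. Accessing a register
    stored at position p costs |p - current| shifts and leaves the port
    aligned with p.
    """
    current = start
    shifts = 0
    for reg in accesses:
        target = position[reg]
        shifts += abs(target - current)
        current = target
    return shifts


# Hypothetical register access trace: r0 and r3 are used back to back.
accesses = ["r0", "r3", "r0", "r3", "r1", "r2"]

# Mapping in register-number order: r0 and r3 sit at opposite ends of the track.
in_order = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}

# Mapping that places frequently co-accessed registers next to each other.
co_located = {"r0": 0, "r3": 1, "r1": 2, "r2": 3}

print(total_shifts(accesses, in_order))
print(total_shifts(accesses, co_located))
```

In this toy trace, placing `r0` and `r3` in adjacent positions lowers the shift count, which is the kind of effect a compile-time register mapping aims for. The model ignores multiple ports, multiple tracks, and preshifting, all of which would change the cost.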

PRF: a process-RAM-feedback performance model to reveal bottlenecks and propose optimizations

Modeling and Benchmarking Computing-in-Memory for Design Space Exploration

A design framework for processing-in-memory accelerator

Performance Modeling of Stencil Computation on SW26010 Processors

Performance Modeling Sparse MTTKRP Using Optical Static Random Access Memory on FPGA

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture

Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power

Performance-Centric Optimization for Racetrack Memory Based Register File on GPUs

Architecture-circuit-technology Co-Optimization for Resistive Random Access Memory-Based Computation-in-memory Chips

PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory

NeRF-PIM: PIM Hardware-Software Co-Design of Neural Rendering Networks

MNSIM-TIME: Performance Modeling Framework for Training-In-Memory Architectures

Asymmetric-access aware optimization for STT-RAM caches with process variations

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

SDP: Co-Designing Algorithm, Dataflow, and Architecture for In-SRAM Sparse NN Acceleration

Prefetching Techniques for STT-RAM Based Last-Level Cache in CMP Systems

Moguls: A Model To Explore The Memory Hierarchy For Bandwidth Improvements

RRAM-based Floating-Point In-Memory-Computing Architecture for High Throughput Signal Processing

Puppeteer: A Random Forest-based Manager for Hardware Prefetchers across the Memory Hierarchy

A Performance Model for Run-Time Reconfigurable Hardware Accelerator