Abstract: Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle ground, but they have traditionally been difficult compiler targets because they use a different memory abstraction. In contrast to CPUs and GPUs, the memory hierarchies of domain-specific accelerators use push memories: memories that send input data streams to computation kernels or to higher or lower levels in the memory hierarchy and store the resulting output data streams. To address the compilation challenge caused by push memories, we propose that the compiler represent these memories directly, combining storage with address generation and control logic in a single structure: a unified buffer. The unified buffer abstraction enables the compiler to separate generic push memory optimizations from the mapping to specific memory implementations in the backend. This separation allows our compiler to map high-level Halide applications to different CGRA memory designs, including some with a ready-valid interface. The separation also opens the opportunity for optimizing push memory elements on reconfigurable arrays. Our optimized memory implementation, the Physical Unified Buffer, uses a wide-fetch, single-port SRAM macro with built-in address generation logic to implement a buffer with two read and two write ports. It is 18% smaller and consumes 31% less energy than a physical buffer implementation using a dual-port memory that only supports two ports. Finally, our system evaluation shows that enabling a compiler to support CGRAs leads to performance and energy benefits. Over a wide range of image processing and machine learning applications, our CGRA achieves 4.7× better runtime and 3.5× better energy-efficiency compared to an FPGA.
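The push-memory idea described in the abstract, storage bundled with its own address generation and control, can be illustrated with a toy software model. The sketch below is not the paper's implementation: the names `affine_addresses`, `UnifiedBufferSketch`, and `run`, the affine (loop-nest) address pattern, and the fixed read delay are all assumptions chosen for illustration. It streams a 2×4 tile in row-major order and pushes it out in column-major order without any external address being supplied.

```python
from itertools import product


def affine_addresses(extents, strides, offset=0):
    """Yield offset + sum(i_k * stride_k) over a loop nest (hypothetical helper)."""
    for idx in product(*(range(e) for e in extents)):
        yield offset + sum(i * s for i, s in zip(idx, strides))


class UnifiedBufferSketch:
    """Toy push memory: storage plus its own address generators and control.

    This is an illustrative assumption, not the hardware described in the paper.
    """

    def __init__(self, capacity, write_addrs, read_addrs, read_delay):
        self.mem = [None] * capacity
        self.write_addrs = iter(write_addrs)
        self.read_addrs = iter(read_addrs)
        self.read_delay = read_delay  # cycles to wait before the first push
        self.cycle = 0

    def step(self, data_in=None):
        """Advance one cycle: store one input word, then maybe push one output word."""
        if data_in is not None:
            self.mem[next(self.write_addrs)] = data_in
        out = None
        if self.cycle >= self.read_delay:
            addr = next(self.read_addrs, None)
            if addr is not None:
                out = self.mem[addr]
        self.cycle += 1
        return out


def run(buffer, stream, total_cycles):
    """Feed the input stream into the buffer and collect the words it pushes out."""
    stream = iter(stream)
    outputs = []
    for _ in range(total_cycles):
        out = buffer.step(next(stream, None))
        if out is not None:
            outputs.append(out)
    return outputs


# Write a 2x4 tile in row-major order, read it back in column-major order.
buf = UnifiedBufferSketch(
    capacity=8,
    write_addrs=affine_addresses((8,), (1,)),     # 0, 1, 2, ..., 7
    read_addrs=affine_addresses((4, 2), (1, 4)),  # 0, 4, 1, 5, 2, 6, 3, 7
    read_delay=3,  # smallest delay for which every read follows its write
)
result = run(buf, range(10, 18), total_cycles=11)
print(result)  # [10, 14, 11, 15, 12, 16, 13, 17]
```

In this model the consumer never issues an address; the buffer's own schedule decides when each word is pushed out, which is the property that distinguishes a push memory from a cache or scratchpad that is read on demand.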

Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators

Coordinated Static and Dynamic Cache Bypassing for GPUs

ICGMM: CXL-enabled Memory Expansion with Intelligent Caching Using Gaussian Mixture Model

A Hybrid Approach to Cache Management in Heterogeneous CPU-FPGA Platforms.

Memory Coherency Based CPU-Cache-FPGA Acceleration Architecture for Cloud Computing

PCG: Mitigating Conflict-based Cache Side-channel Attacks with Prefetching

Buffer on Last Level Cache for CPU and GPGPU Data Sharing

At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads

POSTER: BACM: Barrier-Aware Cache Management for Irregular Memory-Intensive GPGPU Workloads

Design and Implementation of A High-Performance Microprocessor Cache Compression Algorithm

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

C-Pack: A High-Performance Microprocessor Cache Compression Algorithm

TriCache: A User-Transparent Block Cache Enabling High-Performance Out-of-Core Processing with In-Memory Programs

An Efficient Compiler Framework for Cache Bypassing on GPUs