Abstract: Image processing and machine learning applications benefit tremendously from hardware acceleration. Existing compilers target either FPGAs, which sacrifice power and performance for programmability, or ASICs, which become obsolete as applications change. Programmable domain-specific accelerators, such as coarse-grained reconfigurable arrays (CGRAs), have emerged as a promising middle ground, but they have traditionally been difficult compiler targets since they use a different memory abstraction. In contrast to CPUs and GPUs, the memory hierarchies of domain-specific accelerators use push memories: memories that send input data streams to computation kernels or to higher or lower levels in the memory hierarchy and store the resulting output data streams. To address the compilation challenge caused by push memories, we propose that the representation of these memories in the compiler be altered to directly represent them by combining storage with address generation and control logic in a single structure: a unified buffer. The unified buffer abstraction enables the compiler to separate generic push memory optimizations from the mapping to specific memory implementations in the backend. This separation allows our compiler to map high-level Halide applications to different CGRA memory designs, including some with a ready-valid interface. The separation also opens the opportunity for optimizing push memory elements on reconfigurable arrays. Our optimized memory implementation, the Physical Unified Buffer, uses a wide-fetch, single-port SRAM macro with built-in address generation logic to implement a buffer with two read and two write ports. It is 18% smaller and consumes 31% less energy than a physical buffer implementation using a dual-port memory that only supports two ports. Finally, our system evaluation shows that enabling a compiler to support CGRAs leads to performance and energy benefits. Over a wide range of image processing and machine learning applications, our CGRA achieves 4.7× better runtime and 3.5× better energy efficiency compared to an FPGA.
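To make the push-memory idea concrete, the following is a minimal Python sketch of a buffer that bundles storage with per-port address generation and control (scheduling) logic, as the abstract describes. It is an illustration under stated assumptions, not the paper's implementation: the class names `Port` and `UnifiedBufferSketch`, the affine form of the address and cycle generators, the circular addressing, and the rule that a write precedes a read in the same cycle are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Port:
    """Hypothetical port: a loop nest plus affine address and cycle generators."""
    extents: tuple        # loop bounds of the iteration domain, outermost first
    addr_strides: tuple   # address = addr_offset + sum(index * stride)
    addr_offset: int
    cycle_strides: tuple  # cycle = cycle_offset + sum(index * stride)
    cycle_offset: int

    def events(self):
        """Yield (cycle, address) for every point of the iteration domain."""
        for idx in product(*(range(e) for e in self.extents)):
            cycle = self.cycle_offset + sum(i * s for i, s in zip(idx, self.cycle_strides))
            addr = self.addr_offset + sum(i * s for i, s in zip(idx, self.addr_strides))
            yield cycle, addr


class UnifiedBufferSketch:
    """Storage plus one write port and one read port (assumed structure)."""

    def __init__(self, capacity, write_port, read_port):
        self.capacity = capacity
        self.mem = [None] * capacity
        self.write_port = write_port
        self.read_port = read_port

    def run(self, input_stream):
        """Consume the input stream; return the pushed output as (cycle, value) pairs."""
        inputs = iter(input_stream)
        # Tag writes with 0 and reads with 1 so that, after sorting, a write
        # scheduled in the same cycle as a read is applied first (an assumption).
        timeline = [(c, 0, a) for c, a in self.write_port.events()]
        timeline += [(c, 1, a) for c, a in self.read_port.events()]
        out = []
        for cycle, kind, addr in sorted(timeline):
            slot = addr % self.capacity  # circular addressing into the storage
            if kind == 0:
                self.mem[slot] = next(inputs)  # inputs arrive in cycle order
            else:
                out.append((cycle, self.mem[slot]))
        return out


# Usage: a one-cycle delay line over an 8-element stream, using 2 words of storage.
write_port = Port(extents=(8,), addr_strides=(1,), addr_offset=0,
                  cycle_strides=(1,), cycle_offset=0)
read_port = Port(extents=(7,), addr_strides=(1,), addr_offset=0,
                 cycle_strides=(1,), cycle_offset=1)
delay_line = UnifiedBufferSketch(2, write_port, read_port)
out = delay_line.run(range(10, 18))
```

In this sketch the consumer never issues an address: the buffer itself decides when and from where each value is emitted, which is the property that distinguishes a push memory from a cache or scratchpad that is pulled from by load instructions.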
