Abstract: The coarse-grained reconfigurable architecture (CGRA) has proven to be energy efficient in several specific domains. In CGRAs, the on-chip memory hierarchy, which comprises the context memory and the data memory organizations, must be designed carefully to achieve appropriate tradeoffs among three aspects: 1) performance; 2) area; and 3) power. In this paper, two techniques are proposed to compress the on-chip memory space and to reduce the reconfiguration time and the data-reference time: the hierarchical configuration context (HCC), which targets the context memory organization, and the lifetime-based data-memory organization (LDO), which targets the data memory organization. In the HCC, the contexts are constructed in a hierarchical fashion to completely eliminate the repetitive portions of the contexts, which not only reduces the overall context storage but also alleviates the context transportation overhead. A fast context-indexing mechanism is proposed in the HCC to achieve fast reconfiguration, as the hierarchically organized contexts can be located and accessed conveniently. In the LDO, the on-chip data are classified into two types based on their lifetime. The short-lifetime data are stored in a first-in, first-out (FIFO) buffer, which automatically increases the reuse ratio of the memory space, whereas the long-lifetime data are stored in random access memory (RAM) so that they can be referenced multiple times. The HCC and the LDO are used in a CGRA core called the reconfigurable processing unit (RPU). Two RPUs are integrated in a reconfigurable computing processor (RCP) called the REconfigurable MUlti-media System, High-Performance Processor (REMUS_HPP). Because of the HCC, the total context storage required in H.264 decoding is reduced by 77% compared with a traditional nonhierarchical system. Because of the LDO, the normalized on-chip data memory size of REMUS_HPP at the same performance level is only 23.8% and 14.8% of those in XPP-III (a high-performance RCP) and ADRES (a low-power RCP), respectively.
REMUS_HPP is implemented on 48.9 mm² of silicon in TSMC 65-nm technology and runs at a 200-MHz working frequency to achieve H.264 high-profile decoding at 1920 × 1088 and 30 fps. Compared with XPP-III, the performance of REMUS_HPP is 1.81× higher and its energy efficiency is 4.75× higher.
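The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's mechanism: the abstract does not specify the context format, the indexing scheme, or how lifetime is measured. The sketch assumes that a context is a list of hashable fragments, that deduplication is done through a single shared table, and that lifetime is the span between the first and last access cycle, compared against a hypothetical `threshold`. The function names are likewise illustrative.

```python
def build_hierarchical_contexts(contexts):
    """Store each distinct context fragment once (assumed scheme).

    `contexts` is a list of contexts, each a list of hashable fragments,
    a hypothetical stand-in for configuration words. Returns the table of
    unique fragments and each context rewritten as indices into that table.
    """
    table = []      # unique fragments, each stored exactly once
    index_of = {}   # fragment -> position in `table`
    indexed = []    # contexts expressed as lists of table indices
    for ctx in contexts:
        refs = []
        for frag in ctx:
            if frag not in index_of:
                index_of[frag] = len(table)
                table.append(frag)
            refs.append(index_of[frag])
        indexed.append(refs)
    return table, indexed


def classify_by_lifetime(accesses, threshold):
    """Split data items into short- and long-lifetime groups.

    `accesses` maps an item name to the cycles at which it is accessed.
    Lifetime is assumed to be last access minus first access; `threshold`
    is a hypothetical tuning parameter, not a value from the paper.
    """
    short_lived, long_lived = [], []
    for name, cycles in accesses.items():
        lifetime = max(cycles) - min(cycles)
        if lifetime <= threshold:
            short_lived.append(name)   # candidate for FIFO storage
        else:
            long_lived.append(name)    # candidate for RAM storage
    return short_lived, long_lived


contexts = [["add", "mul", "add"], ["add", "mul", "sub"]]
table, indexed = build_hierarchical_contexts(contexts)

accesses = {"pixel_row": [0, 1], "coeff_table": [0, 50, 90]}
short_lived, long_lived = classify_by_lifetime(accesses, threshold=4)
```

In this toy input, six fragment slots collapse to three stored fragments plus index lists, and only `coeff_table` is flagged for RAM because it is referenced again long after it is first accessed.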

Stream Processing Dual-Track CGRA for Object Inference

DT-CGRA: Dual-track Coarse-Grained Reconfigurable Architecture for Stream Applications

DaDianNao: A Machine-Learning Supercomputer

CFEACT: A CGRA-based Framework Enabling Agile CNN and Transformer Accelerator Design

Mixed-granularity Parallel Coarse-Grained Reconfigurable Architecture

Combining Memory Partitioning and Subtask Generation for Parallel Data Access on CGRAs

MDCRA: A Reconfigurable Accelerator Framework for Multiple Dataflow Lanes

Cost-Effective Memory Architecture to Achieve Flexible Configuration and Efficient Data Transmission for Coarse-Grained Reconfigurable Array (Abstract Only).

CATERPILLAR: Coarse Grain Reconfigurable Architecture for Accelerating the Training of Deep Neural Networks

STRELA: STReaming ELAstic CGRA Accelerator for Embedded Systems

HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration

An Architecture-Agnostic Dataflow Mapping Framework on CGRA

On-Chip Memory Hierarchy in One Coarse-Grained Reconfigurable Architecture to Compress Memory Space and to Reduce Reconfiguration Time and Data-Reference Time

Configuration Approaches to Enhance Computing Efficiency of Coarse-Grained Reconfigurable Array.

Automated Design Space Exploration of CGRA Processing Element Architectures using Frequent Subgraph Analysis

R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA

DFGNet: Mapping Dataflow Graph Onto CGRA by a Deep Learning Approach

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

A Dynamic Partial Reconfigurable CGRA Framework for Multi-Kernel Applications

Enhancing CGRA Efficiency Through Aligned Compute and Communication Provisioning

CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing