Abstract: Coarse-grained reconfigurable architectures (CGRAs) have proven energy efficient in several specific domains. In a CGRA, the on-chip memory hierarchy, which comprises the context memory and data memory organizations, must be designed carefully to achieve an appropriate tradeoff among three aspects: 1) performance; 2) area; and 3) power. In this paper, two techniques are proposed to compress the on-chip memory space and to reduce the reconfiguration time and the data-reference time: the hierarchical configuration context (HCC), which targets the context memory organization, and the lifetime-based data-memory organization (LDO), which targets the data memory organization. In the HCC, the contexts are constructed hierarchically to completely eliminate their repetitive portions, which not only reduces the overall context storage but also alleviates the context transportation overhead. A context-indexing mechanism is also proposed in the HCC to achieve fast reconfiguration, as the hierarchically organized contexts can be located and accessed conveniently. In the LDO, the on-chip data are classified into two types based on their lifetime. Short-lifetime data are stored in first-in, first-out (FIFO) memory, which automatically increases the reuse ratio of the memory space, whereas long-lifetime data are stored in random access memory (RAM) so that they can be referenced several times. The HCC and the LDO are used in a CGRA core called the reconfigurable processing unit (RPU). Two RPUs are integrated in a reconfigurable computing processor (RCP) called the REconfigurable MUlti-media System, High-Performance Processor (REMUS_HPP). Because of the HCC, the total context storage required in H.264 decoding is reduced by 77% compared with a traditional nonhierarchical system. Because of the LDO, the normalized on-chip data memory size of the REMUS_HPP at the same performance level is only 23.8% and 14.8% of those in XPP-III (a high-performance RCP) and ADRES (a low-power RCP), respectively.
REMUS_HPP is implemented on 48.9 mm² of silicon in TSMC 65-nm technology and runs at a 200-MHz working frequency to achieve 1920 × 1088 at 30 fps H.264 high-profile decoding. Compared with XPP-III, the REMUS_HPP delivers 1.81× higher performance and 4.75× higher energy efficiency.
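The storage saving claimed for the HCC can be illustrated with a toy model. The sketch below is not the REMUS_HPP context format, which the abstract does not specify. It assumes that a context is a sequence of fixed-size blocks of configuration words, that each distinct block is stored once in a shared table, and that each index entry costs one word. The names `flat_storage`, `build_hierarchy`, `hierarchical_storage`, and `fetch` are hypothetical.

```python
from typing import Dict, List, Tuple

Block = Tuple[int, ...]      # assumed: a block is a tuple of configuration words
Context = Tuple[Block, ...]  # assumed: a context is a sequence of blocks


def flat_storage(contexts: List[Context]) -> int:
    """Words stored when every context keeps its own copy of every block."""
    return sum(len(block) for ctx in contexts for block in ctx)


def build_hierarchy(contexts: List[Context]) -> Tuple[List[Block], List[List[int]]]:
    """Store each distinct block once; each context becomes a list of block indices."""
    block_index: Dict[Block, int] = {}
    blocks: List[Block] = []
    top_level: List[List[int]] = []
    for ctx in contexts:
        refs = []
        for block in ctx:
            if block not in block_index:
                # First occurrence: append the block to the shared table.
                block_index[block] = len(blocks)
                blocks.append(block)
            refs.append(block_index[block])
        top_level.append(refs)
    return blocks, top_level


def hierarchical_storage(blocks: List[Block], top_level: List[List[int]]) -> int:
    """Unique block payloads plus one word per index entry (assumed cost)."""
    return sum(len(b) for b in blocks) + sum(len(refs) for refs in top_level)


def fetch(blocks: List[Block], top_level: List[List[int]], ctx_id: int) -> Context:
    """Rebuild a full context by following its index entries."""
    return tuple(blocks[i] for i in top_level[ctx_id])
```

In this model, four contexts built from three distinct four-word blocks need 32 words when stored flat, but only 20 words (12 for the payloads plus 8 for the index entries) when stored hierarchically. The actual saving depends on how repetitive the real contexts are.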

On-Chip Memory Hierarchy in One Coarse-Grained Reconfigurable Architecture to Compress Memory Space and to Reduce Reconfiguration Time and Data-Reference Time
