Abstract: The coarse-grained reconfigurable architecture (CGRA) has proven to be energy efficient in several specific domains. In CGRAs, the on-chip memory hierarchy, which comprises the context memory and the data memory organizations, must be designed carefully to achieve appropriate tradeoffs among three aspects: 1) performance; 2) area; and 3) power. In this paper, two techniques are proposed to compress the on-chip memory space and to reduce the reconfiguration time and the data-reference time: the hierarchical configuration context (HCC), which targets the context memory organization, and the lifetime-based data-memory organization (LDO), which targets the data memory organization. In the HCC, the contexts are constructed hierarchically to completely eliminate their repetitive portions, which both reduces the overall context storage and alleviates the context transportation overhead. The HCC also includes a fast context-indexing mechanism that speeds up reconfiguration, because the hierarchically organized contexts can be located and accessed conveniently. In the LDO, the on-chip data are classified into two types according to their lifetime. The short-lifetime data are stored in first-in, first-out (FIFO) memory, which automatically increases the reuse ratio of the memory space, whereas the long-lifetime data are stored in random-access memory (RAM) so that they can be referenced multiple times. The HCC and the LDO are used in a CGRA core called the reconfigurable processing unit (RPU). Two RPUs are integrated in a reconfigurable computing processor (RCP) called the REconfigurable MUlti-media System, High-Performance Processor (REMUS_HPP). Because of the HCC, the total context storage required in H.264 decoding is reduced by 77% compared with a traditional nonhierarchical system. Because of the LDO, the normalized on-chip data memory size at the same performance level in the REMUS_HPP is only 23.8% and 14.8% of those in XPP-III (a high-performance RCP) and ADRES (a low-power RCP), respectively.
REMUS_HPP is implemented on 48.9 mm² of silicon in TSMC 65-nm technology and runs at a working frequency of 200 MHz to achieve H.264 high-profile decoding at 1920 × 1088 and 30 fps. Compared with XPP-III, the REMUS_HPP delivers 1.81× the performance and 4.75× the energy efficiency.
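
The hierarchical-context idea described above can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the HCC format used in the REMUS_HPP: it assumes a two-level hierarchy in which each context is a sequence of fixed-size blocks of configuration words, each distinct block is stored once in a shared table, and a context is reduced to a list of one-word indices into that table. The names `build_hierarchy`, `expand`, `flat_words`, and `hierarchical_words` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a block is a tuple of configuration words.
Block = Tuple[int, ...]


def build_hierarchy(contexts: List[List[Block]]) -> Tuple[List[Block], List[List[int]]]:
    """Store each distinct block once; each context becomes a list of block indices."""
    table: List[Block] = []            # lower level: unique blocks
    index_of: Dict[Block, int] = {}    # block -> position in table
    top_level: List[List[int]] = []    # upper level: per-context index lists
    for ctx in contexts:
        refs = []
        for block in ctx:
            if block not in index_of:
                index_of[block] = len(table)
                table.append(block)
            refs.append(index_of[block])
        top_level.append(refs)
    return table, top_level


def expand(table: List[Block], top_level: List[List[int]], ctx_id: int) -> List[Block]:
    """Rebuild one full context by direct index lookup into the block table."""
    return [table[i] for i in top_level[ctx_id]]


def flat_words(contexts: List[List[Block]]) -> int:
    """Storage of the nonhierarchical layout, in configuration words."""
    return sum(len(block) for ctx in contexts for block in ctx)


def hierarchical_words(table: List[Block], top_level: List[List[int]]) -> int:
    """Storage of the two-level layout, assuming each index costs one word."""
    return sum(len(block) for block in table) + sum(len(refs) for refs in top_level)


# Three contexts that share blocks a, b, and c.
a, b, c = (1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12)
contexts = [[a, b, a], [a, c, b], [b, b, a]]
table, top_level = build_hierarchy(contexts)
print(flat_words(contexts), hierarchical_words(table, top_level))
```

In this toy case the flat layout stores 36 words and the two-level layout stores 21 (12 words of unique blocks plus 9 indices). The actual savings reported in the abstract depend on the real context format and workload, which this sketch does not model.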

A Comprehensive Reconfigurable Computing Approach to Memory Wall Problem of Large Graph Computation

A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration

An optimized architecture for accelerating graph computing on FPGAs

Boosting the Performance of FPGA-based Graph Processor Using Hybrid Memory Cube: A Case for Breadth First Search

An Efficient Graph Accelerator with Distributed On-Chip Memory Hierarchy

SoGraph: A State-Aware Architecture for Out-of-Memory Graph Processing on HBM-Equipped FPGAs

GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing

On-Chip Memory Hierarchy in One Coarse-Grained Reconfigurable Architecture to Compress Memory Space and to Reduce Reconfiguration Time and Data-Reference Time

Scalable Multi-FPGA HPC Architecture for Associative Memory System

ReGraph: Scaling Graph Processing on HBM-enabled FPGAs with Heterogeneous Pipelines

A Novel ReRAM-based Processing-in-memory Architecture for Graph Computing

An Efficient ReRAM-based Accelerator for Asynchronous Iterative Graph Processing

Cost-Effective Memory Architecture to Achieve Flexible Configuration and Efficient Data Transmission for Coarse-Grained Reconfigurable Array (Abstract Only).

An Edge Re-Ordering Based Acceleration Architecture for Improving Data Locality in Graph Analytics Applications

Memory System Optimization for Graph Processing: a Survey

An STT-MRAM Based Reconfigurable Computing-in-memory Architecture for General Purpose Computing

ScalaGraph: A Scalable Accelerator for Massively Parallel Graph Processing

GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs

GraphR: Accelerating Graph Processing Using ReRAM

ForeGraph: Exploring Large-Scale Graph Processing on Multi-FPGA Architecture

A Task-Adaptive In-Situ ReRAM Computing for Graph Convolutional Networks