Abstract: Coarse-Grained Reconfigurable Arrays (CGRAs) are promising accelerators that offer low power consumption and high energy efficiency. In recent years, many research works have focused on improving the programmability of CGRAs by enabling fast reconfiguration during execution. The performance of these CGRAs hinges critically on the scheduling capability of the compiler. One of the key challenges is to reduce memory access conflicts using static compilation techniques. Memory access conflicts introduce synchronization overhead, which stalls the pipeline and reduces CGRA performance. Existing compilers usually tackle this challenge by orchestrating data placement in the on-chip global memory (OGM) of the CGRA so that parallel memory accesses avoid bank conflicts. However, we find that bank conflicts are not the only cause of memory access conflicts. In some CGRAs, the bandwidth of the data network between the OGM and the processing element array (PEA) is also limited by low-power design principles, and unbalanced load on this network is a second cause of memory access conflicts. Furthermore, redundant data accesses across iterations are one of the primary causes of memory access conflicts. Based on these observations, we provide a comprehensive and generalized compilation flow to reduce memory access conflicts. First, we develop a loop transformation model that maximizes inter-iteration data reuse, reducing the number of memory access operations under the software pipelining scheme. Second, we improve the bandwidth utilization of the network between the OGM and the PEA and avoid bank conflicts with a conflict-aware spatial mapping algorithm that can be easily integrated into existing CGRA modulo scheduling compilation flows. Experimental results show that our method improves performance by an average of 44% compared with a state-of-the-art CGRA compilation flow.
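
To make the two sources of conflict named in the abstract concrete, the following is a minimal illustrative sketch, not the paper's compilation flow. It assumes a hypothetical 3-point stencil loop and a cyclic bank mapping (bank = address mod number of banks); the function names `stencil_naive`, `stencil_reuse`, and `bank_conflicts` are invented for this illustration. The first pair of functions shows how carrying operands across iterations reduces the number of loads, and the last function counts how many accesses issued in the same cycle would be serialized because they hit the same bank.

```python
from collections import Counter


def stencil_naive(a):
    """3-point sum where every iteration reloads all three operands."""
    loads = 0
    out = []
    for i in range(1, len(a) - 1):
        left, mid, right = a[i - 1], a[i], a[i + 1]
        loads += 3  # three memory reads per iteration
        out.append(left + mid + right)
    return out, loads


def stencil_reuse(a):
    """Same result, but two operands are carried over from the previous iteration."""
    out = []
    if len(a) < 3:
        return out, 0
    left, mid = a[0], a[1]
    loads = 2  # initial reads before the loop starts
    for i in range(1, len(a) - 1):
        right = a[i + 1]
        loads += 1  # only one new memory read per iteration
        out.append(left + mid + right)
        left, mid = mid, right  # inter-iteration reuse
    return out, loads


def bank_conflicts(addresses, num_banks):
    """Count extra serialized accesses for addresses issued in one cycle.

    Assumes a cyclic mapping: bank = address % num_banks.
    """
    per_bank = Counter(addr % num_banks for addr in addresses)
    return sum(count - 1 for count in per_bank.values())


data = list(range(10))
naive_out, naive_loads = stencil_naive(data)
reuse_out, reuse_loads = stencil_reuse(data)
print(naive_loads, reuse_loads)
print(bank_conflicts([0, 4, 8, 1], 4), bank_conflicts([0, 1, 2, 3], 4))
```

Under these assumptions, fewer loads per iteration means fewer opportunities for accesses to collide, which is the intuition behind combining data reuse with conflict-aware placement. A real CGRA compiler would additionally have to model the OGM-to-PEA network bandwidth, which this sketch does not attempt.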

Memory-Aware Loop Paralleling for Coarse-Grained Reconfigurable Architectures

Mapping Loops of multimedia algorithms for Coarse-grained reconfigurable architectures

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures

Reducing Memory Access Conflicts with Loop Transformation and Data Reuse on Coarse-grained Reconfigurable Architecture

Combining Memory Partitioning and Subtask Generation for Parallel Data Access on CGRAs

Optimizing Spatial Mapping of Nested Loop for Coarse-Grained Reconfigurable Architectures

Map-reduce inspired loop parallelization on CGRA

Low-Power Loop Parallelization Onto CGRA Utilizing Variable Dual VDD

Exploiting Outer Loop Parallelism of Nested Loop on Coarse-Grained Reconfigurable Architectures

Data parallelism optimization for the CGRA loop pipelining mapping

An Automatic Parallelizer For Coarse-Grained Reconfigurable Processor

MapReduce Inspired Loop Mapping for Coarse-Grained Reconfigurable Architecture

Critical Loop Memory-Aware Mapping Onto Coarse-Grained Reconfigurable Architecture

Mixed-granularity Parallel Coarse-Grained Reconfigurable Architecture

Mapping Loops onto Coarse-Grained Reconfigurable Array Using Genetic Algorithm

Similarity-Aware Architecture/Compiler Co-Designed Context-Reduction Framework for Modulo-Scheduled CGRA

Optimizing Data Reuse for Loop Mapping on CGRAs with Joint Affine and Non-Affine Transformations

Improving Nested Loop Pipelining on Coarse-Grained Reconfigurable Architectures

Optimizing Data Reuse for CGRA Mapping Using Polyhedral-based Loop Transformations

Loop Acceleration by Cluster-Based CGRA

Mapping Multi-Level Loop Nests Onto CGRAs Using Polyhedral Optimizations