Abstract: In recent years, multimedia and game applications have grown rapidly in both quantity and complexity. Since these applications typically demand 10^10 to 10^11 operations per second, higher processing capability is expected. Stream processors are therefore becoming popular because of their performance advantages in domains such as signal processing and multimedia. To provide sufficient computing capability, stream processors employ multiple SIMD units. Moreover, to overcome the constraints of a centralized register file, a hierarchical register organization has been proposed and is widely used in stream processors. In the upper level of the hierarchy, the distributed register file (DRF) is the dominant design, with explicit interconnections among the DRFs that the compiler manages in a VLIW manner. To further exploit the good locality of multimedia applications, the lower level is a multi-banked register file in which each bank is accessed by several SIMD units through a shared data bus. We refer to an architecture with these characteristics as the MLRM-SIMD architecture. Although such a design suits multimedia processing, the low data bandwidth caused by the shared data bus between register levels severely impedes mapping programs onto the MLRM-SIMD architecture. If this constraint cannot be resolved well, both the parallelism among the SIMD units and overall program performance suffer. One of the major challenges for optimization techniques targeting the MLRM-SIMD architecture is therefore to resolve shared bus conflicts. However, when generating executable code for the MLRM-SIMD architecture, the compiler must simultaneously allocate many interdependent resources: the SIMD units on which the operations take place, the register files that hold the intermediate values, and the shared bus that transfers data to the SIMD units.
These conditions put very high pressure on the optimizing algorithms. Although operands with read-read reuse among different SIMD units can be replicated, replicating all data increases register pressure, and the resulting spilling and reloading of data also leads to data bus conflicts. Because there are interconnections among the DRFs, some optimizations can be performed to mitigate the low data bandwidth. If data is not required by many different SIMD units at the same time, it can be loaded by only one SIMD unit and transferred to the other SIMD units through the interconnections between registers when it is required. This approach causes no data bus conflicts and lowers local register pressure, because only one copy of the data is kept in the DRFs. This is the major motivation for our optimizing algorithm. In this paper, we present a novel data pipeline optimization that communicates read-read source operands among different SIMD units to reduce shared data bus conflicts. We use loop-level parallelism to exploit the multi-SIMD computing capability and read-read reuse communication to reduce data bus conflicts. In contrast, traditional pipeline algorithms usually exploit pipeline parallelism and communicate the computed results of prior stages. We therefore refer to our algorithm as data pipeline scheduling to distinguish it from traditional pipeline algorithms. This policy not only reduces bus conflicts but also relieves register pressure. Experimental results show that it improves performance by 12% over the traditional method based on replication only. Based on the experimental results, we offer some programming advice for the MLRM-SIMD architecture: when writing programs for it, it is better to maintain the original structure of the algorithms, which makes it much easier for compilers to exploit the parallelism in the programs and thus generate more efficient executable code.
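The difference between replication and operand forwarding can be illustrated with a toy model. The sketch below is not the paper's scheduling algorithm; it assumes a linear chain of SIMD units in which each unit can pass one operand per step to its neighbour over the DRF interconnect, and it only counts shared-bus loads. All function and variable names are hypothetical.

```python
def replication_bus_loads(num_units, values):
    """Replication baseline: every SIMD unit loads its own copy over the shared bus."""
    return num_units * len(values)


def data_pipeline(num_units, values):
    """Toy forwarding model (an assumption, not the paper's algorithm).

    Unit 0 loads each operand once over the shared bus; on every later step
    the operand moves one unit down the chain through the register
    interconnect. Operands are assumed never to be None.
    """
    regs = [None] * num_units              # operand currently held by each unit
    seen = [[] for _ in range(num_units)]  # operands each unit has consumed
    bus_loads = 0                          # transfers over the shared data bus
    hops = 0                               # transfers over the DRF interconnect

    # Extra empty steps drain the chain so the last unit sees every operand.
    stream = list(values) + [None] * (num_units - 1)
    for incoming in stream:
        # Shift from the far end first so nothing is overwritten early.
        for u in range(num_units - 1, 0, -1):
            regs[u] = regs[u - 1]
            if regs[u] is not None:
                hops += 1
        regs[0] = incoming
        if incoming is not None:
            bus_loads += 1
        for u in range(num_units):
            if regs[u] is not None:
                seen[u].append(regs[u])
    return bus_loads, hops, seen


if __name__ == "__main__":
    operands = [10, 20, 30]
    print("replication bus loads:", replication_bus_loads(4, operands))
    loads, hops, _ = data_pipeline(4, operands)
    print("pipeline bus loads:", loads, "interconnect hops:", hops)
```

In this model every unit still consumes every operand, but the shared bus carries each operand only once; the remaining traffic moves onto the register interconnect, and each unit holds a single operand at a time rather than a full replicated set.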

A One-for-All and O(V log(V))-Cost Solution for Parallel Merge Style Operations on Sorted Key-Value Arrays

Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

Efficient Algorithm Design of Optimizing SpMV on GPU.

Parallel Photonic Acceleration Processor for Matrix-Matrix Multiplication

A Hybrid Vectorized Merge Sort on ARM NEON

MeNDA: A Near-Memory Multi-way Merge Solution for Sparse Transposition and Dataflows

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication.

Chimera: an Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion

Optimizing Bandwidth Constraint Through Register Interconnection for Stream Processors

Towards Efficient SpMV on Sunway Manycore Architectures.

IOPS: An Unified SpMM Accelerator Based on Inner-Outer-Hybrid Product

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

A Comprehensive Performance Model of Sparse Matrix-Vector Multiplication to Guide Kernel Optimization

A sparse matrix vector multiplication accelerator based on high-bandwidth memory

Accelerating Unstructured SpGEMM using Structured In-situ Computing

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

PriorMSM: An Efficient Acceleration Architecture for Multi-Scalar Multiplication

The Implementation and Optimization of Parallel Linpack on Multi-Core Vector Accelerator

An Out-of-Core Dataflow Middleware to Reduce the Cost of Large Scale Iterative Solvers

AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction

DRAM-Based Acceleration of Open Modification Search in Hyperdimensional Space