Abstract: In recent years, multimedia and game applications have grown rapidly in both quantity and complexity. Because these applications typically demand $10^{10}$ to $10^{11}$ operations per second, they require higher processing capability. Stream processors are therefore becoming popular because of their performance advantages in domains such as signal processing and multimedia. To provide sufficient computing capability, stream processors employ multiple SIMD units. Moreover, to overcome the constraints of a centralized register file, a hierarchical register organization has been proposed and is widely used in stream processors. In the upper level of the hierarchy, the distributed register file (DRF) is the dominant design, and the DRFs are linked by explicit interconnections that the compiler manages in a VLIW manner. To further exploit the locality of multimedia applications, the lower level is a multi-banked register file in which each bank is accessed by several SIMD units through a shared data bus. We refer to an architecture with these characteristics as the MLRM-SIMD architecture. Although such a design suits multimedia processing, the low data bandwidth caused by the shared data bus between register levels severely impedes the mapping of programs onto the MLRM-SIMD architecture. If this constraint is not resolved well, both the parallelism among the SIMD units and overall program performance suffer. One of the major challenges for optimization techniques targeting the MLRM-SIMD architecture is therefore to resolve shared-bus conflicts. However, when generating executable code for the MLRM-SIMD architecture, the compiler must simultaneously allocate many interdependent resources: the SIMD units on which operations execute, the register files that hold intermediate values, and the shared bus that transfers data to the SIMD units.
These conditions put very high pressure on the optimization algorithms. Although read-read reuse operands can be replicated across SIMD units, replicating all data would increase register pressure, and the resulting spills and reloads would themselves cause data bus conflicts. Because the DRFs are interconnected, some optimizations can mitigate the low data bandwidth. If a datum is not required by many SIMD units at the same time, it can be loaded by a single SIMD unit and transferred to the other SIMD units through the register interconnections when they need it. This approach would cause no data bus conflicts and would lower local register pressure, because only one copy of the data is kept in the DRFs. This is the main motivation for our optimization algorithm. In this paper, we present a novel data pipeline optimization that communicates read-read source operands among SIMD units to reduce shared data bus conflicts. We use loop-level parallelism to exploit the computing capability of the multiple SIMD units and read-read reuse communication to reduce data bus conflicts. In contrast, traditional pipeline algorithms usually exploit pipeline parallelism and communicate the computed results of prior stages. We therefore refer to our algorithm as data pipeline scheduling to distinguish it from traditional pipeline algorithms. This policy not only reduces bus conflicts but also relieves register pressure. Experimental results show that it improves performance by 12% over the traditional method based on replication only. Based on the experimental results, we offer some programming advice for the MLRM-SIMD architecture: when writing programs for it, it is better to preserve the original structure of the algorithms, which makes it much easier for compilers to exploit the parallelism in the programs and thus generate more efficient executable code.
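The contrast between replication and register-to-register forwarding can be illustrated with a toy model. The sketch below is not the paper's scheduling algorithm. It assumes a hypothetical sliding-window loop in which SIMD unit `u` reads element `x[t + u]` at iteration `t`, so each element is reused by neighbouring units on successive iterations. It also assumes each DRF can receive an operand from the next unit's DRF. The function names and the access pattern are illustrative assumptions only, and the model counts shared-bus loads rather than measuring performance.

```python
def bus_loads_replication(num_units, num_iters):
    """Replication only: every unit fetches each operand it needs over the shared bus."""
    loads = 0
    for t in range(num_iters):
        for u in range(num_units):
            loads += 1  # unit u loads its own copy of x[t + u]
    return loads


def bus_loads_data_pipeline(num_units, num_iters):
    """Forwarding: an operand crosses the shared bus once, then moves between DRFs.

    Returns (bus_loads, interconnect_forwards).
    """
    regs = [None] * num_units  # index of the element currently held in each unit's DRF
    loads = 0
    forwards = 0
    for t in range(num_iters):
        new_regs = []
        for u in range(num_units):
            needed = t + u
            # Assumed topology: unit u can receive from unit u + 1 only.
            if u + 1 < num_units and regs[u + 1] == needed:
                forwards += 1  # operand arrives over the DRF interconnection
            else:
                loads += 1  # operand must come over the shared data bus
            new_regs.append(needed)
        regs = new_regs
    return loads, forwards


if __name__ == "__main__":
    units, iters = 4, 8
    print("replication bus loads:", bus_loads_replication(units, iters))
    print("data pipeline (bus loads, forwards):", bus_loads_data_pipeline(units, iters))
```

Under these assumptions, replication issues one bus load per unit per iteration. With forwarding, only the initial fill and one new element per iteration use the shared bus, and the remaining operands travel over the DRF interconnections. Real access patterns, interconnect topologies, and scheduling constraints will differ from this idealized case.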

A Case for a Flexible Scalar Unit in SIMT Architecture

SIMD$^2$: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

MPU: Towards Bandwidth-abundant SIMT Processor via Near-bank Computing

Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures

Optimizing the performance of Lattice Gauge Theory simulations with Streaming SIMD extensions

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

Optimizing Compiler for Shared-Memory Multiple Simd Architecture

MIMD Programs Execution Support on SIMD Machines: A Holistic Survey

Designing and Implementing a Generator Framework for a SIMD Abstraction Library

SIMDify: Framework for SIMD-Processing with RISC-V Scalar Instruction Set

CIS: Composable Instruction Set for Streaming Applications: Design, Modeling, and Scheduling

A Quantitative Evaluation of Vector Transcendental Functions on ARMv8-Based Processors

Optimizing Bandwidth Constraint Through Register Interconnection for Stream Processors

FT-Matrix: A Coordination-Aware Architecture for Signal Processing

SIMD Code Translation in an Enhanced HQEMU

Improving SIMD Parallelism via Dynamic Binary Translation

Instruction Scheduling in the Saturn Vector Unit

A Hybrid Sorting Algorithm on Heterogeneous Architectures

Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms

MCMG Simulator: A Unified Simulation Framework for CPU and Graphic GPU