Abstract: The Matrix2 accelerator is a high-performance multi-core vector processor for high-density computing that supports fused multiply-add (FMA) instructions. We propose an efficient vectorization method for large-scale 1D FFT, tailored to the architectural characteristics of Matrix2. (1) An FFT vectorization method based on FMA instructions is proposed to accelerate FFT computation. By restructuring the operation flow of the FFT butterfly, the independent multiplications and additions of the conventional method are merged into a smaller number of FMA operations. This reduces the real floating-point operation count of a radix-2 butterfly from 10 multiplications/additions to 6 FMA operations, and that of a radix-4 butterfly from 34 multiplications/additions to 24 FMA operations. (2) An FFT vectorization method based on the matrix Fourier algorithm is designed, which converts the 1D FFT into a 2D FFT computed in three steps: column FFTs, multiplication of the column FFT result by a factor matrix, and row FFTs. All three steps are vectorized. (3) A data layout and update method for the factor matrix is proposed, which greatly reduces the memory needed to store the factor matrix. Combining the column FFT with the factor-matrix multiplication avoids repeated data transfers between the vector array memory and the global cache, which significantly improves the computational efficiency of the FFT. (4) A double-buffered DMA mechanism is adopted to smooth data transfers across the multi-level storage hierarchy and to overlap data transfer time with computation time, reducing the total computation time.
Experimental results on Matrix2 show that the proposed vectorization method improves the computational efficiency of large-scale 1D FFT by 5.56 times on average.
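
The operation counts in (1) and the three steps in (2) can be illustrated with a small scalar sketch in Python. It is not the Matrix2 implementation. All function names (`fma`, `butterfly_fma`, `fft_radix2`, `matrix_fourier_fft`, `naive_dft`) are hypothetical. The particular 6-FMA arrangement, which obtains the second butterfly output as 2a - X, is one known way to reach that count and is assumed here rather than taken from the paper. Plain Python arithmetic does not fuse the multiply and add, so `fma` models only the operation count, not single rounding. Both matrix dimensions are assumed to be powers of two.

```python
import cmath

def fma(a, b, c):
    # Stand-in for a hardware fused multiply-add; rounds twice in plain Python.
    return a * b + c

def butterfly_fma(ar, ai, br, bi, wr, wi):
    """Radix-2 butterfly X = a + w*b, Y = a - w*b using 6 FMA operations."""
    xr = fma(-wi, bi, fma(wr, br, ar))  # ar + wr*br - wi*bi  (2 FMAs)
    xi = fma(wi, br, fma(wr, bi, ai))   # ai + wr*bi + wi*br  (2 FMAs)
    yr = fma(2.0, ar, -xr)              # 2*ar - xr           (1 FMA)
    yi = fma(2.0, ai, -xi)              # 2*ai - xi           (1 FMA)
    return xr, xi, yr, yi

def fft_radix2(x):
    """Recursive decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        xr, xi, yr, yi = butterfly_fma(even[k].real, even[k].imag,
                                       odd[k].real, odd[k].imag,
                                       w.real, w.imag)
        out[k] = complex(xr, xi)
        out[k + n // 2] = complex(yr, yi)
    return out

def matrix_fourier_fft(x, rows, cols):
    """Length rows*cols FFT of x, viewed as a row-major rows x cols matrix."""
    n = rows * cols
    assert len(x) == n
    a = [list(x[r * cols:(r + 1) * cols]) for r in range(rows)]
    # Step 1: length-`rows` FFT down every column.
    for c in range(cols):
        col = fft_radix2([a[r][c] for r in range(rows)])
        for k1 in range(rows):
            a[k1][c] = col[k1]
    # Step 2: multiply element (k1, c) by the factor exp(-2*pi*i*k1*c/n).
    for k1 in range(rows):
        for c in range(cols):
            a[k1][c] *= cmath.exp(-2j * cmath.pi * k1 * c / n)
    # Step 3: length-`cols` FFT along every row.
    a = [fft_radix2(row) for row in a]
    # Read the result out in transposed order: X[k1 + rows*k2] = a[k1][k2].
    out = [0j] * n
    for k1 in range(rows):
        for k2 in range(cols):
            out[k1 + rows * k2] = a[k1][k2]
    return out

def naive_dft(x):
    """O(n^2) reference DFT, used only to check the sketch."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]
```

For example, `matrix_fourier_fft(x, 4, 8)` on a 32-point input should agree with `naive_dft(x)` up to floating-point rounding. In this sketch, step 2 recomputes each factor on the fly. The abstract's point (3) concerns how such factors are laid out and updated in memory, which is not modeled here.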

Efficient Large-Scale 1D FFT Vectorization on Multi-Core Vector Accelerator
