Abstract: Matrix decomposition is a fundamental operation in linear algebra, with applications in machine learning, signal processing, edge computing, and many other fields. Singular Value Decomposition (SVD) is a matrix decomposition method that factors a matrix into three matrices: two orthogonal matrices and a diagonal matrix. With the development of domestic high-performance Digital Signal Processors (DSPs), the demand for matrix computation on DSP platforms is increasing, which makes DSP-based SVD implementations an important research topic. However, achieving a high-performance implementation requires developers who are familiar with the hardware characteristics, so that the specific features of the algorithm can be matched to the limited hardware resources. To reduce the cost of computing the SVD of a matrix, we implement a vectorization mapping method for the SVD algorithm on the FT-M7002. The single instruction multiple data (SIMD) instructions of the FT-M7002 processor are used to exploit the data-level parallelism in the SVD algorithm. Instead of relying on data movement and the scalar processing unit (SPU), we compute with a single vector processing element (VPE). Additionally, a DMA transfer algorithm is designed to implement matrix transposition and resolve the issue of discontinuous data access. Experimental results show that the optimized SVD algorithm improves execution performance by up to 5.0× relative to the original SVD algorithm on the FT-M7002. Furthermore, we demonstrate that the optimized SVD algorithm on the FT-M7002 runs 1.0-2.0× faster than the optimized SVD algorithm on the TMS320C6678 processor.
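As an illustration of where the data-level parallelism in an SVD comes from, the following is a minimal sketch of a one-sided Jacobi (Hestenes) SVD in plain Python. The abstract does not state which SVD algorithm is used, so the choice of one-sided Jacobi, the function name `jacobi_svd`, and its parameters are assumptions made for this example only. The inner loops are dot products and plane rotations over pairs of columns, which is the kind of work that maps onto SIMD lanes; the matrix is held column by column so that each rotation touches two contiguous vectors, the layout a transposition step would provide.

```python
import math

def jacobi_svd(a, sweeps=30, tol=1e-12):
    """One-sided Jacobi SVD sketch for an m x n matrix (list of rows, m >= n).

    Returns (u, s, v) such that a is approximately u * diag(s) * v^T,
    with singular values in s sorted in descending order.
    """
    m, n = len(a), len(a[0])
    # Store the matrix column by column (a transposed copy).
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    # v[j] holds column j of V, initialised to the identity.
    v = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]

    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Three dot products per column pair: the vectorizable part.
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already orthogonal enough
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to the data columns and to V.
                for vecs in (cols, v):
                    vp, vq = vecs[p], vecs[q]
                    for k in range(len(vp)):
                        xp, xq = vp[k], vq[k]
                        vp[k] = c * xp - s * xq
                        vq[k] = s * xp + c * xq
        if not rotated:
            break

    # Column norms are the singular values; normalised columns form U.
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    order = sorted(range(n), key=lambda j: -norms[j])
    sing = [norms[j] for j in order]
    # A zero singular value leaves the matching column of U as zeros.
    u = [[cols[j][i] / norms[j] if norms[j] > 0.0 else 0.0 for j in order]
         for i in range(m)]
    v_rows = [[v[j][i] for j in order] for i in range(n)]
    return u, sing, v_rows
```

For example, `jacobi_svd([[3.0, 0.0], [4.0, 5.0]])` yields singular values close to `sqrt(45)` and `sqrt(5)`, and multiplying the three factors back together reproduces the input up to rounding error.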

Sparse Matrix-Vector Multiply on the Texas Instruments C6678 Digital Signal Processor

Optimizing General Matrix Multiplications on Modern Multi-core DSPs

Advancing Matrix Decomposition Efficiency: A Study on FT-Matrix DSP Based SVD Optimization.

Fast AVS Prediction Residual and Integer DCT Implementations for VLIW DSP

Efficiently Running SpMV on Multi-core DSPs for Banded Matrix

Vectorizable Design and Implementation of Matrix Multiplication on Vector Processor

Highly Paralleled Low-Cost Embedded HEVC Video Encoder on TI KeyStone Multicore DSP

Design of a Reconfigurable Coprocessor for Double Precision Floating Point Matrix Algorithms

Parallel Processing of MIMO Radar Algorithm on Multi-Core Digital Signal Processor

Optimizing SpMV on Heterogeneous Multi-Core DSPs Through Improved Locality and Vectorization

Optimization of Quasi-Diagonal Matrix-Vector Multiplication on GPU

Research of SAR Signal Processor Based on Multicore DSP 6678

DSPIMM: A Fully Digital SParse In-Memory Matrix Vector Multiplier for Communication Applications

Advancing DSP into HPC, AI, and Beyond: Challenges, Mechanisms, and Future Directions.

Optimizing sparse general matrix–matrix multiplication for DCUs

DSP Based Acceleration for Long Short-Term Memory Model Based Word Prediction Application

Optimizing sparse matrix-vector multiplication based on GPU

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

Performance Optimization of Deep Learning Sparse Matrix Kernels on Intel Max Series GPU

Design and Implementation of Floating-Point Multiply-Accumulate Processing Element under SMVM System

A Super-Vector Deep Learning Coprocessor with High Performance-Power Ratio.