Abstract: Deep neural networks and scientific computing applications achieve superior performance thanks to advances in hardware accelerators such as GPUs. Recently, the computing capacity of emerging hardware accelerators has increased rapidly, while memory bandwidth has not kept pace. This disparity widens the gap between computing and memory and makes conventional algorithms inefficient, as they are likely to shift from compute-bound to memory-bound. Symmetric eigenvalue decomposition (EVD), a critical operation in research domains including scientific computing, deep learning training, and inference algorithms, performs poorly as a result, achieving less than 3\% hardware computing utilization on the H100 GPU. In this paper, we analyze the features of emerging hardware accelerators to identify the bottlenecks inherent in conventional EVD algorithms. To improve EVD performance, we propose several algorithmic optimizations that address the memory-bound problem and make better use of the abundant computing capacity and parallelism of emerging hardware accelerators. In our experiments, the proposed method delivers significant speedups on tridiagonalization, the main workload, which accounts for over 90\% of the elapsed time of EVD. Compared to the state-of-the-art (SOTA) cuSOLVER tridiagonalization, it achieves up to 10.1x, 7.5x, and 2.3x improvements on H100, A100, and RTX 4090 GPUs, respectively. The end-to-end performance of the EVD solver is also up to 4.1x faster than cuSOLVER.

A Novel Fully Hardware-Implemented SVD Solver Based on Ultra-Parallel BCV Jacobi Algorithm

Scalable SVM Processor and Its Application to Nonlinear Channel Equalization

W-Cycle SVD: A Multilevel Algorithm for Batched SVD on GPUs

Extracting the Potential of Emerging Hardware Accelerators for Symmetric Eigenvalue Decomposition

A W-cycle Algorithm for Efficient Batched SVD on GPUs

Implementation of a Parallel Sparse Direct Solver on Vector Architecture

Jacobi solver: A fast FPGA-based engine system for Jacobi method

Hardware implementation of transform and quantization for AVS encoder

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

A High-Performance VLSI Architecture for CABAC Decoding in H.264/AVC

FPGA and GPU Implementation of Large Scale SpMV

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication

Limited Memory Block Krylov Subspace Optimization for Computing Dominant Singular Value Decompositions

An Integral-equation-oriented Vectorized SpMV Algorithm and Its Application on CT Imaging Reconstruction

Hardware Acceleration for the Banded Smith-Waterman Algorithm with the Cycled Systolic Array

svds-C: A Multi-Thread C Code for Computing Truncated Singular Value Decomposition

PRIMME_SVDS: A High-Performance Preconditioned SVD Solver for Accurate Large-Scale Computations

Field Programmable Gate Array (FPGA) Implementation of Parallel Jacobi for Eigen-Decomposition in Direction of Arrival (DOA) Estimation Algorithm

Tucker Tensor Decomposition on FPGA

Development of Krylov and AMG linear solvers for large-scale sparse matrices on GPUs

Parallel (M−N)SVD Algorithms on the SIMD Computers