Abstract: Large-scale supercomputers equipped with GPUs as accelerators have the potential to meet the demands of future exascale computing. In this work, the solution of large, sparse linear systems of equations by Krylov subspace methods, which is crucial to the overall performance of many industrial and scientific applications, is accelerated using the much greater computing power of GPUs. To achieve this on target hardware with a large number of heterogeneous computing nodes, this work makes two main contributions. First, we propose a communication-avoiding variant of the BICGStab method that reduces the number of global synchronization points per iteration from 3 in the classical BICGStab method to 1. Because each synchronization point requires expensive global communication among all computing processes, this reduction is expected to pay off on large-scale distributed-memory clusters. Second, to handle host-to-accelerator data transfers, the main challenge in using heterogeneous architectures, we propose a communication-overlapped implementation of the sparse matrix–vector multiplication, since this kernel features heavily in Krylov subspace methods. Linear systems arising from the incompressible Navier–Stokes equations are used to evaluate the proposed solution and optimization methods. The GPU and CPU implementations are evaluated on up to 256 GPUs and 4096 CPU cores, respectively. The results show that, for the same computation time, the GPU implementation needs half as many computing nodes when each heterogeneous node is equipped with 4 GPUs and a 32-core CPU. This result demonstrates the advantage of heterogeneous architectures from the application point of view and motivates their wider use in related areas.
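To make the count of synchronization points concrete, the following is a minimal single-process sketch of the classical, unpreconditioned BICGStab iteration in Python with NumPy. It is not the communication-avoiding variant proposed in this work, whose details the abstract does not give. The function name `bicgstab_classical`, the counter `syncs`, and the test matrix are illustrative assumptions. Each group of inner products marked as a sync would require one global reduction across all processes in a distributed-memory run, and grouping them into three phases is one common way to arrive at the figure of 3 quoted above.

```python
import numpy as np

def bicgstab_classical(A, b, tol=1e-10, max_iter=200):
    """Classical unpreconditioned BiCGStab (illustrative sketch).

    Returns the solution, the iteration count, and the number of
    reduction phases that would each be a global synchronization
    point on a distributed-memory machine.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    r0 = r.copy()              # fixed shadow residual
    p = r.copy()
    rho = r0 @ r               # setup reduction, not counted per iteration
    b_norm = np.linalg.norm(b)
    syncs = 0
    for k in range(1, max_iter + 1):
        v = A @ p                          # first matrix-vector product
        alpha = rho / (r0 @ v)             # sync 1: one inner product
        syncs += 1
        s = r - alpha * v
        t = A @ s                          # second matrix-vector product
        ts, tt = t @ s, t @ t              # sync 2: two fused inner products
        syncs += 1
        omega = ts / tt
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho_new, rr = r0 @ r, r @ r        # sync 3: rho and residual norm, fused
        syncs += 1
        if np.sqrt(rr) <= tol * b_norm:
            return x, k, syncs
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        rho = rho_new
    return x, max_iter, syncs

# Small nonsymmetric, diagonally dominant test system (an assumption
# chosen for illustration, unrelated to the Navier-Stokes systems above).
n = 50
A = 4.0 * np.eye(n) - np.eye(n, k=-1) - 2.0 * np.eye(n, k=1)
b = np.ones(n)
x, iters, syncs = bicgstab_classical(A, b)
print(iters, syncs)
```

In this sketch the three reduction phases are separated by vector updates and matrix-vector products that depend on their results, which is why they cannot simply be merged without reformulating the recurrences.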

Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators

A Shift Selection Strategy for Parallel Shift-Invert Spectrum Slicing in Symmetric Self-Consistent Eigenvalue Computation

An Efficient Parallel Krylov-Schur Method for Eigen-Analysis of Large-Scale Power Systems

Parallel Eigenvalue Calculation Based On Multiple Shift-Invert Lanczos And Contour Integral Based Spectral Projection Method

A Novel Fully Hardware-Implemented SVD Solver Based on Ultra-Parallel BCV Jacobi Algorithm

Extracting the Potential of Emerging Hardware Accelerators for Symmetric Eigenvalue Decomposition

Numerical eigen-spectrum slicing, accurate orthogonal eigen-basis, and mixed-precision eigenvalue refinement using OpenMP data-dependent tasks and accelerator offload

A Distributed Block Chebyshev-Davidson Algorithm for Parallel Spectral Clustering

Solving Dense Generalized Eigenproblems on Multi-threaded Architectures

Orthogonal layers of parallelism in large-scale eigenvalue computations

Accelerating large partial EVD/SVD calculations by filtered block Davidson methods

A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem

Towards a Heterogeneous Architecture Solver for the Incompressible Navier–Stokes Equations

Accelerating Nonlinear Inversion Algorithms on Gpu Platform for Electromagnetic Data

Accelerating an Iterative Eigensolver for Nuclear Structure Configuration Interaction Calculations on GPUs Using OpenACC

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

Towards Accelerating Irregular EDA Applications with GPUs

Accelerated Subspace Iteration with Aggressive Shift

K-way spectral graph partitioning for load balancing in parallel computing

Advancing the distributed Multi-GPU ChASE library through algorithm optimization and NCCL library