Abstract: Large-scale supercomputers equipped with GPUs as accelerators have the potential to meet the demands of future exascale computing. In this work, the solution of large, sparse linear systems of equations by Krylov subspace methods, which is crucial to the overall performance of many industrial and scientific applications, is accelerated using the much greater computing power of GPUs. To achieve this on target hardware with a large number of heterogeneous computing nodes, this work makes two main contributions. First, we propose a communication-avoiding variant of the BICGStab solution method that reduces the number of global synchronization points per iteration from 3 in the classical BICGStab method to 1. On a large-scale distributed-memory cluster, this is expected to pay off by reducing the expensive global communication across all computing processes. Second, to handle host-to-accelerator data transfers, the main challenge in using heterogeneous architectures, we propose a communication-overlapped implementation of the sparse matrix–vector multiplication, since this kernel features heavily in Krylov subspace methods. Linear systems of equations arising from the incompressible Navier–Stokes equations are used to evaluate the proposed solution and optimization methods. The GPU and CPU implementations are evaluated on up to 256 GPUs and 4096 CPU cores, respectively. The results show that, on heterogeneous nodes each equipped with 4 GPUs and a 32-core CPU, the GPU implementation needs half as many computing nodes as the CPU implementation to reach the same computation time. From the viewpoint of applications, this result demonstrates the advantage of the heterogeneous architecture and motivates its wider use in other related areas.
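
To make the count of global synchronization points concrete, the following is a minimal sketch of the classical (textbook) BiCGStab iteration, not the communication-avoiding variant proposed in this work. It assumes a dense list-of-lists matrix and a zero initial guess; the names `Reducer`, `matvec`, `axpy`, and `bicgstab` are hypothetical helpers introduced only for illustration. Each call to `Reducer.dots` stands in for one global all-reduce on a distributed-memory machine, and fusing the residual norm into the first reduction is a choice made here, not something the abstract specifies.

```python
import math


class Reducer:
    """Stand-in for a global all-reduce that counts synchronization points."""

    def __init__(self):
        self.rounds = 0

    def dots(self, *pairs):
        # One call models one global synchronization; several inner
        # products can share it, as in a single fused all-reduce.
        self.rounds += 1
        return [sum(a * b for a, b in zip(u, v)) for u, v in pairs]


def matvec(A, x):
    """Dense matrix-vector product; a real solver would use a sparse format."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def bicgstab(A, b, tol=1e-10, max_iter=100):
    """Classical BiCGStab with a zero initial guess.

    Returns (x, completed_iterations, reduction_rounds).
    """
    red = Reducer()
    n = len(b)
    x = [0.0] * n
    r = list(b)        # r0 = b - A*0
    r_hat = list(r)    # fixed shadow residual
    p = [0.0] * n
    v = [0.0] * n
    rho_old = alpha = omega = 1.0
    threshold = None
    iters = 0
    for _ in range(max_iter):
        # Sync 1: rho = (r_hat, r), fused with (r, r) for the stopping test.
        rho, rr = red.dots((r_hat, r), (r, r))
        if threshold is None:
            threshold = tol * math.sqrt(rr)   # relative to the initial residual
        if math.sqrt(rr) <= threshold:
            break
        beta = (rho / rho_old) * (alpha / omega)
        p = axpy(beta, axpy(-omega, v, p), r)  # p = r + beta * (p - omega * v)
        v = matvec(A, p)
        # Sync 2: (r_hat, v) is needed before alpha can be formed.
        (rhv,) = red.dots((r_hat, v))
        alpha = rho / rhv
        s = axpy(-alpha, v, r)                 # s = r - alpha * v
        t = matvec(A, s)
        # Sync 3: (t, s) and (t, t) are needed before omega can be formed.
        ts, tt = red.dots((t, s), (t, t))
        omega = ts / tt if tt > 0.0 else 0.0   # tt == 0 means s is already zero
        x = axpy(omega, s, axpy(alpha, p, x))  # x = x + alpha * p + omega * s
        r = axpy(-omega, t, s)                 # r = s - omega * t
        rho_old = rho
        iters += 1
    return x, iters, red.rounds
```

In this sketch each completed iteration issues three reduction rounds, because `alpha` and `omega` each depend on an inner product that cannot be computed until the preceding vector update has finished. A converged run therefore reports `3 * iters + 1` rounds, the extra one being the final stopping test. A communication-avoiding variant has to reorganize the recurrences so that these inner products can be grouped into a single reduction.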
