Abstract: The tridiagonal system solver is an important kernel in many scientific and engineering applications. Although quite a few parallel algorithms and implementations have been proposed in recent years, challenges remain when solving large-scale tridiagonal systems on heterogeneous supercomputers. In this paper, a hierarchical algorithm framework, SPIKE² (pronounced 'SPIKE squared'), is proposed to minimize the parallel overhead and to achieve the best utilization of CPU-GPU hybrid systems. In these systems, a layered and adaptive partitioning based on the SPIKE algorithm is presented to effectively control the sequential parts while efficiently exploiting the overlap of computation and communication within heterogeneous computing nodes. Moreover, the SPIKE algorithm is reformulated to reduce the matrix computations to only one third in our hierarchical algorithm framework. Meanwhile, an improved implementation of the tiled-PCR-pThomas algorithm is employed for the GPU architecture, and the shared memory usage on the GPU is reduced by one third through careful dependence analysis when solving unit-vector tridiagonal systems. Our experiments on Tianhe-1A show ideal weak scalability on up to 128 nodes, solving a tridiagonal system of size 1920M in the largest run, and good strong scalability (70%) from 32 nodes to 256 nodes when solving a tridiagonal system of size 480M. Furthermore, the adaptive task partition across the CPU and GPU yields over 10% performance improvement in the strong scaling test with 256 nodes. On one computing node of Tianhe-1A, our GPU-only code outperforms the CUSPARSE version (non-pivoting tridiagonal solver) by 30%, and our hybrid code is about 6.7 times faster than the Intel SPIKE multi-process version for tridiagonal systems of size 3M, 5M, and 15M.
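For readers unfamiliar with the building blocks named above, the sequential Thomas algorithm that the pThomas stage parallelizes can be sketched as follows. This is a minimal, non-pivoting illustration in plain Python, not the paper's GPU implementation; the function name `thomas_solve` and the argument names `lower`, `diag`, `upper`, and `rhs` are assumptions introduced here for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system without pivoting (Thomas algorithm).

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. All four lists have length n.
    Assumes the matrix is well conditioned for elimination without
    pivoting (for example, diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The forward sweep is inherently sequential, since each row depends on the previous one. That dependence is why partition-based schemes such as SPIKE split the system into blocks that can be solved independently and then coupled through a much smaller reduced system.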

Exploiting Hierarchical Parallelism and Reusability in Tensor Kernel Processing on Heterogeneous HPC Systems

Performance Analysis and Optimization for MTTKRP of Sparse Tensor on CPU and GPU

A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU Via GCN

Shared Memory Parallelization of MTTKRP for Dense Tensors

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU

A Novel Parallel Algorithm for Sparse Tensor Matrix Chain Multiplication via TCU-Acceleration

Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction

Analyzing the Performance Portability of Tensor Decomposition

Optimizing Sparse Tensor Times Matrix on GPUs

SpTFS: Sparse Tensor Format Selection for MTTKRP Via Deep Learning

Optimizing Sparse Tensor Times Matrix on Multi-Core and Many-Core Architectures

Input-aware Sparse Tensor Storage Format Selection for Optimizing MTTKRP

swCPD - Optimizing Canonical Polyadic Decomposition on Sunway Manycore Architecture

Efficient Processing of Sparse Tensor Decomposition via Unified Abstraction and PE-Interactive Architecture

A Hierarchical Tridiagonal System Solver for Heterogenous Supercomputers

Towards efficient canonical polyadic decomposition on sunway many-core processor

A New Hybrid Hierarchical Parallel Algorithm to Enhance the Performance of Large-Scale Structural Analysis Based on Heterogeneous Multicore Clusters

High Performance Unstructured SpMM Computation Using Tensor Cores

Aesptv: an Adaptive and Efficient Framework for Sparse Tensor-Vector Product Kernel on a High-Performance Computing Platform

Releasing the Potential of Tensor Core for Unstructured SpMM Using Tiled-CSR Format