Abstract: For software to fully exploit the computing power of emerging heterogeneous computers, the required computational kernels must be optimized for the specific hardware architectures, and an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subprograms (BLAS)‐2 symmetric matrix‐vector multiplication and the BLAS‐3 low‐rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi‐GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU‐GPU kernel into computational kernels at higher levels of the software stack, namely a shared‐memory dense eigensolver and a distributed‐memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher‐level kernels, both reducing the solution time and enabling the solution of larger‐scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms have algorithmic characteristics that are found in other algorithms. Hence, they are not only important computational kernels in their own right but also useful testbeds for studying the performance of emerging computers and the effects of various optimization techniques. Copyright © 2013 John Wiley & Sons, Ltd.
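
To make the two kernels named in the abstract concrete, the following is a minimal single-threaded sketch of unblocked Householder tridiagonalization in pure Python. It is not the multi-GPU implementation described in the abstract: the function name `tridiagonalize` and the list-of-lists matrix layout are assumptions made for illustration, and the rank-2 update shown here is the unblocked BLAS-2 counterpart of the blocked BLAS-3 low-rank update. The comments mark where the symmetric matrix-vector product and the symmetric update occur in each step.

```python
import math


def tridiagonalize(a):
    """Reduce a symmetric matrix (list of lists) to tridiagonal form.

    Returns (d, e): the diagonal and subdiagonal of T = Q^T A Q,
    where Q is a product of Householder reflectors.
    Illustrative sketch only; the name and layout are hypothetical.
    """
    n = len(a)
    a = [row[:] for row in a]  # work on a copy, leave the input intact
    for k in range(n - 2):
        m = n - k - 1
        # Column below the diagonal that this step reduces.
        x = [a[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already zero below the diagonal
        # Householder vector v with H = I - tau * v v^T and H x = alpha * e_1.
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        tau = 2.0 / sum(t * t for t in v)
        # Symmetric matrix-vector product (BLAS-2 symv): p = tau * A22 * v.
        p = [tau * sum(a[k + 1 + i][k + 1 + j] * v[j] for j in range(m))
             for i in range(m)]
        # w = p - (tau / 2) * (p^T v) * v
        c = 0.5 * tau * sum(p[i] * v[i] for i in range(m))
        w = [p[i] - c * v[i] for i in range(m)]
        # Symmetric rank-2 update (BLAS-2 syr2): A22 -= v w^T + w v^T.
        for i in range(m):
            for j in range(m):
                a[k + 1 + i][k + 1 + j] -= v[i] * w[j] + w[i] * v[j]
        # Store the reduced column and row.
        a[k + 1][k] = a[k][k + 1] = alpha
        for i in range(k + 2, n):
            a[i][k] = a[k][i] = 0.0
    d = [a[i][i] for i in range(n)]
    e = [a[i + 1][i] for i in range(n - 1)]
    return d, e
```

Calling `tridiagonalize` on a small symmetric matrix returns the diagonal and subdiagonal of a tridiagonal matrix with the same eigenvalues as the input. Because the transformation is an orthogonal similarity, the trace and the Frobenius norm are preserved, which gives a quick sanity check.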

GPU Acceleration of Finding Maximum Eigenvalue of Positive Matrices

Accelerating Parallel Jacobi Method for Matrix Eigenvalue Computation in DOA Estimation Algorithm

Generating Approximate Inverse Preconditioners for Sparse Matrices Using CUDA and GPGPU

GPU Implementation for Solving Eigenvalues of a Matrix

Parallel multiple nonnegative matrices factorization using graphics processing unit

Acceleration of Approximate Matrix Multiplications on GPUs

A Parallel Preconditioned Power Method For The Maximum Eigenvalue Of Real Symmetric Matrices

GPUSGD: A GPU-Accelerated Stochastic Gradient Descent Algorithm for Matrix Factorization

GPU Accelerated Parallel Cholesky Factorization

Optimized Computation for Determinant of Multivariate Polynomial Matrices on GPGPU

CUDA-based PCG algorithm optimization for a large sparse matrix

Towards Optimal Fast Matrix Multiplication on CPU-GPU Platforms

Parallel singular value decomposition on heterogeneous multi-core and multi-GPU platforms

High Performance Matrix Multiplication on General Purpose Graphics Processing Units

Extracting the Potential of Emerging Hardware Accelerators for Symmetric Eigenvalue Decomposition

A Fast Parallel Matrix Inversion Algorithm Based on Heterogeneous Multicore Architectures

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization

On Parallel Solution of Sparse Triangular Linear Systems in CUDA

GPU-based multifrontal optimizing method in sparse Cholesky factorization