Abstract: The powerful parallel processing capability of GPUs is widely recognized throughout the industry; however, GPU computing environments have not yet been widely used in the field of parallel computing. In this study, we develop a method for parallelizing serial programs for GPU computing. In particular, we propose an approach called PRODA that speeds up parallel programs on GPUs through dependency analysis. PRODA provides the theoretical underpinnings of task partitioning for parallel programs running in GPU computing environments. At the heart of PRODA is an analyzer for program workflows as well as for data and function dependencies in a GPU program. With the dependency analysis in place, PRODA assigns computing tasks to multiple GPU cores in a way that speeds up parallel programs on GPUs. An overarching goal of PRODA is to minimize the cost of data communication between GPUs and the main memory of the host CPU. PRODA achieves this goal by deploying two strategies. First, PRODA assigns functions that process the same data to the same GPU core. Second, PRODA runs multiple independent functions on separate GPU cores. In doing so, PRODA improves the parallelism of parallel programs. We evaluate the performance of PRODA by running two popular benchmarks (i.e., AES and T26) on a 256-core system, where the key length is set to 256 bits. The experimental results show that the speedup ratio of AES governed by PRODA is 5.2. Specifically, PRODA improves the performance of the existing CFM scheme by a factor of 1.39. To measure the cost of parallel computing, we test PRODA and the alternative solutions by running AES with a 256-bit key length on 128 cores. The cost of parallel computing in PRODA is 524.8 ms, which is 61.2% lower than that of the existing SA solution. The parallel efficiency of PRODA is 2.08, which represents an improvement over the PDM algorithm by a factor of 2.08.
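The two placement strategies described in the abstract can be illustrated with a small sketch. This is not PRODA's analyzer or its actual partitioning algorithm, which the abstract does not specify; it is a minimal model that assumes each function is described only by the set of data items it touches. The function names, the data items, and the helpers `partition_by_shared_data` and `assign_to_cores` are all hypothetical, and cores are represented by plain integers rather than real GPU resources.

```python
from collections import defaultdict


def partition_by_shared_data(funcs):
    """Group functions that touch a common data item.

    `funcs` maps a function name to the set of data items it reads or
    writes (a hypothetical, simplified dependency description).
    Uses union-find so that sharing is transitive.
    """
    parent = {name: name for name in funcs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_user = {}  # data item -> first function seen using it
    for name, data_items in funcs.items():
        for item in data_items:
            if item in first_user:
                # Same data: merge this function into the earlier group.
                parent[find(name)] = find(first_user[item])
            else:
                first_user[item] = name

    groups = defaultdict(list)
    for name in funcs:
        groups[find(name)].append(name)
    return list(groups.values())


def assign_to_cores(groups, num_cores):
    """Keep each group on one core; spread groups round-robin over cores."""
    assignment = {}
    for i, group in enumerate(groups):
        for name in group:
            assignment[name] = i % num_cores
    return assignment


# Hypothetical workload: two functions share "key", two are independent.
funcs = {
    "load_key": {"key"},
    "expand_key": {"key", "round_keys"},
    "encrypt_block_a": {"block_a"},
    "encrypt_block_b": {"block_b"},
}
groups = partition_by_shared_data(funcs)
cores = assign_to_cores(groups, num_cores=4)
```

In this model, `load_key` and `expand_key` end up on the same core because they share `key`, so that data would not need to be moved between cores, while the two independent functions are placed on separate cores and could run concurrently. Round-robin placement is an assumption made for brevity; a real scheduler would also weigh load balance and transfer cost.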

Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs

A Row Decomposition-based Approach for Sparse Matrix Multiplication on GPUs

Predicting the Output Structure of Sparse Matrix Multiplication with Sampled Compression Ratio

Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Accelerating approximate matrix multiplication for near-sparse matrices on GPUs

Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUs

Atomic Reduction Based Sparse Matrix-Transpose Vector Multiplication on GPUs

Efficient Algorithm Design of Optimizing SpMV on GPU

Sparse Matrix-Vector Multiplication Optimizations based on Matrix Bandwidth Reduction using NVIDIA CUDA

Optimizing sparse general matrix–matrix multiplication for DCUs

FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs

Optimizing sparse matrix-vector multiplication based on gpu

TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs

Efficient GPU implementation of randomized SVD and its applications

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

A Data Locality-Aware Design Framework For Reconfigurable Sparse Matrix-Vector Multiplication Kernel

PRODA: Improving Parallel Programs on GPUs Through Dependency Analysis

Esspmv: an Embedded-FPGA-based Hardware Accelerator for Symmetric Sparse Matrix-Vector Multiplication

Register-Aware Optimizations for Parallel Sparse Matrix–Matrix Multiplication

Sgap: Towards Efficient Sparse Tensor Algebra Compilation for GPU