Abstract: Modern graphics processing units (GPUs) deliver tremendous computing horsepower by running tens of thousands of threads concurrently. This massively parallel execution model has been effective at hiding the long latency of off-chip memory accesses in graphics and other general computing applications that exhibit regular memory behavior. With the fast-growing demand for general-purpose computing on GPUs (GPGPU), GPU workloads are becoming highly diversified and thus require synergistic coordination of both computing and memory resources to unleash the computing power of GPUs. Accordingly, recent graphics processors have begun to integrate an on-die level-2 (L2) cache. The huge number of threads on GPUs, however, poses significant challenges to L2 cache design. Experiments on a variety of GPGPU applications reveal that the L2 cache may or may not improve overall performance, depending on the characteristics of the application. In this paper, we propose efficient techniques to improve GPGPU performance by orchestrating both the L2 cache and memory in a unified framework. The basic philosophy is to exploit the temporal locality among the massive number of concurrent memory requests and to minimize the impact of memory divergence among simultaneously executed groups of threads. Our major contributions are twofold. First, a priority-based cache management scheme is proposed to maximize the chance that frequently revisited data is kept in the cache. Second, an effective memory scheduling scheme is introduced that reorders memory requests in the memory controller according to their divergence behavior, reducing the average waiting time of warps. Simulation results reveal that our techniques enhance overall performance by 10% on average for memory-intensive benchmarks, with a maximum gain of up to 30%.
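
To make the link between memory divergence and warp waiting time concrete, the toy model below may help. It is only a sketch under stated assumptions, not the scheduler proposed in the paper: the abstract does not specify the reordering policy, so the code assumes a simple "fewest outstanding requests first" rule, a single memory channel, and a fixed service time per request. The names `average_warp_wait` and `divergence_aware_order` are hypothetical.

```python
from collections import Counter


def average_warp_wait(service_order, service_time=1):
    """Average time at which each warp receives its LAST request.

    service_order: list of warp ids, one entry per memory request,
    in the order the (assumed single-channel) controller serves them.
    A warp can only resume once all of its requests are served.
    """
    finish = {}
    clock = 0
    for warp in service_order:
        clock += service_time
        finish[warp] = clock  # overwritten until the warp's last request
    return sum(finish.values()) / len(finish)


def divergence_aware_order(arrival_order):
    """Assumed policy: serve warps with the fewest requests first.

    Requests of the same warp are served back to back; ties are
    broken by the warp's first arrival position.
    """
    counts = Counter(arrival_order)
    first_seen = {}
    for position, warp in enumerate(arrival_order):
        first_seen.setdefault(warp, position)
    warps = sorted(counts, key=lambda w: (counts[w], first_seen[w]))
    return [w for w in warps for _ in range(counts[w])]


# Warp A is highly divergent (4 requests), B has 2, C has 1.
arrival = ["A", "B", "A", "C", "A", "B", "A"]
reordered = divergence_aware_order(arrival)

print(average_warp_wait(arrival))    # arrival (FIFO) order
print(average_warp_wait(reordered))  # fewest-requests-first order
```

In this example the average waiting time drops from 17/3 to 11/3 time units, because the less divergent warps B and C no longer wait behind the interleaved requests of warp A. A real memory controller would also have to weigh row-buffer locality and bank-level parallelism, which this model ignores.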

Efficient Concurrent L1-Minimization Solvers on GPUs.

Efficient Kernel Management on GPUs.

Iterative Methods in GPU-Resident Linear Solvers for Nonlinear Constrained Optimization

A New Hybrid GPU-CPU Sparse LDLT Factorization Algorithm with GPU and CPU Factorizing Concurrently

On Parallel Solution of Sparse Triangular Linear Systems in CUDA

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

Fast ℓ1-Minimization and Parallelization for Face Recognition

Parallel optimization for sparse matrix-vector on GPU

Parallel L-BFGS-B Algorithm on GPU.

A Highly Efficient GPU-CPU Hybrid Parallel Implementation of Sparse LU Factorization

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods.

Generating Approximate Inverse Preconditioners for Sparse Matrices Using CUDA and GPGPU

Simultaneous Solving of Batched Linear Programs on a GPU

Optimizing Finite Volume Method Solvers on Nvidia GPUs.

Efficient GPU Spatial-Temporal Multitasking

Orchestrating Cache Management and Memory Scheduling for GPGPU Applications.

Batched sparse direct solver design and evaluation in SuperLU_DIST

Improving Dense Linear Equation Solver on Hybrid CPU-GPU System.

Accelerating Sparse Approximate Matrix Multiplication on GPUs