A Memory Access Reduced Sort on Multi-core GPU

Chengxin Guo, Hong Chen, Cuiping Li, Tianzhen Wu
DOI: https://doi.org/10.1109/hpcc/smartcity/dss.2018.00108
2018-01-01
Abstract: In recent years, many-core co-processors have become the main trend in high-performance computing because of their powerful parallel computing capability. The GPU is one of the most promising of these co-processors and has been used to accelerate a wide range of applications. Sorting is a fundamental operation in computing that underlies many applications. GPUs make highly efficient sorting algorithms possible, but they also pose challenges. On the one hand, GPU performance is sensitive to memory access. On the other hand, the stages of sorting on a GPU require many accesses to device memory. To reduce the number of device memory accesses during sorting, we propose a Memory Access Reduced (MAR) hybrid sort, an approach that combines a memory-efficient radix sort with a local sort. The local sort is implemented in two different ways. One is a multiway bitonic sorting network (MAR-MBSN), which takes advantage of warp-shuffle instructions. The other is radix and place sort (MAR-RPS). With the increasing demand for high-performance computing, multi-core GPUs have emerged. To make full use of the GPU's computing resources, we apply two methods for implementing our approach on a dual-core GPU. Experiments show that, compared with the sorting routines of the CUB library, MAR-MBSN achieves an improvement of up to 44% and MAR-RPS a speedup of up to 3-fold in the local sort. As a result, MAR-MBSN and MAR-RPS achieve speedups of at least 1.5x and 1.66x, respectively. We also implement our approaches on a multi-core GPU and analyze the factors that affect performance.
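The hybrid structure described in the abstract (a radix pass followed by a local sort) can be illustrated with a minimal CPU-only sketch. This is not the authors' GPU implementation: the function names `bitonic_sort` and `hybrid_sort`, the parameters `key_bits` and `radix_bits`, and the restriction to non-negative integer keys are all assumptions made for illustration. The radix pass here is a single partition on the most significant bits, and the local sort is a plain bitonic sorting network rather than the multiway variant named in the abstract. The partner index `i ^ j` follows the same XOR pairing that a warp-shuffle exchange such as CUDA's `__shfl_xor_sync` uses between lanes.

```python
from typing import List


def bitonic_sort(block: List[int]) -> List[int]:
    """Sort a block with a bitonic sorting network (illustrative local sort)."""
    # The network needs a power-of-two length, so pad with the block maximum.
    n = 1
    while n < len(block):
        n *= 2
    pad = max(block, default=0)
    a = list(block) + [pad] * (n - len(block))

    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # compare-exchange distance within a merge step
            for i in range(n):
                partner = i ^ j            # XOR pairing, as in a shuffle exchange
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    # Padding values equal the maximum, so they end up at the tail.
    return a[:len(block)]


def hybrid_sort(keys: List[int], key_bits: int = 16, radix_bits: int = 4) -> List[int]:
    """Partition keys by their top radix_bits, then sort each bucket locally.

    Assumes every key satisfies 0 <= key < 2**key_bits (hypothetical constraint).
    """
    shift = key_bits - radix_bits
    buckets: List[List[int]] = [[] for _ in range(1 << radix_bits)]
    for key in keys:
        buckets[key >> shift].append(key)   # one radix pass on the high bits

    out: List[int] = []
    for bucket in buckets:                  # buckets are already in key order
        out.extend(bitonic_sort(bucket))
    return out
```

Because the buckets are ordered by their high bits, concatenating the locally sorted buckets yields a fully sorted sequence. On a GPU each bucket would presumably be small enough to sort in fast on-chip storage, which is where the reduction in device memory accesses would come from; that mapping is not modeled in this sketch.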