Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs

Kyle Berney,Nodari Sitchinava
DOI: https://doi.org/10.1109/ipdps47924.2020.00119
2020-05-01
Abstract:Currently, the fastest comparison-based sorting implementation on GPUs is implemented using a parallel pairwise merge sort algorithm (Thrust library). To achieve fast runtimes, the number of threads t to sort the input of N elements is fine-tuned experimentally for each generation of Nvidia GPUs in such a way that the number of elements E = N/t that each thread accesses in each merging round results in a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency-hiding through thread oversubscription/occupancy. In this paper, we show that for every choice of E < w, such that E and w are co-prime, there exists an input permutation on which every warp of w threads of the Thrust merge sort is effectively reduced to using at most ⌈w/E⌉ threads due to sequentialization of shared memory accesses due to bank conflicts. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing wE contiguous shared memory locations. Our proof is constructive, i.e., we are able to automatically construct such permutation for every value of E. We also show in practice that such constructed inputs result in up to ~50% slowdown, compared to the performance on random inputs, on modern GPU hardware.
What problem does this paper attempt to address?