Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance

Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, Rezaul A. Chowdhury
DOI: https://doi.org/10.1145/2686745.2686752
2014-01-01
Abstract: The state-of-the-art "trapezoidal decomposition algorithm" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. However, the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm, called "cache-oblivious wavefront" (COW), that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, thereby reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. We provide experimental measurements of absolute running times, burdened span measured with Cilkview, and L1/L2 cache misses measured with PAPI to validate our claims.
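The distinction between real and artificial dependencies can be illustrated with a small sketch. The code below is not the paper's COW algorithm and does not model recursion or cache behavior. It assumes a hypothetical 1D three-point stencil with fixed end cells, and all function names are made up for illustration. It shows only that a cell can be computed as soon as its three inputs from the previous time step exist, so an order driven purely by those real dependencies (here, a flat wavefront over the skewed index 2*t + i) yields the same values as the usual step-by-step order.

```python
def update(prev, i):
    """Hypothetical 3-point stencil on row `prev`; end cells are held fixed."""
    if i == 0 or i == len(prev) - 1:
        return prev[i]
    return 0.25 * prev[i - 1] + 0.5 * prev[i] + 0.25 * prev[i + 1]


def run_by_time_step(u0, steps):
    """Reference order: finish all of time step t before starting t + 1."""
    grid = [list(u0)]
    for _ in range(steps):
        grid.append([update(grid[-1], i) for i in range(len(u0))])
    return grid


def run_by_wavefront(u0, steps):
    """Visit cells grouped by the skewed index 2*t + i.

    Every input of cell (t, i) lives at time t - 1 and has a strictly
    smaller skewed index, so cells sharing an index are mutually
    independent and could, in principle, run in parallel.
    """
    n = len(u0)
    grid = [list(u0)] + [[None] * n for _ in range(steps)]
    fronts = {}
    for t in range(1, steps + 1):
        for i in range(n):
            fronts.setdefault(2 * t + i, []).append((t, i))
    for key in sorted(fronts):
        for t, i in fronts[key]:
            # Reading an uncomputed (None) input would raise a TypeError,
            # so finishing without error confirms the order is valid.
            grid[t][i] = update(grid[t - 1], i)
    return grid, len(fronts)


if __name__ == "__main__":
    u0 = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
    reference = run_by_time_step(u0, 4)
    skewed, num_fronts = run_by_wavefront(u0, 4)
    print(skewed == reference, num_fronts)
```

Both orders evaluate the same expression on the same inputs for every cell, so the grids match exactly. Any ordering constraint beyond the three-input rule, such as those a recursive decomposition may impose between sibling subtasks, is artificial in the abstract's sense.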