Cache-oblivious Wavefront: Improving Parallelism of Recursive Dynamic Programming Algorithms Without Losing Cache-Efficiency

Yuan Tang,Ronghui You,Haibin Kan,Jesmin Jahan Tithi,Pramod Ganapathi,Rezaul A. Chowdhury
DOI: https://doi.org/10.1145/2688500.2688514
2015-01-01
ACM SIGPLAN Notices
Abstract:State-of-the-art cache-oblivious parallel algorithms for dynamic programming (DP) problems usually guarantee asymptotically optimal cache performance without any tuning of cache parameters, but they often fail to exploit the theoretically best parallelism at the same time. While these algorithms achieve cache-optimality through the use of a recursive divide-and-conquer (DAC) strategy, scheduling tasks at the granularity of task dependency introduces artificial dependencies in addition to those arising from the defining recurrence equations. We removed the artificial dependency by scheduling tasks ready for execution as soon as all its real dependency constraints are satisfied, while preserving the cache-optimality by inheriting the DAC strategy. We applied our approach to a set of widely known dynamic programming problems, such as Floyd-Warshall's All-Pairs Shortest Paths, Stencil, and LCS. Theoretical analyses show that our techniques improve the span of 2-way DAC-based Floyd Warshall's algorithm on an $n$ node graph from $Thn^2n$ to $Thn$, stencil computations on a $d$-dimensional hypercubic grid of width $w$ for $h$ time steps from $Th(d^2 h) w^ (d+2) - 1$ to $Thh$, and LCS on two sequences of length $n$ each from $Thn^_2 3$ to $Thn$. In each case, the total work and cache complexity remain asymptotically optimal. Experimental measurements exhibit a $3$ - $5$ times improvement in absolute running time, $10$ - $20$ times improvement in burdened span by Cilkview, and approximately the same L1/L2 cache misses by PAPI.
What problem does this paper attempt to address?