Making Halide Efficient for Multicore Systems

Yu Zhang, Yuxiang Zhang
DOI: https://doi.org/10.1109/BIGCOM.2018.00041
2018-01-01
Abstract: Graphics and imaging applications, especially emerging ones such as medical imaging and self-driving cars, demand orders of magnitude more computation and require efficient image processing pipeline implementations. Halide is a domain-specific language and compiler designed to make it easier to write high-performance image processing code. It separates what is being computed from how it is computed, enabling programmers to tune the schedule for high performance. However, Halide applications scale poorly on a large number of CPU cores. In this paper, we first perform a detailed parallelism and performance analysis of Halide applications on a Linux system with 4x8 physical cores. Our analysis shows that page faults and contention on the shared address space, as well as unaligned packed data movements, are among the reasons for the poor scalability. To improve the performance of Halide, we propose a novel thread model, Sthread, and adopt it to reimplement the Halide runtime as Shalide, without modifying the Halide language, compiler, or runtime interface. The Sthread model allows each thread to have its own memory management structure and to share heap objects, globals, and stack data directly with other threads. On six image processing benchmarks, Shalide shows better speedup and runs up to 1.53x faster than the original Halide.