Abstract: Highly optimized programs are prone to bit rot: their performance quickly becomes suboptimal as new hardware and compiler techniques emerge. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically, so our system, Helium, relies on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify the shapes of input, intermediate, and output buffers. We then abstract a forest of concrete dependency trees, which contain absolute memory addresses, into symbolic trees suitable for high-level code generation. This is done by canonicalizing the trees, clustering them by structure, inferring higher-dimensional buffer accesses, and finally solving a set of linear equations over those buffer accesses to lift the trees to simple, high-level expressions. Helium can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop, giving a 75% performance improvement; four kernels from IrfanView, giving a 4.97× speedup; and one stencil from the miniGMG multigrid benchmark, giving a 4.25× speedup. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving a 1.12× speedup without affecting the user experience.
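The final step named in the abstract, solving linear equations over buffer accesses, can be illustrated with a minimal sketch. It is not Helium's implementation: the function names `solve_linear` and `fit_affine_access`, the sample format, and the assumption that each input index is an affine function of the output coordinates are all hypothetical choices made for this example. Given a few (output coordinates, input index) pairs of the kind a dynamic trace might yield, it recovers the coefficients and offset of the access.

```python
from fractions import Fraction


def solve_linear(rows, rhs):
    """Solve a square linear system exactly with Gauss-Jordan elimination."""
    n = len(rows)
    # Augmented matrix [A | b] over exact rationals to avoid rounding error.
    aug = [[Fraction(v) for v in row] + [Fraction(b)]
           for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("samples are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def fit_affine_access(samples):
    """Fit input_index = sum(coeffs[i] * out[i]) + offset.

    samples: list of (output_coords, input_index) pairs (hypothetical format).
    Assumes the first len(output_coords) + 1 samples are linearly independent.
    """
    dims = len(samples[0][0])
    chosen = samples[:dims + 1]
    rows = [list(out) + [1] for out, _ in chosen]  # trailing 1 models the offset
    rhs = [idx for _, idx in chosen]
    *coeffs, offset = solve_linear(rows, rhs)
    # Check every observed access against the fitted expression.
    for out, idx in samples:
        if sum(c * o for c, o in zip(coeffs, out)) + offset != idx:
            raise ValueError("access is not affine in the output coordinates")
    return coeffs, offset


# Made-up trace: output pixel (x, y) reads the input column x + 1.
trace = [((0, 0), 1), ((1, 0), 2), ((0, 1), 1), ((3, 2), 4)]
coeffs, offset = fit_affine_access(trace)
```

With the made-up trace above, the fit corresponds to the symbolic access `in(x + 1, ...)` for one input dimension; repeating it per dimension would give the full index expression. Exact rationals are used so that a non-affine access is rejected rather than silently approximated.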


Helium: Lifting High-Performance Stencil Kernels from Stripped x86 Binaries to Halide DSL Code
