Abstract: Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 binary and generate the corresponding code in the high-level domain-specific language Halide. Using Halide’s state-of-the-art optimizations targeting current hardware, we show that new optimized versions of these kernels can replace the originals to rejuvenate the application for newer hardware. The original optimized code for kernels in stripped binaries is nearly impossible to analyze statically. Instead, we rely on dynamic traces to regenerate the kernels. We perform buffer structure reconstruction to identify input, intermediate, and output buffer shapes. We abstract from a forest of concrete dependency trees that contain absolute memory addresses to symbolic trees suitable for high-level code generation. This is done by canonicalizing trees, clustering them based on structure, inferring higher-dimensional buffer accesses, and finally solving a set of linear equations based on buffer accesses to lift them up to simple, high-level expressions. Our system, Helium, can handle highly optimized, complex stencil kernels with input-dependent conditionals. We lift seven kernels from Adobe Photoshop, giving a 75% performance improvement; four kernels from IrfanView, leading to 4.97× performance; and one stencil from the miniGMG multigrid benchmark, netting a 4.25× improvement in performance. We manually rejuvenated Photoshop by replacing eleven of Photoshop’s filters with our lifted implementations, giving a 1.12× speedup without affecting the user experience.
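
The last lifting step mentioned in the abstract, solving linear equations over buffer accesses, can be illustrated with a minimal sketch. It is not Helium's implementation. The trace, buffer base addresses, strides, and the helper names `address_to_coords`, `solve_linear`, and `fit_affine_index` are all hypothetical, and the buffer layout is assumed to be already known (in the paper it is reconstructed from the trace). The sketch converts absolute addresses to 2-D coordinates and then fits each input index as an affine function of the output coordinates.

```python
from fractions import Fraction

def address_to_coords(addr, base, strides):
    """Turn an absolute address into buffer coordinates.

    strides are in bytes, ordered from outermost to innermost dimension.
    """
    offset = addr - base
    coords = []
    for stride in strides:
        coords.append(offset // stride)
        offset %= stride
    return tuple(coords)

def solve_linear(A, b):
    """Solve A x = b exactly with Gauss-Jordan elimination.

    Assumes A is square and nonsingular (raises StopIteration otherwise).
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def fit_affine_index(out_xy, in_index):
    """Fit in_index = a*x + b*y + c from three (x, y) samples."""
    A = [[x, y, 1] for (x, y) in out_xy]
    return solve_linear(A, in_index)

# Hypothetical layout: 1-byte elements, 64-byte rows (assumed, not from the paper).
IN_BASE, OUT_BASE, STRIDES = 0x1000, 0x8000, (64, 1)

# Hypothetical trace: (address written, address read) for one stencil tap.
trace = [
    (OUT_BASE + 64 * 3 + 2, IN_BASE + 64 * 3 + 3),
    (OUT_BASE + 64 * 3 + 5, IN_BASE + 64 * 3 + 6),
    (OUT_BASE + 64 * 7 + 2, IN_BASE + 64 * 7 + 3),
]

out_xy, in_x, in_y = [], [], []
for out_addr, in_addr in trace:
    oy, ox = address_to_coords(out_addr, OUT_BASE, STRIDES)
    iy, ix = address_to_coords(in_addr, IN_BASE, STRIDES)
    out_xy.append((ox, oy))
    in_x.append(ix)
    in_y.append(iy)

coeffs_x = fit_affine_index(out_xy, in_x)  # (a, b, c) for the input x index
coeffs_y = fit_affine_index(out_xy, in_y)  # (a, b, c) for the input y index
print("in_x = %s*x + %s*y + %s" % tuple(coeffs_x))
print("in_y = %s*x + %s*y + %s" % tuple(coeffs_y))
```

For this made-up trace the fitted coefficients correspond to the symbolic access `input(x + 1, y)`, which is the kind of expression a Halide definition would use. Exact rational arithmetic is used so that a non-integer coefficient would show up as a sign that the samples do not fit a simple affine access.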

Helium: Lifting High-Performance Stencil Kernels from Stripped x86 Binaries to Halide DSL Code
