mlirSynth: Automatic, Retargetable Program Raising in Multi-Level IR using Program Synthesis

Alexander Brauckmann, Elizabeth Polgreen, Tobias Grosser, Michael F. P. O'Boyle
2023-10-06
Abstract: MLIR is an emerging compiler infrastructure for modern hardware, but existing programs cannot take advantage of MLIR's high-performance compilation if they are described in lower-level general purpose languages. Consequently, to avoid programs needing to be rewritten manually, this has led to efforts to automatically raise lower-level to higher-level dialects in MLIR. However, current methods rely on manually-defined raising rules, which limit their applicability and make them challenging to maintain as MLIR dialects evolve. We present mlirSynth, a novel approach which translates programs from lower-level MLIR dialects to high-level ones without manually defined rules. Instead, it uses available dialect definitions to construct a program space and searches it effectively using type constraints and equivalences. We demonstrate its effectiveness by raising C programs to two distinct high-level MLIR dialects, which enables us to use existing high-level dialect specific compilation flows. On Polybench, we show a greater coverage than previous approaches, resulting in geomean speedups of 2.5x (Intel) and 3.4x (AMD) over state-of-the-art compilation flows for the C programming language. mlirSynth also enables retargetability to domain-specific accelerators, resulting in a geomean speedup of 21.6x on a TPU.
Programming Languages; Computation and Language; Distributed, Parallel, and Cluster Computing; Performance
What problem does this paper attempt to address?
The paper addresses the problem of automatically converting code written in existing low-level programming languages into a high-level intermediate representation (IR) so that it can benefit from modern hardware acceleration. Specifically, it proposes mlirSynth, a method that automatically raises lower-level MLIR (Multi-Level IR) dialects to higher-level dialects without manually defined rules. The paper points out that modern hardware has diversified, with examples such as Tensor Processing Units (TPUs) and accelerators designed specifically for artificial intelligence; these devices deliver high performance but also increase programming complexity. To address this, researchers have developed domain-specific languages (DSLs) that simplify problem descriptions and map efficiently to different hardware accelerators through domain-specific compilers. MLIR, an emerging compiler infrastructure, supports high-level program representations, but programs written in existing low-level languages cannot directly use MLIR's high-performance compilation capabilities. The key contributions of the paper include:
1. **mlirSynth framework**: A method for automatically raising lower-level MLIR dialects to higher-level dialects without hard-coded compiler transformations or raising rules.
2. **Scalable code synthesis method**: The search space is generated automatically from MLIR's TableGen dialect definitions, so code can be synthesized across multiple MLIR dialects.
3. **Fast bottom-up enumerative synthesizer**: The search space is pruned quickly using type constraints and input-output behavior equivalence.
4. **Higher coverage, performance, and accuracy** than existing raising methods.
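The third contribution, bottom-up enumeration pruned by type constraints and input-output equivalence, can be illustrated with a toy sketch. This is not mlirSynth's implementation: the four-operation "dialect" in `OPS`, the `reference` loop nest, the test inputs in `TESTS`, and the `synthesize` function are all hypothetical stand-ins, and the real tool derives its operations from TableGen definitions and works on MLIR rather than Python tuples.

```python
from itertools import product

# Hypothetical mini-"dialect": op name -> (operand types, result type, semantics).
OPS = {
    "add":   (("vec", "vec"), "vec", lambda a, b: tuple(x + y for x, y in zip(a, b))),
    "mul":   (("vec", "vec"), "vec", lambda a, b: tuple(x * y for x, y in zip(a, b))),
    "scale": (("scalar", "vec"), "vec", lambda s, v: tuple(s * x for x in v)),
    "dot":   (("vec", "vec"), "scalar", lambda a, b: sum(x * y for x, y in zip(a, b))),
}


def reference(a, b):
    """Low-level loop nest standing in for the program to be raised."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    out = []
    for i in range(len(a)):
        out.append(acc * a[i])
    return tuple(out)


# Fixed test inputs used to compare input-output behavior.
TESTS = [((1, 2, 3), (4, 5, 6)), ((2, 0, -1), (3, 7, 1)), ((1, 1, 2), (0, -2, 5))]


def synthesize(max_rounds=3):
    """Return an expression over OPS matching `reference` on TESTS, or None."""
    target = tuple(reference(a, b) for a, b in TESTS)
    # Each candidate is (expression string, type, outputs on every test input).
    pool = [
        ("a", "vec", tuple(a for a, _ in TESTS)),
        ("b", "vec", tuple(b for _, b in TESTS)),
    ]
    seen = {(typ, outs) for _, typ, outs in pool}
    for _ in range(max_rounds):
        new = []
        for name, (arg_types, ret_type, fn) in OPS.items():
            # Type constraint: only candidates of the required operand type are tried.
            choices = [[c for c in pool if c[1] == t] for t in arg_types]
            for args in product(*choices):
                outs = tuple(fn(*vals) for vals in zip(*(c[2] for c in args)))
                # Equivalence pruning: skip candidates that behave like one already kept.
                if (ret_type, outs) in seen:
                    continue
                seen.add((ret_type, outs))
                expr = f"{name}({', '.join(c[0] for c in args)})"
                if ret_type == "vec" and outs == target:
                    return expr
                new.append((expr, ret_type, outs))
        pool.extend(new)  # Larger expressions are built from this round's survivors.
    return None


print(synthesize())
```

In this sketch `add(b, a)` is discarded because it produces the same outputs as `add(a, b)`, and `scale` is never attempted until a scalar-typed candidate such as `dot(a, b)` exists. Agreement on a handful of test inputs is only evidence of equivalence, not a proof, so a real system would need a stronger check before accepting a candidate.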
The paper demonstrates the method by raising C programs to two different high-level MLIR dialects (Linalg IR and HLO IR), which makes the existing optimizing compilation flows for these dialects available. Experimental results show that mlirSynth covers more programs and generates more efficient implementations than existing techniques. For example, on the Polybench benchmark, mlirSynth achieved geomean speedups of 2.5x (Intel platform) and 3.4x (AMD platform) compared to the LLVM-O3 compiler. Additionally, because mlirSynth can raise programs to the HLO dialect, it can use the XLA compiler to target Tensor Processing Units (TPUs), achieving a geomean speedup of 21.6x.
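The abstract reports these speedups as geometric means, which weight a 2x slowdown and a 2x speedup symmetrically. A minimal sketch of that calculation follows; the `geomean` helper and the per-benchmark values are hypothetical and are not taken from the paper.

```python
import math


def geomean(values):
    """Geometric mean: the n-th root of the product, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geomean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark speedups, for illustration only.
speedups = [0.5, 2.0, 8.0]

print(geomean(speedups))              # geometric mean of the three ratios
print(sum(speedups) / len(speedups))  # arithmetic mean, pulled up by the outlier
```

For these three illustrative values the geometric mean is 2.0, while the arithmetic mean is 3.5, which is why ratio-based results are usually summarized with the former.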