BenchDirect: A Directed Language Model for Compiler Benchmarks

Foivos Tsimpourlas, Pavlos Petoumenos, Min Xu, Chris Cummins, Kim Hazelwood, Ajitha Rajan, Hugh Leather
2023-03-03
Abstract: The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near-optimal heuristics with little human effort, but they are limited by a severe lack of diverse benchmarks to train on. Generative AI has been used by researchers to synthesize benchmarks into existing datasets. However, the synthetic programs are short, exceedingly simple, and lack diversity in their features. We develop BenchPress, the first ML compiler benchmark generator that can be directed within source code feature representations. BenchPress synthesizes executable functions by infilling code that conditions on the program's left and right context. BenchPress uses active learning to introduce new benchmarks with unseen features into the dataset of Grewe et al.'s CPU vs GPU heuristic, improving its acquired performance by 50%. BenchPress targets features that have been impossible for other synthesizers to reach. In 3 feature spaces, we outperform human-written code from GitHub, CLgen, CLSmith and the SRCIROR mutator in targeting the features of Rodinia benchmarks. BenchPress steers generation with beam search over a feature-agnostic language model. We improve this with BenchDirect, which utilizes a directed LM that infills programs by jointly observing source code context and the compiler features that are targeted. BenchDirect achieves up to 36% better accuracy in targeting the features of Rodinia benchmarks, is 1.8x more likely to give an exact match, and speeds up execution time by up to 72% compared to BenchPress. Both our models produce code that is difficult to distinguish from human-written code. We conduct a Turing test which shows our models' synthetic benchmarks are labelled as 'human-written' as often as human-written code from GitHub.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the difficulty of finding good compiler optimization heuristics by hand as hardware and software complexity grows exponentially. Predictive models can find near-optimal heuristics, but they lack diverse benchmarks to train on. The paper aims to improve these models by generating synthetic benchmarks with specific, targeted features.

To this end, the authors develop BenchPress, a machine learning compiler benchmark generator whose output can be directed within source code feature representations. BenchPress synthesizes executable functions by infilling code conditioned on the program's left and right context. The paper also introduces BenchDirect, an extension that uses a directed language model to jointly observe the source code context and the targeted compiler features, so that it reaches the requested features more accurately and in less time.

The main contributions of the paper are:
1. A directed code generator that operates within a feature space and produces compiler benchmarks with the features required by users or downstream tasks.
2. A method that uses active learning to automatically rank feature spaces and identify the feature regions that matter for downstream tasks.
3. Bidirectional source code generation, implemented by inserting [HOLE] markers at arbitrary positions in the sequence.
4. BenchDirect, the first bidirectional language model for code infilling conditioned on compiler features, which outperforms BenchPress in generating code with specific features.

In the experiments, BenchPress and BenchDirect outperform existing methods in synthesizing OpenCL benchmarks with targeted features, particularly in reaching the features of the Rodinia benchmark suite. Moreover, the code generated by these models is difficult to distinguish from human-written code, which indicates that the generated code is of high quality.
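
The two mechanisms summarized above, [HOLE]-based infilling and steering candidates toward target features, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `max_hole_len` parameter, the feature names `comp` and `mem`, and the use of Euclidean distance as the ranking criterion are all assumptions made for illustration, and no language model is involved: the code only shows how a training pair could be built from a token sequence and how a set of already generated candidates could be pruned to the ones closest to a target.

```python
import math
import random

HOLE = "[HOLE]"  # marker name taken from the summary; everything else is hypothetical


def make_infill_example(tokens, rng, max_hole_len=3):
    """Hide a random span of `tokens` behind one HOLE marker.

    Returns (masked_tokens, hidden_span). The context on both sides of the
    marker is kept, which is what makes the infilling task bidirectional.
    """
    start = rng.randrange(len(tokens) + 1)   # the hole may sit at any position
    length = rng.randint(0, max_hole_len)    # an empty hole is allowed
    end = min(start + length, len(tokens))
    masked = tokens[:start] + [HOLE] + tokens[end:]
    return masked, tokens[start:end]


def fill_hole(masked, infill):
    """Replace the HOLE marker with the tokens in `infill`."""
    i = masked.index(HOLE)
    return masked[:i] + infill + masked[i + 1:]


def rank_by_feature_distance(candidates, target, beam_width):
    """Keep the `beam_width` candidates closest to `target`.

    `candidates` is a list of (source, features) pairs, where `features` is a
    dict of numeric feature values. Euclidean distance is an assumed metric.
    """
    def distance(features):
        return math.sqrt(sum((features[k] - target[k]) ** 2 for k in target))

    return sorted(candidates, key=lambda c: distance(c[1]))[:beam_width]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "kernel void A ( global int * a ) { a [ 0 ] = 1 ; }".split()
    masked, hidden = make_infill_example(tokens, rng)
    print(" ".join(masked))
    print("hidden span:", hidden)

    # Hypothetical feature vectors for three generated candidates.
    candidates = [
        ("candidate_c", {"comp": 0, "mem": 0}),
        ("candidate_b", {"comp": 8, "mem": 4}),
        ("candidate_a", {"comp": 10, "mem": 4}),
    ]
    best = rank_by_feature_distance(candidates, {"comp": 10, "mem": 4}, beam_width=2)
    print([name for name, _ in best])
```

In this sketch the ranking happens after generation, which corresponds to steering a feature-agnostic model from the outside. A directed model in the sense of BenchDirect would instead receive the target features as part of its input, a step this example does not attempt to reproduce.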