Abstract: The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near-optimal heuristics with little human effort, but they are limited by a severe lack of diverse benchmarks to train on. Researchers have used generative AI to synthesize benchmarks and add them to existing datasets. However, the synthetic programs are short, exceedingly simple, and lack diversity in their features. We develop BenchPress, the first ML compiler benchmark generator that can be directed within source code feature representations. BenchPress synthesizes executable functions by infilling code, conditioning on the program's left and right context. BenchPress uses active learning to introduce new benchmarks with unseen features into the dataset of Grewe et al.'s CPU vs GPU heuristic, improving its acquired performance by 50%. BenchPress targets features that have been impossible for other synthesizers to reach. In 3 feature spaces, we outperform human-written code from GitHub, CLgen, CLSmith and the SRCIROR mutator in targeting the features of Rodinia benchmarks. BenchPress steers generation with beam search over a feature-agnostic language model. We improve on this with BenchDirect, which uses a directed LM that infills programs by jointly observing the source code context and the targeted compiler features. BenchDirect achieves up to 36% better accuracy in targeting the features of Rodinia benchmarks; it is 1.8x more likely to give an exact match, and it speeds up execution time by up to 72% compared to BenchPress. Both our models produce code that is difficult to distinguish from human-written code. We conduct a Turing test which shows that our models' synthetic benchmarks are labelled as 'human-written' as often as human-written code from GitHub.
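The idea of steering generation toward target features can be illustrated with a minimal sketch. It assumes each candidate program has already been mapped to a numeric feature vector and ranks candidates by Euclidean distance to the target; the distance metric, the function names (`euclidean`, `select_beam`) and the toy feature vectors are all illustrative assumptions, not BenchPress's actual implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_beam(candidates, target, beam_width):
    """Keep the beam_width candidates whose features lie closest to target.

    candidates: list of (name, feature_vector) pairs (hypothetical layout).
    """
    ranked = sorted(candidates, key=lambda c: euclidean(c[1], target))
    return ranked[:beam_width]

# Toy feature vectors; the three dimensions are placeholders, not real compiler features.
candidates = [
    ("kernel_a", [4.0, 1.0, 0.0]),
    ("kernel_b", [10.0, 2.0, 1.5]),
    ("kernel_c", [9.0, 2.0, 2.0]),
]
target = [10.0, 2.0, 2.0]

beam = select_beam(candidates, target, beam_width=2)
print([name for name, _ in beam])
```

In a directed search loop, the surviving candidates would be edited further and re-ranked until one reaches the target features or a step budget runs out.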

What problem does this paper attempt to address?

The paper addresses the difficulty of finding good compiler optimization heuristics by hand as hardware and software complexity grows exponentially. Predictive models can find near-optimal heuristics, but they lack diverse benchmarks to train on. The paper's goal is to improve these models by generating synthetic benchmarks that have specific, targeted features.

To this end, the authors developed BenchPress, a machine learning compiler benchmark generator that can be directed toward chosen regions of a source code feature space. BenchPress synthesizes executable functions by infilling code conditioned on the program's left and right context. The paper also introduces BenchDirect, an extension that uses a directed language model to observe the source code context and the targeted compiler features jointly, so that it reaches the requested features more accurately and faster.

The main contributions of the paper include:
1. A directed code generator that operates in feature space and produces compiler benchmarks with the features required by users or downstream tasks.
2. A method that uses active learning to automatically rank the feature space and identify the regions that matter for a downstream task.
3. Bidirectional source code generation, implemented by inserting [HOLE] markers at arbitrary positions in the sequence.
4. BenchDirect, the first bidirectional language model for code infilling that conditions on compiler features, which outperforms BenchPress in targeting specific features.

In the experimental evaluation, BenchPress and BenchDirect outperform existing methods at synthesizing OpenCL benchmarks with specific features, particularly when targeting the features of the Rodinia benchmark suite. Moreover, the code generated by these models is difficult to distinguish from human-written code, which indicates that the generated code is of high quality.
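The [HOLE]-based infilling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the whitespace tokenization, the helper names (`make_infill_example`, `fill_hole`) and the `max_hole_len` parameter are hypothetical. A random span is hidden behind a single [HOLE] marker, and a model would be trained to predict the hidden tokens from the context on both sides.

```python
import random

HOLE = "[HOLE]"

def make_infill_example(tokens, rng, max_hole_len=3):
    """Hide a random (possibly empty) span of tokens behind one [HOLE] marker.

    Returns (masked_tokens, hidden_tokens). The model input would be
    masked_tokens; the prediction target would be hidden_tokens.
    """
    start = rng.randrange(len(tokens) + 1)          # any position, including the end
    length = rng.randint(0, min(max_hole_len, len(tokens) - start))
    hidden = tokens[start:start + length]
    masked = tokens[:start] + [HOLE] + tokens[start + length:]
    return masked, hidden

def fill_hole(masked, infill):
    """Replace the [HOLE] marker with the predicted tokens."""
    i = masked.index(HOLE)
    return masked[:i] + infill + masked[i + 1:]

# Toy OpenCL-like kernel, tokenized on whitespace for illustration only.
tokens = "kernel void add ( global int * a ) { a [ 0 ] += 1 ; }".split()
rng = random.Random(0)

masked, hidden = make_infill_example(tokens, rng)
restored = fill_hole(masked, hidden)
print(" ".join(masked))
```

Because the marker can land anywhere in the sequence, the same mechanism covers left-to-right completion (hole at the end) as well as edits in the middle of a function.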

BenchDirect: A Directed Language Model for Compiler Benchmarks

BenchPress: A Deep Active Benchmark Generator

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Hints-In-Browser: Benchmarking Language Models for Programming Feedback Generation

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

LEGOBench: Scientific Leaderboard Generation Benchmark

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

CODEBench: A Neural Architecture and Hardware Accelerator Co-Design Framework

Benchmarking Language Model Creativity: A Case Study on Code Generation

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

EffiBench: Benchmarking the Efficiency of Automatically Generated Code

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios