LLM-Based Code Generation Method for Golang Compiler Testing

Qiuhan Gu
DOI: https://doi.org/10.1145/3611643.3617850
2023-11-30
Abstract:Modern optimizing compilers are among the most complex software systems humans build. One way to identify subtle compiler bugs is fuzzing. Both the quantity and the quality of testcases are crucial to the performance of fuzzing. Traditional testcase-generation methods, such as Csmith and YARPGen, have been proven successful at discovering compiler bugs. However, such generated testcases have limited coverage and quantity. In this paper, we present a code generation method for compiler testing based on LLM to maximize the quality and quantity of the generated code. In particular, to avoid undefined behavior and syntax errors in generated testcases, we design a filter strategy to clean the source code, preparing a high-quality dataset for the model training. Besides, we present a seed schedule strategy to improve code generation. We apply the method to test the Golang compiler and the result shows that our pipeline outperforms previous methods both qualitatively and quantitatively. It produces testcases with an average coverage of 3.38%, in contrast to the testcases generated by GoFuzz, which have an average coverage of 0.44%. Moreover, among all the generated testcases, only 2.79% exhibited syntax errors, and none displayed undefined behavior.
Computer Science
What problem does this paper attempt to address?