ISC4DGF: Enhancing Directed Grey-box Fuzzing with LLM-Driven Initial Seed Corpus Generation

Yijiang Xu, Hongrui Jia, Liguo Chen, Xin Wang, Zhengran Zeng, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang, Zhonghai Wu
2024-09-22
Abstract: Fuzz testing is crucial for identifying software vulnerabilities, with coverage-guided grey-box fuzzers like AFL and Angora excelling in broad detection. However, as the need for targeted detection grows, directed grey-box fuzzing (DGF) has become essential, focusing on specific vulnerabilities. The initial seed corpus, which consists of carefully selected input samples that the fuzzer uses as a starting point, is fundamental in determining the paths that the fuzzer explores. A well-designed seed corpus can guide the fuzzer more effectively towards critical areas of the code, improving the efficiency and success of the fuzzing process. Despite its importance, many works concentrate on refining guidance mechanisms while paying less attention to optimizing the initial seed corpus. In this paper, we introduce ISC4DGF, a novel approach to generating an optimized initial seed corpus for DGF using Large Language Models (LLMs). By leveraging LLMs' deep software understanding and refined user inputs, ISC4DGF creates a precise seed corpus that efficiently triggers specific vulnerabilities. Implemented on AFL and tested against state-of-the-art fuzzers like AFLGo, FairFuzz, and Entropic using the Magma benchmark, ISC4DGF achieved a 35.63x speedup and 616.10x fewer target reaches. Moreover, ISC4DGF focused on detecting target vulnerabilities more effectively, enhancing efficiency while operating with reduced code coverage.
Software Engineering
What problem does this paper attempt to address?
The paper addresses inefficiency in directed grey-box fuzzing (DGF) caused by a poorly chosen or unoptimized initial seed corpus, which makes it hard to trigger specific vulnerabilities. Existing fuzzers perform well at broad vulnerability detection but still struggle with directed detection of a particular vulnerability. The quality of the initial seed corpus is fundamental in determining which paths the fuzzer explores, yet many studies have concentrated on improving the guidance mechanism and paid less attention to optimizing the seeds themselves. To close this gap, the paper proposes ISC4DGF, a method that uses a Large Language Model (LLM) to generate the initial seed corpus. By combining the LLM's deep program understanding with key information supplied by the user, ISC4DGF generates precise initial seeds that trigger the targeted vulnerabilities more effectively, improving the efficiency and success rate of fuzzing.
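To make the idea of an LLM-generated initial seed corpus concrete, the following is a minimal sketch and not the authors' implementation. It assumes a caller-supplied `generate` function that wraps some LLM and returns candidate inputs as bytes; the prompt wording, the function names (`build_prompt`, `dedupe_seeds`, `write_corpus`, `generate_corpus`), and the file naming scheme are all hypothetical. The sketch only shows the surrounding plumbing: assembling user-provided information into a prompt, filtering the candidates, and writing them into a seed directory.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable, List


def build_prompt(project: str, driver_source: str, target_description: str) -> str:
    """Assemble user-provided information into one prompt (hypothetical wording)."""
    return (
        f"Project under test: {project}\n"
        f"Fuzz driver source:\n{driver_source}\n"
        f"Target vulnerability: {target_description}\n"
        "Propose input files that are likely to reach the target code."
    )


def dedupe_seeds(candidates: Iterable[bytes], max_size: int = 1 << 20) -> List[bytes]:
    """Drop empty, oversized, and duplicate candidates, preserving order."""
    seen = set()
    unique = []
    for candidate in candidates:
        if not candidate or len(candidate) > max_size:
            continue
        digest = hashlib.sha256(candidate).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(candidate)
    return unique


def write_corpus(seeds: Iterable[bytes], out_dir: Path) -> List[Path]:
    """Write each seed to its own file inside out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, seed in enumerate(seeds):
        path = out_dir / f"seed_{index:04d}"
        path.write_bytes(seed)
        paths.append(path)
    return paths


def generate_corpus(
    generate: Callable[[str], List[bytes]], prompt: str, out_dir: Path
) -> List[Path]:
    """Ask the (assumed) LLM wrapper for candidates and store the filtered set."""
    return write_corpus(dedupe_seeds(generate(prompt)), out_dir)
```

The resulting directory could then be handed to an AFL-based fuzzer as its input directory (the `-i` option of `afl-fuzz`). How ISC4DGF actually constructs prompts, validates seeds, and selects the final corpus is described in the paper itself and is not reproduced here.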