SoK: Prudent Evaluation Practices for Fuzzing

Moritz Schloegel, Nils Bars, Nico Schiller, Lukas Bernhard, Tobias Scharnowski, Addison Crump, Arash Ale Ebrahim, Nicolai Bissantz, Marius Muench, Thorsten Holz
DOI: https://doi.org/10.1109/SP54263.2024.00137
2024-05-17
Abstract: Fuzzing has proven to be a highly effective approach to uncover software bugs over the past decade. After AFL popularized the groundbreaking concept of lightweight coverage feedback, the field of fuzzing has seen a vast amount of scientific work proposing new techniques, improving methodological aspects of existing strategies, or porting existing methods to new domains. All such work must demonstrate its merit by showing its applicability to a problem, measuring its performance, and often showing its superiority over existing works in a thorough, empirical evaluation. Yet, fuzzing is highly sensitive to its target, environment, and circumstances, e.g., randomness in the testing process. After all, relying on randomness is one of the core principles of fuzzing, governing many aspects of a fuzzer's behavior. Combined with the often highly difficult to control environment, the reproducibility of experiments is a crucial concern and requires a prudent evaluation setup. To address these threats to validity, several works, most notably Evaluating Fuzz Testing by Klees et al., have outlined how a carefully designed evaluation setup should be implemented, but it remains unknown to what extent their recommendations have been adopted in practice. In this work, we systematically analyze the evaluation of 150 fuzzing papers published at the top venues between 2018 and 2023. We study how existing guidelines are implemented and observe potential shortcomings and pitfalls. We find a surprising disregard of the existing guidelines regarding statistical tests and systematic errors in fuzzing evaluations. For example, when investigating reported bugs, ...
Software Engineering, Cryptography and Security
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper addresses the reproducibility and validity of fuzzing evaluations. Specifically, the authors systematically analyzed 150 fuzzing papers published at top venues from 2018 to 2023, studied whether these papers followed the existing evaluation guidelines, and identified deficiencies and potential pitfalls in how evaluations are carried out in practice.

#### Main problems include:

1. **Reproducibility of evaluation**:
   - Fuzzing relies heavily on randomness, which makes experimental results difficult to reproduce. Many papers did not fully account for this, which calls the reliability and reproducibility of their evaluation results into question.
   - The experimental setup, the resources allocated, and the choice of initial seeds can all affect the reproducibility of experimental results.
2. **Adoption of existing guidelines**:
   - Although prior work (such as the 2018 paper by Klees et al.) has proposed detailed evaluation guidelines, these guidelines have not been widely followed in practice. For example, many papers omit statistical tests and do not address systematic errors.
   - Some papers used inappropriate benchmarks or target programs in their evaluation, or did not repeat their experiments often enough to rule out the influence of random factors.
3. **Selection of evaluation indicators**:
   - Many papers rely on proxy indicators (such as code coverage) while ignoring the core goal of fuzzing, which is finding vulnerabilities. This practice can lead to a misjudgment of a fuzzer's performance.
   - Some papers used data sets with artificially injected vulnerabilities, which are no longer recommended because these vulnerabilities are usually too simple to reflect the complexity of real-world bugs.
4. **Defects in actual evaluation**:
   - The authors attempted to reproduce some of the evaluation results of eight fuzzing papers and found multiple problems, such as unreasonable experimental settings, unfair resource allocation, and improper selection of initial seeds.
   - These problems indicate that current fuzzing evaluation practice faces many challenges and needs to improve to ensure the scientific rigor and reproducibility of evaluation results.

### Conclusions and Recommendations

Based on the above analysis, the authors proposed a series of updated guidelines and best practices to help future research evaluate fuzzing methods more rigorously and reliably. These recommendations include but are not limited to:

- **Increase the number of experimental repetitions**: To reduce the impact of randomness on results, run multiple trials of each experiment.
- **Use statistical methods**: Verify the effectiveness of new methods through appropriate statistical tests (such as the Mann-Whitney U test).
- **Select appropriate evaluation indicators**: Use the ability to find vulnerabilities as the main evaluation criterion, supplemented by secondary indicators such as code coverage.
- **Allocate resources fairly**: Ensure that all compared fuzzers run under the same resource conditions.
- **Transparently disclose experimental details**: Record and publish the experimental settings, initial seeds, resources used, and similar details so that other researchers can reproduce the experiments.

Through these measures, the authors hope to make evaluation in the fuzzing field more rigorous, thereby improving the quality and credibility of research results.
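
The recommendation on statistical tests can be illustrated with a minimal Python sketch. It is not taken from the paper: the coverage numbers are made-up placeholders, the function names `mann_whitney_u` and `a12` are hypothetical, and the p-value uses the normal approximation with tie and continuity corrections rather than an exact computation. The effect size reported by `a12` (the probability that a trial of the first fuzzer beats a trial of the second) is an addition for illustration and is not mentioned in the summary above.

```python
import math


def mann_whitney_u(a, b):
    """Return (U statistic for `a`, two-sided p-value via normal approximation)."""
    n, m = len(a), len(b)
    # U counts the pairs in which a trial of `a` beats a trial of `b`;
    # ties contribute one half.
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

    # Tie correction for the variance of U: sum t^3 - t over groups of equal values.
    pooled = sorted(list(a) + list(b))
    total = n + m
    tie_term = 0
    i = 0
    while i < total:
        j = i
        while j < total and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        tie_term += t ** 3 - t
        i = j

    mean_u = n * m / 2.0
    var_u = n * m / 12.0 * ((total + 1) - tie_term / (total * (total - 1)))
    if var_u == 0:  # all observations identical: no evidence of a difference
        return u_a, 1.0

    # Continuity-corrected z-score; erfc gives the two-sided tail probability.
    z = max(abs(u_a - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    return u_a, math.erfc(z / math.sqrt(2.0))


def a12(a, b):
    """Effect size: estimated probability that a trial of `a` beats a trial of `b`."""
    u_a, _ = mann_whitney_u(a, b)
    return u_a / (len(a) * len(b))


# Hypothetical final branch-coverage counts from 10 independent trials per fuzzer.
FUZZER_A = [1510, 1498, 1532, 1475, 1520, 1541, 1489, 1503, 1527, 1516]
FUZZER_B = [1450, 1462, 1481, 1439, 1470, 1455, 1492, 1447, 1466, 1458]

if __name__ == "__main__":
    u, p = mann_whitney_u(FUZZER_A, FUZZER_B)
    print(f"U = {u}, p = {p:.4f}, A12 = {a12(FUZZER_A, FUZZER_B):.2f}")
```

For a real evaluation, an established implementation such as `scipy.stats.mannwhitneyu` is a more reasonable choice than hand-rolled code; the sketch only shows what the test compares, namely the per-trial results of two fuzzers run repeatedly under the same conditions.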