A Case Study of LLM for Automated Vulnerability Repair: Assessing Impact of Reasoning and Patch Validation Feedback

Ummay Kulsum, Haotian Zhu, Bowen Xu, Marcelo d'Amorim
2024-05-25
Abstract: Recent work in automated program repair (APR) proposes the use of reasoning and patch validation feedback to reduce the semantic gap between LLMs and the code under analysis. The idea has been shown to perform well for general APR, but its effectiveness in other particular contexts remains underexplored. In this work, we assess the impact of reasoning and patch validation feedback to LLMs in the context of vulnerability repair, an important and challenging task in security. To support the evaluation, we present VRpilot, an LLM-based vulnerability repair technique based on reasoning and patch validation feedback. VRpilot (1) uses a chain-of-thought prompt to reason about a vulnerability prior to generating patch candidates and (2) iteratively refines prompts according to the output of external tools (e.g., compiler, code sanitizers, test suite) on previously generated patches. To evaluate performance, we compare VRpilot against the state-of-the-art vulnerability repair techniques for C and Java using public datasets from the literature. Our results show that VRpilot generates, on average, 14% and 7.6% more correct patches than the baseline techniques on C and Java, respectively. We show, through an ablation study, that reasoning and patch validation feedback are critical. We report several lessons from this study and potential directions for advancing LLM-empowered vulnerability repair.
Software Engineering
What problem does this paper attempt to address?
The problem this paper addresses is: **how to use large language models (LLMs), together with reasoning and patch validation feedback, to improve automated vulnerability repair, an important and challenging task in security**.

### Specific problem description

1. **Semantic gap**: Existing LLMs understand code semantics poorly, especially on complex code-related tasks such as vulnerability repair. Because they often miss details specific to the code under analysis, the patches they generate may not behave as intended.
2. **Limitations of existing methods**: Prior research has shown that LLMs are effective for general program repair, but that effectiveness has not been fully verified in security contexts such as vulnerability repair. In particular, existing methods have limited ability to handle compilation errors, functional test failures, and security test failures.
3. **Improving repair effectiveness**: The paper proposes VRpilot, which aims to reduce the semantic gap between LLMs and the code under analysis by adding reasoning and patch validation feedback, thereby improving the success rate and accuracy of vulnerability repair.

### Main contributions of the paper

1. **The VRpilot tool**: VRpilot uses LLMs for vulnerability repair, built on reasoning and feedback. It uses chain-of-thought prompts to have the LLM reason about a vulnerability before generating patch candidates, and it iteratively refines its prompts using the output of external tools (such as compilers, code sanitizers, and test suites) on previously generated patches.
2. **Performance evaluation**: The paper compares VRpilot experimentally with existing state-of-the-art vulnerability repair techniques (such as CodexVR) on C and Java datasets. On average, VRpilot generates 14% (C) and 7.6% (Java) more correct patches than the baseline techniques.
3. **Ablation study**: An ablation study verifies the importance of both reasoning and feedback to VRpilot's performance. Both components have a significant impact on the proportion of reasonable patches generated, and removing either one leads to a significant decline in performance.

### Conclusion

By adding reasoning and patch validation feedback, the paper improves the performance of LLMs on vulnerability repair. Future research can explore how to incorporate more domain knowledge into LLMs to make them more effective for automated vulnerability repair.
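The two mechanisms summarized above, a chain-of-thought prompt followed by iterative refinement from tool output, can be illustrated with a minimal sketch. This is not VRpilot's actual implementation: the names `repair`, `build_reasoning_prompt`, `ValidationResult`, and `Validator`, the prompt wording, and the iteration budget are all hypothetical, and the LLM and the external tools are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ValidationResult:
    ok: bool
    message: str = ""


# Hypothetical stand-in for an external tool (compiler, sanitizer, test suite):
# it takes a candidate patch and reports whether the patch passed.
Validator = Callable[[str], ValidationResult]


def build_reasoning_prompt(vulnerable_code: str) -> str:
    """Chain-of-thought style prompt: ask for reasoning before the patch."""
    return (
        "Explain step by step why the following code is vulnerable, "
        "then propose a fixed version.\n\n" + vulnerable_code
    )


def repair(
    vulnerable_code: str,
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> patch
    validators: List[Validator],
    max_iters: int = 3,  # assumed budget, not taken from the paper
) -> Optional[str]:
    """Return the first patch that passes every validator, or None."""
    prompt = build_reasoning_prompt(vulnerable_code)
    for _ in range(max_iters):
        patch = generate(prompt)

        # Run the validators in order and stop at the first failure.
        failure = None
        for validate in validators:
            result = validate(patch)
            if not result.ok:
                failure = result
                break

        if failure is None:
            return patch

        # Feed the tool output back into the next prompt.
        prompt += (
            "\n\nThe previous patch was:\n" + patch
            + "\n\nIt failed validation with:\n" + failure.message
            + "\nRevise the patch."
        )
    return None
```

Ordering the validators from cheapest to most expensive (for example compiler, then sanitizer, then test suite) lets the loop report the earliest failure without running the later tools; that ordering is a design choice of this sketch rather than something the summary specifies.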