Line-level Semantic Structure Learning for Code Vulnerability Detection

Ziliang Wang, Ge Li, Jia Li, Yihong Dong, Yingfei Xiong, Zhi Jin
2024-11-08
Abstract: Unlike the flow structure of natural languages, programming languages have an inherent rigidity in structure and grammar. However, existing detection methods based on pre-trained models typically treat code as a natural language sequence, ignoring its unique structural information. This hinders the models from understanding the code's semantic and structural information. To address this problem, we introduce the Code Structure-Aware Network through Line-level Semantic Learning (CSLS), which comprises four components: code preprocessing, global semantic awareness, line semantic awareness, and line semantic structure awareness. The preprocessing step transforms the code into two types of text: global code text and line-level code text. Unlike typical preprocessing methods, CSLS retains structural elements such as newlines and indent characters to enhance the model's perception of code lines during global semantic awareness. For line semantic structure awareness, the CSLS network emphasizes capturing structural relationships between line semantics. Unlike structural modeling methods based on code blocks (control flow graphs) or tokens, CSLS uses line semantics as the minimum structural unit to learn nonlinear structural relationships, thereby improving the accuracy of code vulnerability detection. We conducted extensive experiments on vulnerability detection datasets from real projects. The CSLS model outperforms the state-of-the-art baselines in code vulnerability detection, achieving 70.57% accuracy on the Devign dataset and a 49.59% F1 score on the Reveal dataset.
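The preprocessing step described in the abstract can be illustrated with a minimal sketch. The function name `preprocess`, the decision to drop blank lines from the line-level view, and the stripping of trailing whitespace are all assumptions made for illustration; the abstract only states that two text views are produced and that newlines and indent characters are retained.

```python
from typing import List, Tuple


def preprocess(code: str) -> Tuple[str, List[str]]:
    """Return (global_text, line_texts) for one code snippet.

    Hypothetical helper: an illustration of the two text views,
    not the authors' implementation.
    """
    # Normalise line endings only; newlines and indentation are kept.
    normalized = code.replace("\r\n", "\n").replace("\r", "\n")

    # Global view: the whole snippet, structure characters included.
    global_text = normalized

    # Line-level view: one entry per non-blank line (assumption),
    # with leading indentation preserved.
    line_texts = [
        line.rstrip() for line in normalized.split("\n") if line.strip()
    ]
    return global_text, line_texts


sample = (
    "int f(char *s) {\n"
    "\tchar buf[8];\n"
    "\n"
    "\tstrcpy(buf, s);\n"
    "\treturn 0;\n"
    "}\n"
)
global_text, line_texts = preprocess(sample)
for line in line_texts:
    print(repr(line))
```

A typical natural-language-style pipeline would collapse whitespace before tokenization; here the tab characters and line boundaries survive in both views, so a downstream encoder could in principle see where each line begins and how deeply it is nested.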
Software Engineering
What problem does this paper attempt to address?
This paper addresses the fact that existing code vulnerability detection methods ignore code structure information. Specifically, detection methods based on pre-trained models usually treat code as a natural-language sequence and ignore the inherent structure and syntax of programming languages. This limits the model's ability to understand the semantic and structural information in code, which in turn reduces the accuracy of vulnerability detection.

To solve this problem, the authors propose a new framework: **Code Structure-Aware Network through Line-level Semantic Learning (CSLS)**. The framework enhances code structure awareness through four main components:

1. **Code Preprocessing**:
   - Convert the code into two text forms: global code text and line-level code text.
   - Retain structural elements (such as line breaks and indentation characters) during preprocessing to enhance the model's perception of code lines.
2. **Global Semantic Awareness**:
   - Use a pre-trained model to process the global code text and capture global semantic and structural information.
3. **Line Semantic Awareness**:
   - Use a pre-trained model to process the line-level code text and capture the semantics of each line of code.
4. **Line Semantic Structure Awareness**:
   - Use a Transformer module to model the line-level semantic structure in code fragments and learn nonlinear structural relationships.

With these components, the CSLS model outperforms existing baseline models on vulnerability detection datasets drawn from real-world projects. In particular, it achieves an accuracy of 70.57% on the Devign dataset and an F1 score of 49.59% on the Reveal dataset. The experimental results indicate that retaining and using code structure information is important for the performance of code vulnerability detection models.

### Summary

The core problem of this paper is that existing vulnerability detection methods fail to fully use the structural information of code, so the model understands code semantics and structure poorly. To this end, the authors propose the CSLS framework, which improves the accuracy of code vulnerability detection through multi-level semantic and structure awareness.
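To make the line semantic structure awareness idea more concrete, the sketch below applies single-head scaled dot-product self-attention across per-line vectors, which is the core operation of a Transformer layer. It is a simplified illustration under stated assumptions, not the CSLS architecture: there are no learned projections, no multi-head split, and no feed-forward sublayer, and the names `line_self_attention` and `mean_pool` as well as the toy input vectors are hypothetical. In the paper, the line vectors would come from a pre-trained model rather than being written by hand.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def line_self_attention(line_vectors: List[Vector]) -> List[Vector]:
    """Mix every line vector with all others (hypothetical helper).

    Queries, keys, and values are the raw line vectors, so each output
    is a weighted average of the inputs.
    """
    dim = len(line_vectors[0])
    scale = math.sqrt(dim)
    mixed = []
    for query in line_vectors:
        # Similarity of this line to every line, including itself.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in line_vectors
        ]
        weights = softmax(scores)
        mixed.append([
            sum(w * value[i] for w, value in zip(weights, line_vectors))
            for i in range(dim)
        ])
    return mixed


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average the line vectors into one snippet-level vector (assumption)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Toy stand-ins for line embeddings; real ones would be model outputs.
toy_lines = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = line_self_attention(toy_lines)
snippet_vector = mean_pool(attended)
```

Because attention connects every line to every other line regardless of distance, a line that frees a buffer and a later line that reuses it can influence each other directly, which is one way to read the "nonlinear structural relationships" the text refers to.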