A Combined Feature Embedding Tools for Multi-Class Software Defect and Identification

Md. Fahim Sultan, Tasmin Karim, Md. Shazzad Hossain Shaon, Mohammad Wardat, Mst. Shapna Akter
2024-11-27
Abstract: In software, a vulnerability is a defect in a program that attackers might utilize to acquire unauthorized access, alter system functions, and acquire information. These vulnerabilities arise from programming faults, design flaws, incorrect setups, and a lack of security protective measures. To mitigate these vulnerabilities, regular software upgrades, code reviews, safe development techniques, and the use of security tools to find and fix problems have been important. Several ways have been delivered in recent studies to address difficulties related to software vulnerabilities. However, previous approaches have significant limitations, notably in feature embedding and precisely recognizing specific vulnerabilities. To overcome these drawbacks, we present CodeGraphNet, an experimental method that combines GraphCodeBERT and Graph Convolutional Network (GCN) approaches, where, CodeGraphNet reveals data in a high-dimensional vector space, with comparable or related properties grouped closer together. This method captures intricate relationships between features, providing for more exact identification and separation of vulnerabilities. Using this feature embedding approach, we employed four machine learning models, applying both independent testing and 10-fold cross-validation. The DeepTree model, which is a hybrid of a Decision Tree and a Neural Network, outperforms state-of-the-art approaches. In additional validation, we evaluated our model using feature embeddings from LSA, GloVe, FastText, CodeBERT and GraphCodeBERT, and found that the CodeGraphNet method presented improved vulnerability identification with 98% of accuracy. Our model was tested on a real-time dataset to determine its capacity to handle real-world data and to focus on defect localization, which might influence future studies.
Software Engineering
What problem does this paper attempt to address?
The problem this paper addresses is multi-class defect identification in software vulnerability detection, particularly for vulnerabilities belonging to specific CWE (Common Weakness Enumeration) categories in C/C++ source code. Existing methods have significant limitations in feature embedding and in accurately identifying specific vulnerabilities. To overcome these shortcomings, the authors propose a new method named CodeGraphNet.

### Main Problems

1. **Limitations of Existing Methods**:
   - Existing methods rely mainly on static analysis of code features and struggle to capture complex semantic and structural information.
   - Some methods use control-flow testing but adapt poorly to the rapidly changing threat environment and the complexity of modern software systems.
   - Although deep-learning methods perform well in some respects, they perform poorly on deeply nested code patterns (such as deeply nested loops and recursive calls), producing a large number of incorrect results.
2. **Need for More Precise Vulnerability Detection**:
   - A method is needed that can identify and locate vulnerabilities more accurately, especially serious defects such as buffer overflows (CWE-119) and NULL pointer dereferences (CWE-476).
   - A method is needed that can represent code features in a high-dimensional vector space, so that features with similar or related properties are grouped together and the complex relationships between features are captured better.

### Solutions

To solve the above problems, the authors propose the CodeGraphNet method, which combines GraphCodeBERT and a Graph Convolutional Network (GCN) as follows:

- **Feature Embedding**: Code fragments are converted into high-dimensional vector representations by GraphCodeBERT, and a GCN captures the complex dependencies in the code.
- **Model Architecture**: CodeGraphNet adopts a Transformer-based graph structure in which each code line is a node and edges represent the execution flow. The resulting adjacency matrix captures the relationships between code lines.
- **Classification Model**: Four machine-learning models are trained on the embeddings and evaluated with both independent testing and 10-fold cross-validation. Among them, the DeepTree model (a hybrid of a decision tree and a neural network) performs best, achieving an accuracy of 98%.
- **Vulnerability Location**: The LIME method highlights the specific code lines that may contain vulnerabilities, helping developers locate and fix problems quickly.

### Summary

This research aims to bridge the gap between traditional vulnerability scanners and real-world software defects by introducing powerful feature extraction techniques that improve the accuracy of vulnerability detection, especially in multi-class defect identification and localization. This not only helps to discover potential security problems but also provides guidance for the development of future security tools.
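
The graph construction described under Model Architecture (one node per code line, edges for execution flow, an adjacency matrix) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy edge list, the two-dimensional feature vectors standing in for GraphCodeBERT embeddings, and the weight matrix are all hypothetical. The sketch also assumes undirected edges and the commonly used symmetric normalization with self-loops, which the summary does not confirm.

```python
import math


def build_adjacency(num_lines, flow_edges):
    """One node per code line; each (src, dst) pair is an execution-flow edge.
    Edges are stored as undirected here, which is an assumption of this sketch."""
    adj = [[0.0] * num_lines for _ in range(num_lines)]
    for src, dst in flow_edges:
        adj[src][dst] = 1.0
        adj[dst][src] = 1.0
    return adj


def normalize_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of an (n x m) and an (m x p) matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph-convolution step: ReLU(A_norm @ features @ weights)."""
    propagated = matmul(normalize_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]


# Toy example: four code lines, where line 1 branches to lines 2 and 3.
flow_edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
adj = build_adjacency(4, flow_edges)

# Hypothetical per-line embeddings (real ones would be high-dimensional).
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical weights; in a trained model these would be learned.
weights = [[0.2, -0.1], [0.4, 0.3]]

hidden = gcn_layer(adj, features, weights)
for line_no, vector in enumerate(hidden):
    print(line_no, [round(v, 3) for v in vector])
```

After one layer, each line's vector mixes its own features with those of the lines it is connected to. This is the sense in which a GCN can capture dependencies between code lines before the vectors are passed to a classifier.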