Vulnerability detection in Java source code using a quantum convolutional neural network with self-attentive pooling, deep sequence, and graph-based hybrid feature extraction

Shumaila Hussain, Muhammad Nadeem, Junaid Baber, Mohammed Hamdi, Adel Rajab, Mana Saleh Al Reshan, Asadullah Shaikh
DOI: https://doi.org/10.1038/s41598-024-56871-z
IF: 4.6
2024-03-30
Scientific Reports
Abstract: Software vulnerabilities pose a significant threat to system security, necessitating effective automatic detection methods. Current techniques face challenges such as dependency issues, language bias, and coarse detection granularity. This study presents a novel deep learning-based vulnerability detection system for Java code. Leveraging hybrid feature extraction through graph and sequence-based techniques enhances semantic and syntactic understanding. The system utilizes control flow graphs (CFG), abstract syntax trees (AST), program dependencies (PD), and greedy longest-match first vectorization for graph representation. A hybrid neural network (GCN-RFEMLP) and the pre-trained CodeBERT model extract features, feeding them into a quantum convolutional neural network with self-attentive pooling. The system addresses issues like long-term information dependency and coarse detection granularity, employing intermediate code representation and inter-procedural slice code. To mitigate language bias, a benchmark software assurance reference dataset is employed. Evaluations demonstrate the system's superiority, achieving 99.2% accuracy in detecting vulnerabilities, outperforming benchmark methods. The proposed approach comprehensively addresses vulnerabilities, including improper input validation, missing authorizations, buffer overflow, cross-site scripting, and SQL injection attacks listed by common weakness enumeration (CWE).
multidisciplinary sciences
What problem does this paper attempt to address?
The paper targets three weaknesses of current vulnerability detection techniques, specifically for Java source code: dependency issues, language bias, and coarse detection granularity. It proposes a deep learning-based detection system that addresses them as follows:

1. **Hybrid Feature Extraction**: Graph-based and sequence-based techniques are combined to strengthen semantic and syntactic understanding. The graph representation draws on control flow graphs (CFG), abstract syntax trees (AST), program dependencies (PD), and greedy longest-match first vectorization.
2. **Quantum Convolutional Neural Network with Self-Attentive Pooling**: The extracted features are fed into a quantum convolutional neural network (QCNN) with self-attentive pooling. Together with intermediate code representation and inter-procedural slice code, this addresses long-term information dependency and coarse detection granularity.
3. **Pre-trained Model**: The pre-trained CodeBERT model extracts sequence features alongside the hybrid GCN-RFEMLP network, with the aim of reducing semantic gaps and improving detection accuracy.
4. **Benchmark Dataset**: The Software Assurance Reference Dataset (SARD) is used for training and testing to mitigate language bias, and the dataset is preprocessed before use.

The system detects several vulnerability types listed by the Common Weakness Enumeration (CWE), including improper input validation, SQL injection, missing authorization, cross-site scripting, and buffer overflow. The reported detection accuracy is 99.2%, which outperforms the benchmark methods used for comparison.
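To make the greedy longest-match first vectorization mentioned above more concrete, here is a minimal Python sketch. It assumes a WordPiece-style scheme in which each code token is split by repeatedly taking the longest vocabulary entry that matches at the current position. The toy vocabulary, the `##` continuation marker, the `[UNK]` fallback, and the function names are all illustrative assumptions; the summary does not specify the vocabulary or conventions the authors actually used.

```python
# Illustrative sketch only: the vocabulary, the "##" continuation marker and
# the "[UNK]" fallback are assumptions, not details taken from the paper.

def greedy_longest_match(token, vocab, unk="[UNK]", prefix="##"):
    """Split one code token into sub-tokens, longest vocabulary match first."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = token[start:end]
            if start > 0:
                # Non-initial pieces carry the continuation marker.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def vectorize(tokens, vocab):
    """Map a sequence of code tokens to a flat list of vocabulary ids."""
    ids = []
    for token in tokens:
        for piece in greedy_longest_match(token, vocab):
            ids.append(vocab[piece])
    return ids


# Hypothetical toy vocabulary for a fragment of Java code.
TOY_VOCAB = {
    "[UNK]": 0, "get": 1, "##Param": 2, "##eter": 3,
    "(": 6, ")": 7, "request": 8, ".": 9,
}

pieces = greedy_longest_match("getParameter", TOY_VOCAB)
ids = vectorize(["request", ".", "getParameter", "(", ")"], TOY_VOCAB)
```

With this toy vocabulary, `getParameter` is split into `get`, `##Param`, and `##eter`, and the five-token fragment becomes a sequence of seven integer ids. In a real pipeline these ids would typically be padded or truncated to a fixed length before being passed to a model.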