Abstract:

Context: Identifying potentially vulnerable code is important for improving the security of software systems. However, manual detection of software vulnerabilities requires expert knowledge and is time-consuming, so it must be supported by automated techniques.

Objective: Such automated vulnerability detection techniques should achieve high accuracy, point developers directly to the vulnerable code fragments, scale to real-world software, generalize beyond a single software project, and require little or no setup or configuration effort.

Method: In this article, we present Vudenc (Vulnerability Detection with Deep Learning on a Natural Codebase), a deep learning-based vulnerability detection tool that automatically learns features of vulnerable code from a large, real-world Python codebase. Vudenc applies a word2vec model to identify semantically similar code tokens and to provide a vector representation. A network of long short-term memory (LSTM) cells is then used to classify vulnerable code token sequences at a fine-grained level, highlight the specific areas in the source code that are likely to contain vulnerabilities, and provide confidence levels for its predictions.

Results: To evaluate Vudenc, we used 1,009 vulnerability-fixing commits from different GitHub repositories for training. The commits cover seven types of vulnerabilities: SQL injection, XSS, command injection, XSRF, remote code execution, path disclosure, and open redirect. In the experimental evaluation, Vudenc achieves a recall of 78%–87%, a precision of 82%–96%, and an F1 score of 80%–90%. Vudenc's code, the datasets for the vulnerabilities, and the Python corpus for the word2vec model are available for reproduction.

Conclusions: Our experimental results suggest that Vudenc can outperform most of its competitors in vulnerability detection on real-world software. Other approaches achieved comparable accuracy only on synthetic benchmarks, within single projects, or at a much coarser level of granularity, such as entire source code files.
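To make the idea of fine-grained, token-level detection with confidence levels more concrete, here is a minimal Python sketch of the outer loop such a tool might run. It is not Vudenc's implementation. The window size, step, and threshold are assumed values, and `toy_confidence` is a hypothetical hand-written stand-in for the trained word2vec embedding and LSTM classifier, so only the overall shape (tokenize, slide a window over the tokens, score each window, map flagged windows back to source lines) should be read as illustrative.

```python
import io
import tokenize


def code_tokens(source):
    """Return (text, line_number) pairs for the meaningful tokens in Python source."""
    skip = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [(tok.string, tok.start[0])
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]


def windows(tokens, size=8, step=4):
    """Yield overlapping token windows; size and step are assumed values."""
    for start in range(0, len(tokens), step):
        yield tokens[start:start + size]


def toy_confidence(window):
    """Hypothetical scorer standing in for the embedding + LSTM model.

    It only checks whether a window both formats a string and calls
    execute(); a real model would learn such patterns from data.
    """
    texts = [text for text, _ in window]
    hits = int("execute" in texts) + int("%" in texts or "+" in texts)
    return {0: 0.05, 1: 0.5, 2: 0.95}[hits]


def flag_lines(source, model=toy_confidence, threshold=0.8):
    """Map line numbers to the highest confidence of any flagged window touching them."""
    flagged = {}
    for window in windows(code_tokens(source)):
        confidence = model(window)
        if confidence >= threshold:
            for _, line in window:
                flagged[line] = max(flagged.get(line, 0.0), confidence)
    return flagged


# Illustrative input: a query built with string formatting, then executed.
SAMPLE = (
    "def find_user(cursor, name):\n"
    "    query = \"SELECT * FROM users WHERE name = '%s'\" % name\n"
    "    cursor.execute(query)\n"
    "    return cursor.fetchone()\n"
)

for line_number, confidence in sorted(flag_lines(SAMPLE).items()):
    print(f"line {line_number}: confidence {confidence:.2f}")
```

Reporting per-line confidences rather than a single verdict per file is what lets a tool of this kind point a developer at the specific fragment to review.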

Detecting code vulnerabilities by learning from large-scale open source repositories

Automated software vulnerability detection with machine learning

Survey of Source Code Vulnerability Analysis Based on Deep Learning

Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning?

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

DeepVulSeeker: A novel vulnerability identification framework via code graph structure and pre-training mechanism

Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection

Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation

Systematic Analysis of Deep Learning Model for Vulnerable Code Detection

Software Vulnerability Mining and Analysis Based on Deep Learning

Vulnerability Detection in C/C++ Code with Deep Learning

Automated Vulnerability Detection in Source Code Using Minimum Intermediate Representation Learning

Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection

Machine-learning supported vulnerability detection in source code

VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python

SQVDT: A Scalable Quantitative Vulnerability Detection Technique for Source Code Security Assessment

Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study

DCDetector: An IoT terminal vulnerability mining system based on distributed deep ensemble learning under source code representation

Causative Insights into Open Source Software Security using Large Language Code Embeddings and Semantic Vulnerability Graph