Representation vs. Model: What Matters Most for Source Code Vulnerability Detection

Wei Zheng, Abubakar Omari Abdallah Semasaba, Xiaoxue Wu, Samuel Akwasi Agyemang, Tao Liu, Yuan Ge
DOI: https://doi.org/10.1109/saner50967.2021.00082
2021-03-01
Abstract: Vulnerabilities in software source code are critical issues in software engineering. Coping with them is becoming more challenging owing to the complexity and volume of that code. Deep learning has gained popularity over the years as a means of addressing such issues. In this paper, we evaluate vulnerability detection performance across source code representations and examine how Machine Learning (ML) strategies can improve that performance. Our experiment combines three Deep Neural Networks (DNNs) with five source code representations: Abstract Syntax Trees (ASTs), Code Gadgets (CGs), Semantics-based Vulnerability Candidates (SeVCs), Lexed Code Representations (LCRs), and Composite Code Representations (CCRs). Experimental results show that employing different ML strategies in conjunction with the base model structure influences performance to varying degrees. However, ML-based techniques handle class imbalance poorly when used with source code representations for software vulnerability detection.