Software vulnerable functions discovery based on code composite feature
Xue Yuan,Guanjun Lin,Huan Mei,Yonghang Tai,Jun Zhang
DOI: https://doi.org/10.1016/j.jisa.2024.103718
IF: 4.96
2024-02-15
Journal of Information Security and Applications
Abstract: Vulnerability identification is crucial to protecting software systems from attacks. Although numerous learning-based solutions have been proposed to assist in vulnerability identification, these approaches often struggle because real-world vulnerability data is scarce. To extract as much vulnerability information as possible from limited data, we obtain the characteristics of vulnerabilities from different forms of code by leveraging two distinct deep neural models. First, source code functions are treated as textual sequences, and a Gated Recurrent Unit (GRU) is applied to extract serialized features. Then, the Abstract Syntax Trees (ASTs) of these functions, which reflect the code structure, are fed to a Gated Graph Recurrent Network (GGRN) to obtain structural features indicative of software vulnerability. To better handle data imbalance in real-world scenarios, we employ Random Forest (RF) to build a predictive model on the concatenation of the serialized and structural features extracted by the GRU and the GGRN. To evaluate the proposed approach, we collected function-level samples from 12 open-source projects and compared the proposed method with a series of baselines, including popular learning-based methods and static analysis tools. The empirical results demonstrate that our proposed approach outperforms the baselines and can identify more vulnerabilities.
computer science, information systems