Scalable Defect Detection via Traversal on Code Graph

Zhengyao Liu, Xitong Zhong, Xingjing Deng, Shuo Hong, Xiang Gao, Hailong Sun
2024-06-12
Abstract: Detecting defects and vulnerabilities in the early stage has long been a challenge in software engineering. Static analysis, a technique that inspects code without execution, has emerged as a key strategy to address this challenge. Among recent advancements, the use of graph-based representations, particularly the Code Property Graph (CPG), has gained traction due to its comprehensive depiction of code structure and semantics. Despite the progress, existing graph-based analysis tools still face performance and scalability issues. The main bottleneck lies in the size and complexity of the CPG, which makes analyzing large codebases inefficient and memory-consuming. Also, query rules used by the current tools can be over-specific. Hence, we introduce QVoG, a graph-based static analysis platform for detecting defects and vulnerabilities. It employs a compressed CPG representation to maintain a reasonable graph size, thereby enhancing the overall query efficiency. Based on the CPG, it also offers a declarative query language to simplify the queries. Furthermore, it takes a step forward to integrate machine learning to enhance the generality of vulnerability detection. For projects consisting of 1,000,000+ lines of code, QVoG can complete analysis in approximately 15 minutes, as opposed to 19 minutes with CodeQL.
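To make the idea of a compressed CPG more concrete, here is a minimal Python sketch. It is not QVoG's actual algorithm, which the abstract does not describe. It assumes one plausible strategy: keep only statement-level nodes, fold each statement's expression subtree into a property of that statement, and lift control-flow and data-flow edges to the owning statements. The names `Node`, `owner_of`, and `compress` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str]  # (source id, target id, label such as "CFG" or "DFG")


@dataclass
class Node:
    id: int
    kind: str                     # toy classification: "stmt" or "expr"
    code: str
    parent: Optional[int] = None  # AST parent id; None for top-level statements


def owner_of(nodes: Dict[int, Node], node_id: int) -> int:
    """Walk up the AST parents until a statement node is reached."""
    while nodes[node_id].kind != "stmt":
        parent = nodes[node_id].parent
        if parent is None:
            raise ValueError(f"expression node {node_id} has no enclosing statement")
        node_id = parent
    return node_id


def compress(nodes: Dict[int, Node], edges: List[Edge]):
    """Assumed compression: one node per statement, expressions kept as properties."""
    kept = {nid: n for nid, n in nodes.items() if n.kind == "stmt"}
    folded: Dict[int, List[str]] = {nid: [] for nid in kept}
    for nid, n in nodes.items():
        if n.kind != "stmt":
            folded[owner_of(nodes, nid)].append(n.code)

    lifted = set()
    for src, dst, label in edges:
        s, d = owner_of(nodes, src), owner_of(nodes, dst)
        if s != d:                # edges inside one statement collapse away
            lifted.add((s, d, label))
    return kept, folded, sorted(lifted)


# Toy program: `x = read()` followed by `sink(x + 1)`.
toy_nodes = {
    1: Node(1, "stmt", "x = read()"),
    2: Node(2, "expr", "x", parent=1),
    3: Node(3, "expr", "read()", parent=1),
    4: Node(4, "stmt", "sink(x + 1)"),
    5: Node(5, "expr", "x + 1", parent=4),
    6: Node(6, "expr", "x", parent=5),
}
toy_edges = [(1, 4, "CFG"), (3, 2, "DFG"), (2, 6, "DFG")]

kept, folded, lifted = compress(toy_nodes, toy_edges)
print(len(toy_nodes), "->", len(kept), "nodes;", len(toy_edges), "->", len(lifted), "edges")
```

In this toy case the six nodes shrink to two and the intra-statement data-flow edge disappears, which is the kind of size reduction the abstract attributes to the compressed representation.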
Software Engineering
What problem does this paper attempt to address?
This paper aims to address the challenges of detecting defects and vulnerabilities in the early stages of software engineering. Specifically, existing graph-query-based analysis tools face performance and scalability issues when dealing with large-scale codebases. The main bottleneck lies in the size and complexity of the Code Property Graph (CPG), which makes the analysis process inefficient and memory-consuming. In addition, the query rules used by current tools may be too specific, so they generalize poorly and produce false positives or false negatives. To solve these problems, the authors propose QVoG, a static analysis platform based on graph queries for detecting defects and vulnerabilities. The main innovations of QVoG include:
1. **Compressed Code Property Graph**: Compress the structure of the CPG by retaining only the necessary information, reducing the number of nodes and edges and thereby improving query efficiency.
2. **Dedicated Domain-Specific Language**: Design a declarative, SQL-like DSL to simplify the writing of query rules.
3. **Language-independent Query Interface**: Provide a consistent query interface that supports multiple programming languages, reducing the cost of supporting new languages.
4. **Combination of Graph Query and Deep Learning**: Utilize machine learning to enhance the generalization ability of queries and improve detection accuracy.
5. **Open-source Tool**: QVoG will be fully open-source, unlike CodeQL and Joern, which have partially closed-source components.
Through these improvements, QVoG can exhibit higher efficiency and accuracy when handling large-scale projects. For example, for a project with 1,500,000 lines of code, QVoG can complete CPG extraction in approximately 15 minutes, while CodeQL requires 19 minutes; QVoG also consumes much less memory than Joern.
In terms of accuracy, QVoG achieves an average precision of 90% and a recall of 95% on the Juliet test suite, outperforming Joern and CodeQL.
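The declarative, SQL-like query style mentioned among the innovations can be illustrated with a small Python sketch. This is not QVoG's DSL, whose syntax is not given here. It only shows the general FROM / WHERE / SELECT shape of such a query over statement-level graph nodes. The class names `StmtNode` and `Query` and the method names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class StmtNode:
    id: int
    kind: str   # hypothetical node kinds, e.g. "assign" or "call"
    code: str


class Query:
    """Tiny FROM / WHERE / SELECT pipeline over an iterable of nodes."""

    def __init__(self) -> None:
        self._source: List[StmtNode] = []
        self._predicates: List[Callable[[StmtNode], bool]] = []
        self._projection: Callable[[StmtNode], Any] = lambda node: node

    def from_(self, nodes: Iterable[StmtNode]) -> "Query":
        self._source = list(nodes)
        return self

    def where(self, predicate: Callable[[StmtNode], bool]) -> "Query":
        # Multiple where() calls are combined with logical AND.
        self._predicates.append(predicate)
        return self

    def select(self, projection: Callable[[StmtNode], Any]) -> "Query":
        self._projection = projection
        return self

    def run(self) -> List[Any]:
        return [
            self._projection(node)
            for node in self._source
            if all(pred(node) for pred in self._predicates)
        ]


graph_nodes = [
    StmtNode(1, "assign", "buf = input()"),
    StmtNode(2, "call", "strcpy(dst, buf)"),
    StmtNode(3, "call", "puts(dst)"),
]

# "Select id and code from all call statements that invoke strcpy."
findings = (
    Query()
    .from_(graph_nodes)
    .where(lambda n: n.kind == "call")
    .where(lambda n: n.code.startswith("strcpy"))
    .select(lambda n: (n.id, n.code))
    .run()
)
print(findings)
```

The rule author states what to match rather than how to traverse the graph, which is the property a declarative query language is meant to provide. A real engine would also need data-flow and control-flow predicates, which this sketch omits.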