Scaling Inter-procedural Dataflow Analysis on the Cloud

Zewen Sun, Yujin Zhang, Duanchen Xu, Yiyu Zhang, Yun Qi, Yueyang Wang, Yi Li, Zhaokang Wang, Yue Li, Xuandong Li, Zhiqiang Zuo, Qingda Lu, Wenwen Peng, Shengjian Guo
2024-12-17
Abstract: Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a variety of applications, such as bug detection, privacy analysis, and program comprehension. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise dedicated distributed worklist algorithms for both whole-program analysis and incremental analysis. We implement these algorithms and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow: it can finish analyzing programs with millions of lines of code in minutes. Compared with the state of the art, BigDataflow achieves much higher analysis efficiency.
Programming Languages, Operating Systems, Software Engineering
What problem does this paper attempt to address?
The main problem this paper addresses is **how to perform inter-procedural dataflow analysis on large-scale programs, and in particular how to use distributed computing resources in the cloud to accelerate and scale up this kind of analysis**.

### Specific problems include:

1. **High memory consumption**: Dataflow analysis of modern large-scale programs (for example, programs with a million lines of code) must maintain a large number of dataflow facts, which makes memory consumption a serious bottleneck. Even with sparse representations, some types of analysis still occupy hundreds of gigabytes of memory.
2. **Computation-intensive**: Dataflow analysis applies a transfer function to each program statement, and this process is highly computation-intensive. For flow-sensitive analysis in particular, each transfer function evaluation can be very expensive. In pointer alias analysis, for example, the dataflow fact at each program point must capture the alias relationships between all variables in the entire program, which consumes many CPU cycles.
3. **Limitations of existing methods**: Previous works have attempted to accelerate dataflow analysis through distribution or parallelization, but most of them support only specific types of analysis or rely on shared-memory environments, so they cannot be applied directly to large-scale distributed clusters. In addition, existing parallel algorithms do not address task allocation, fault tolerance, or efficient communication between nodes, which makes them difficult to adapt to a distributed environment.

### Solutions proposed in the paper:

- **BigDataflow framework**: The authors propose a new distributed framework, BigDataflow, which uses large-scale distributed resources in the cloud to accelerate and scale up general inter-procedural dataflow analysis. The framework is based on the vertex-centric graph processing model and redesigns the classic worklist algorithm for the distributed environment.
- **Incremental analysis**: To further improve efficiency, the paper also proposes an incremental distributed algorithm, which re-analyzes a program incrementally and thereby reduces redundant computation.
- **Optimized distributed worklist algorithm**: The data collection strategy is optimized so that only the dataflow facts updated in the previous step are transferred and merged, which significantly reduces unnecessary data transfer and merge operations and improves scalability and performance.

In summary, this paper aims to remove the memory and computation bottlenecks of inter-procedural dataflow analysis on large-scale programs by using distributed computing resources in the cloud, and it provides an efficient distributed framework, BigDataflow, to achieve this goal.
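
To make the vertex-centric worklist idea more concrete, the following is a minimal single-process sketch in Python, not BigDataflow's actual implementation. It assumes a simple reaching-definitions analysis (set-union merge with gen/kill transfer functions) on a tiny, hypothetical four-node control-flow graph; the function name `run_supersteps`, the `inbox` message map, and the example graph are all illustrative assumptions. Each loop iteration plays the role of one superstep: only vertices that received messages are active, and a vertex forwards only the facts that are new since its last update, mirroring the "transfer and merge only updated facts" optimization described above. Forwarding deltas this way is valid here because the merge is a set union and the facts only grow.

```python
from collections import defaultdict

def run_supersteps(succ, gen, kill, entry):
    """Simulate a vertex-centric worklist for reaching definitions (sketch)."""
    in_facts = defaultdict(set)    # merged incoming facts per vertex
    out_facts = defaultdict(set)   # last computed outgoing facts per vertex
    visited = set()
    inbox = {entry: set()}         # messages for the current superstep
    while inbox:                   # no messages left means a fixed point
        next_inbox = defaultdict(set)
        for v, incoming in inbox.items():
            first_visit = v not in visited
            visited.add(v)
            in_facts[v] |= incoming                      # merge only what arrived
            new_out = gen[v] | (in_facts[v] - kill[v])   # transfer function
            if first_visit or new_out != out_facts[v]:
                delta = new_out - out_facts[v]           # facts new since last time
                out_facts[v] = new_out
                for s in succ.get(v, ()):
                    next_inbox[s] |= delta               # send only the update
        inbox = next_inbox
    return dict(out_facts)

# Hypothetical CFG: 1: x = 1 (d1); 2: loop head; 3: x = x + 1 (d3); 4: exit.
succ = {1: [2], 2: [3, 4], 3: [2], 4: []}
gen = {1: {"d1"}, 2: set(), 3: {"d3"}, 4: set()}
kill = {1: {"d3"}, 2: set(), 3: {"d1"}, 4: set()}
result = run_supersteps(succ, gen, kill, entry=1)
```

In this sketch, `result[v]` holds the definitions of `x` that reach the exit of vertex `v`. In a distributed setting, the vertices would be partitioned across workers and the `inbox` maps would become network messages exchanged between supersteps; an inter-procedural analysis would additionally add call and return edges to the graph.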