Abstract: Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a variety of applications, such as bug detection, privacy analysis, and program comprehension. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise dedicated distributed worklist algorithms for both whole-program analysis and incremental analysis. We implement these algorithms and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow: it can finish analyzing programs with millions of lines of code in minutes. Compared with the state of the art, BigDataflow achieves substantially higher analysis efficiency.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is **the challenge of performing inter-procedural dataflow analysis on large-scale programs, and in particular how to use distributed computing resources in a cloud environment to accelerate and scale up this kind of analysis**.

### Specific problems include:

1. **High memory consumption**:
   - Dataflow analysis of modern large-scale programs (such as programs with a million lines of code) needs to maintain a large number of dataflow facts, which makes memory consumption a serious bottleneck. Even with sparse representations, some types of analysis still occupy hundreds of gigabytes of memory.
2. **Computation-intensive**:
   - Dataflow analysis applies a transfer function to each program statement, and this process is highly computation-intensive. For flow-sensitive analysis in particular, each transfer function evaluation can be very expensive. For example, in pointer alias analysis, the dataflow facts at each program point need to capture the alias relationships between all variables in the entire program, which consumes many CPU cycles.
3. **Limitations of existing methods**:
   - Although previous works have attempted to accelerate dataflow analysis through distribution or parallelization, most of them support only specific types of analysis or rely on shared-memory environments, so they cannot be applied directly to large-scale distributed clusters. In addition, existing parallel algorithms do not account for task allocation, fault tolerance, or efficient communication between nodes, which makes them difficult to adapt to a distributed environment.

### Solutions proposed in the paper:

- **BigDataflow framework**: The authors propose a new distributed framework, BigDataflow, which uses large-scale distributed resources in a cloud environment to accelerate and scale up general inter-procedural dataflow analysis. The framework is based on the vertex-centric graph processing model and redesigns the classic worklist algorithm for the distributed environment.
- **Incremental analysis**: To further improve efficiency, the paper also proposes an incremental distributed algorithm that re-analyzes a program incrementally, thereby reducing redundant computation.
- **Optimized distributed worklist algorithm**: By optimizing the data collection strategy, only the dataflow facts updated in the previous step are transferred and merged, which significantly reduces unnecessary data transfer and merge operations and improves scalability and performance.

In summary, this paper aims to solve the memory and computation bottlenecks encountered when performing inter-procedural dataflow analysis on large-scale programs by using distributed computing resources in a cloud environment, and it provides an efficient distributed framework, BigDataflow, to achieve this goal.
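
The idea of a vertex-centric worklist that forwards only newly updated facts can be illustrated with a small sketch. This is a single-process simulation under stated assumptions, not BigDataflow's implementation or API: the function name `vertex_centric_worklist`, the gen/kill transfer functions, the set-union merge, and the toy control-flow graph are all hypothetical choices made for illustration (a reaching-definitions style analysis). Each loop iteration plays the role of one superstep: active vertices merge their incoming messages, apply their transfer function, and send only the facts that are new since their last update.

```python
from collections import defaultdict


def vertex_centric_worklist(succs, gen, kill):
    """Simulate a superstep-based worklist for a forward gen/kill analysis.

    succs: dict mapping a CFG node to a list of its successor nodes.
    gen / kill: dicts mapping a node to the set of facts it generates / kills.
    Returns (in_facts, out_facts), each a dict from node to a set of facts.
    All names here are illustrative assumptions, not part of any real API.
    """
    nodes = set(succs) | {s for targets in succs.values() for s in targets}
    in_facts = {n: set() for n in nodes}
    out_facts = {n: set() for n in nodes}

    active = set(nodes)        # superstep 0: every vertex runs once
    inbox = defaultdict(set)   # messages delivered at the start of a superstep

    while active:
        outbox = defaultdict(set)  # messages buffered for the next superstep
        for n in active:
            # Merge step: union the incoming delta into the stored IN set.
            in_facts[n] |= inbox.get(n, set())
            # Transfer step: OUT = gen ∪ (IN − kill).
            new_out = gen.get(n, set()) | (in_facts[n] - kill.get(n, set()))
            # Only facts not already in OUT need to be propagated.
            delta = new_out - out_facts[n]
            if delta:
                out_facts[n] |= delta
                for s in succs.get(n, ()):
                    outbox[s] |= delta
        # A vertex is active in the next superstep only if it received messages.
        inbox = outbox
        active = set(outbox)

    return in_facts, out_facts


# Toy CFG: 1 (x = 0) -> 2 (loop head) -> 3 (x = x + 1) -> 2, and 2 -> 4 (exit).
cfg = {1: [2], 2: [3, 4], 3: [2]}
gen = {1: {"d1"}, 3: {"d2"}}
kill = {1: {"d2"}, 3: {"d1"}}

in_facts, out_facts = vertex_centric_worklist(cfg, gen, kill)
print(sorted(in_facts[4]))   # definitions of x that may reach the exit
print(sorted(out_facts[3]))  # definitions of x live after the loop body
```

Because the gen/kill transfer function is monotone and facts are only ever added, sending the delta rather than the full OUT set reaches the same fixed point while shrinking each message. In a real cluster the inbox and outbox would be network messages between partitions; here they are plain dictionaries.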

Scaling Inter-procedural Dataflow Analysis on the Cloud

BigDataflow: A Distributed Interprocedural Dataflow Analysis Framework

SCAN: A Smart Application Platform for Empowering Parallelizations of Big Genomic Data Analysis in Clouds

Towards Efficient Large-Scale Interprocedural Program Static Analysis on Distributed Data-Parallel Computation

BigSpa: An Efficient Interprocedural Static Analysis Engine in the Cloud

Bigflow: A General Optimization Layer for Distributed Computing Frameworks

Survey of Distributed Computing Frameworks for Supporting Big Data Analysis

Graph-Centric Performance Analysis for Large-Scale Parallel Applications

DStream: A Streaming-Based Highly Parallel IFDS Framework

Progressive online aggregation in a distributed stream system

Optimal and Perfectly Parallel Algorithms for On-demand Data-flow Analysis

Evaluation and Analysis of Distributed Graph-Parallel Processing Frameworks

A Policy of Task Allocation Base on Distributed Cluster Computing Towards Cloud

Identifying Scalability Bottlenecks for Large-Scale Parallel Programs with Graph Analysis

A Cost-Efficient Auto-Scaling Algorithm for Large-Scale Graph Processing in Cloud Environments with Heterogeneous Resources

NO2: Speeding Up Parallel Processing of Massive Compute-Intensive Tasks

Systemizing Interprocedural Static Analysis of Large-scale Systems Code with Graspan

Parameterized Algorithms for Scalable Interprocedural Data-flow Analysis

A System for Exploratory Analysis in Cloud

PipeFlow Engine: Pipeline Scheduling with Distributed Workflow Made Simple

Parallelization of Spherical Discontinuous Deformation Analysis (SDDA) for Geotechnical Problems Based on Cloud Computing Environment