Program Analysis via Multiple Context Free Language Reachability

Giovanna Kobus Conrado, Adam Husted Kjelstrøm, Andreas Pavlogiannis, Jaco van de Pol
2024-11-10
Abstract: Context-free language (CFL) reachability is a standard approach in static analyses, where the analysis question is phrased as a language reachability problem on a graph $G$ with respect to a CFL $L$. While CFLs lack the expressiveness needed for high precision, common formalisms for context-sensitive languages are such that the corresponding reachability problem is undecidable. Are there useful context-sensitive language-reachability models for static analysis? In this paper, we introduce Multiple Context-Free Language (MCFL) reachability as an expressive yet tractable model for static program analysis. MCFLs form an infinite hierarchy of mildly context-sensitive languages parameterized by a dimension $d$ and a rank $r$. We show the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem. We show that MCFL reachability can be computed in $O(n^{2d+1})$ time on a graph of $n$ nodes when $r=1$, and in $O(n^{d(r+1)})$ time when $r>1$. Moreover, we show that when $r=1$, the membership problem has a lower bound of $n^{2d}$ based on the Strong Exponential Time Hypothesis, while reachability for $d=1$ has a lower bound of $n^{3}$ based on the combinatorial Boolean Matrix Multiplication Hypothesis. Thus, for $r=1$, our algorithm is optimal within a factor $n$ for all levels of the hierarchy based on $d$. We implement our MCFL reachability algorithm and evaluate it by underapproximating interleaved Dyck reachability for a standard taint analysis for Android. Used alongside existing overapproximate methods, MCFL reachability discovers all tainted information on 8 out of 11 benchmarks, and confirms $94.3\%$ of the reachable pairs reported by the overapproximation on the remaining 3. To our knowledge, this is the first report of high and provable coverage for this challenging benchmark set.
Programming Languages, Computational Complexity, Formal Languages and Automata Theory
What problem does this paper attempt to address?
The problem this paper addresses is: **How can static analysis gain precision in context sensitivity and field sensitivity while remaining tractable?** Reachability analysis based on context-free languages (CFLs) is a standard approach, but CFLs lack the expressiveness needed for high precision, while for common context-sensitive formalisms the corresponding reachability problem is undecidable. This raises the question: is there a useful context-sensitive language-reachability model for static analysis?

To answer it, the authors introduce **Multiple Context-Free Language (MCFL) reachability** as a static program analysis model that is both expressive and tractable. MCFLs are parameterized by a dimension \(d\) and a rank \(r\), and form an infinite hierarchy of mildly context-sensitive languages. As \(d\) and \(r\) increase, the expressive power of MCFLs increases, which gives controllable analysis precision.

### Specific Problem Description

1. **Limitations of Traditional Methods**:
   - **Context-Free Languages (CFLs)**: A standard basis for reachability analyses, but not expressive enough for high precision.
   - **Context-Sensitive Languages**: For common formalisms, the corresponding reachability problem is undecidable.
2. **Research Objectives**:
   - Find a natural, efficient (polynomial-time), and practically accurate context-sensitive approximation of interleaved Dyck reachability, a common but undecidable static analysis problem.
3. **Proposed Solutions**:
   - Introduce **Multiple Context-Free Languages (MCFLs)** as a language model for static analysis.
   - Design a family of MCFLs that approximate interleaved Dyck reachability, and show that it achieves high coverage in practice.
   - Develop a general MCFL reachability algorithm and prove fine-grained complexity lower bounds.
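To make the target problem concrete, the following minimal Python sketch checks word-level membership in an interleaved Dyck language over two bracket alphabets. It assumes the usual definition: a word belongs to the interleaved language when its projection onto each bracket alphabet is balanced on its own. The names `CALLS`, `FIELDS`, `project`, `is_dyck`, and `is_interleaved_dyck` are illustrative and do not come from the paper. The sketch only tests a single word; the undecidable problem is reachability, which asks whether some path in a graph spells such a word.

```python
# Illustrative assumption: parentheses model call/return matching (context
# sensitivity) and square brackets model field write/read matching (field
# sensitivity). These alphabets are hypothetical, chosen for the example.
CALLS = {"(": ")"}
FIELDS = {"[": "]"}


def project(word, pairs):
    """Keep only the symbols of `word` that belong to the alphabet of `pairs`."""
    alphabet = set(pairs) | set(pairs.values())
    return [symbol for symbol in word if symbol in alphabet]


def is_dyck(word, pairs):
    """Return True if `word` is balanced with respect to `pairs` (open -> close).

    Any symbol that is not an opening bracket must match the closing bracket
    expected on top of the stack, so symbols outside the alphabet are rejected.
    """
    stack = []
    for symbol in word:
        if symbol in pairs:
            stack.append(pairs[symbol])
        elif not stack or stack.pop() != symbol:
            return False
    return not stack


def is_interleaved_dyck(word):
    """Balanced on calls and balanced on fields, checked independently."""
    return (is_dyck(project(word, CALLS), CALLS)
            and is_dyck(project(word, FIELDS), FIELDS))


if __name__ == "__main__":
    both = {**CALLS, **FIELDS}
    # "([)]" interleaves the two alphabets: each projection is balanced,
    # but the word is not balanced as a single Dyck word over both alphabets.
    print(is_interleaved_dyck("([)]"), is_dyck("([)]", both))
```

The word `([)]` shows why a single Dyck language over both alphabets is too strict: the two bracket kinds may cross freely as long as each kind is balanced by itself.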
### Key Contributions

1. **MCFL Reachability as a Program Model**:
   - Propose MCFL reachability as an expressive and tractable context-sensitive formalism.
   - The dimension \(d\) and rank \(r\) define an infinite hierarchy of models with gradually increasing expressive power and analysis precision.
2. **MCFL Reachability Algorithm**:
   - Develop a general algorithm for the \(d\)-MCFL(\(r\)) reachability problem.
   - Prove the time complexity of the algorithm:
     - When \(r = 1\), the time is \(O(\text{poly}(|G|)\cdot\delta\cdot n^{2d})\), where \(\delta\) is the maximum degree of the graph.
     - When \(r > 1\), the time is \(O(\text{poly}(|G|)\cdot n^{d(r + 1)})\).
3. **Complexity Lower Bound**:
   - Based on fine-grained complexity theory, prove lower bounds for the MCFL reachability and membership problems.
   - For example, for \(r = 1\) the membership problem has a lower bound of \(n^{2d}\) under the Strong Exponential Time Hypothesis, so the algorithm is optimal within a factor of \(n\) at every level \(d\) of the hierarchy.
4. **Experimental Evaluation**:
   - Implement the MCFL reachability algorithm and evaluate it by underapproximating interleaved Dyck reachability for a standard taint analysis for Android.
   - Used alongside existing overapproximate methods, MCFL reachability discovers all tainted information on 8 out of 11 benchmarks and confirms 94.3% of the reachable pairs reported by the overapproximation on the remaining 3.

Through these contributions, the paper shows the potential of MCFL reachability in static analysis, in particular as a more precise way to handle context sensitivity and field sensitivity together.
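As a rough illustration of how the cost grows across the hierarchy, the sketch below computes the exponent of \(n\) in the running-time bounds quoted in the abstract (\(O(n^{2d+1})\) for \(r = 1\) and \(O(n^{d(r+1)})\) for \(r > 1\)), together with the \(n^{2d}\) membership lower bound for \(r = 1\). The function names are hypothetical helpers, and factors that depend on the grammar are ignored.

```python
def upper_bound_exponent(d: int, r: int) -> int:
    """Exponent of n in the reachability running time stated in the abstract.

    Hypothetical helper: r == 1 gives 2d + 1, r > 1 gives d * (r + 1).
    """
    if d < 1 or r < 1:
        raise ValueError("dimension d and rank r must both be at least 1")
    return 2 * d + 1 if r == 1 else d * (r + 1)


def membership_lower_bound_exponent(d: int) -> int:
    """Exponent of n in the lower bound for membership when r == 1 (2d)."""
    if d < 1:
        raise ValueError("dimension d must be at least 1")
    return 2 * d


if __name__ == "__main__":
    for d, r in [(1, 1), (2, 1), (3, 1), (2, 2), (3, 2)]:
        print(f"d={d}, r={r}: O(n^{upper_bound_exponent(d, r)})")
    for d in (1, 2, 3):
        gap = upper_bound_exponent(d, 1) - membership_lower_bound_exponent(d)
        print(f"d={d}, r=1: upper and lower exponents differ by {gap}")
```

For \(r = 1\) the gap between the two exponents is always 1, which is the "within a factor \(n\)" optimality claim made in the abstract.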