LMs: Understanding Code Syntax and Semantics for Code Analysis

Wei Ma, Shangqing Liu, Zhihao Lin, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, Li Li, Yang Liu
2024-02-13
Abstract: Large language models (LLMs) demonstrate significant potential to revolutionize software engineering (SE) by exhibiting outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk control requirements in software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities of LLMs and their limitations for code analysis in SE. We break down the abilities needed for artificial intelligence (AI) models to address SE tasks related to code analysis into three categories: 1) syntax understanding, 2) static behavior understanding, and 3) dynamic behavior understanding. Our investigation focused on the ability of LLMs to comprehend code syntax and semantic structures, which include abstract syntax trees (AST), control flow graphs (CFG), and call graphs (CG). We employed four state-of-the-art foundational models, GPT4, GPT3.5, StarCoder and CodeLlama-13b-instruct. We assessed the performance of LLMs on cross-language tasks involving C, Java, Python, and Solidity. Our findings revealed that while LLMs have a talent for understanding code syntax, they struggle with comprehending code semantics, particularly dynamic semantics. We conclude that LLMs possess capabilities similar to an Abstract Syntax Tree (AST) parser, demonstrating initial competencies in static code analysis. Furthermore, our study highlights that LLMs are susceptible to hallucinations when interpreting code semantic structures and fabricating nonexistent facts. These results indicate the need to explore methods to verify the correctness of LLM output to ensure its dependability in SE. More importantly, our study provides an initial answer to why the codes generated by LLM are usually syntax-correct but vulnerable.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores how well large language models (LLMs) understand the syntax and semantics of code in the field of software engineering (SE). Despite the impressive performance of LLMs in tasks such as code generation and documentation, their lack of interpretability has raised concerns in the software engineering community. To address this, the authors conducted a study to evaluate the capabilities and limitations of LLMs in code analysis.

Specifically, the authors categorized the capabilities required for code analysis into three types:

1. **Syntax Understanding**: Including the understanding of abstract syntax trees (AST).
2. **Static Behavior Understanding**: Including the understanding of control flow graphs (CFG) and call graphs (CG).
3. **Dynamic Behavior Understanding**: Including reasoning about how code behaves at run time.

Through this study, the authors aim to answer the following questions:

- **RQ1**: Can LLMs understand code syntax well?
- **RQ2**: Can LLMs understand the static behavior of code?
- **RQ3**: Can LLMs understand the dynamic behavior of code?

### Research Methodology

To evaluate the capabilities of LLMs, the authors selected four state-of-the-art large language models: GPT4, GPT3.5, StarCoder, and CodeLlama-13b-instruct. They designed 9 code-related tasks and tested the models on 2,560 code samples. The tasks ranged from simple syntax understanding to complex dynamic behavior understanding.

### Key Findings

1. **Syntax Understanding**:
   - LLMs, especially GPT4, performed excellently in understanding code syntax. They were able to comprehend the syntactic roles within the code and could act as AST parsers.
2. **Static Behavior Understanding**:
   - LLMs demonstrated some capability in analyzing the static behavior of code, making them suitable as beginner-level static analysis tools. However, their performance in tasks such as data dependency analysis, taint analysis, and pointer analysis still needs improvement.
3. **Dynamic Behavior Understanding**:
   - LLMs showed limitations in approximating the dynamic behavior of code, resulting in poor performance in tasks such as equivalent mutant detection and flaky test reasoning. The authors attribute this mainly to issues with the pre-training data.

### Conclusion

Through this study, the authors revealed the strengths and weaknesses of LLMs in code analysis. These findings can help software developers make better use of large language models in software development, particularly in code analysis tasks. Overall, LLMs excel in understanding code syntax and some aspects of static behavior but still require further improvement in understanding dynamic behavior.
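
To make the structures named above concrete, here is a minimal sketch of what a conventional tool extracts for two of them: an abstract syntax tree and a simplified static call graph. It uses Python's standard `ast` module. The sample program and the helper names `node_type_counts` and `call_graph` are hypothetical illustrations and are not taken from the paper, its dataset, or its tooling.

```python
import ast

# Hypothetical sample program for illustration; not from the paper's dataset.
SAMPLE = '''
def square(x):
    return x * x

def sum_of_squares(values):
    total = 0
    for v in values:
        total += square(v)
    return total

def main():
    print(sum_of_squares([1, 2, 3]))
'''


def node_type_counts(source):
    """Count AST node types, a rough view of the syntactic roles in the code."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        name = type(node).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


def call_graph(source):
    """Map each top-level function to the plain names it calls directly.

    Only calls of the form f(...) are recorded. Method calls and dynamic
    dispatch are ignored, so this is a deliberately simplified call graph.
    """
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            graph[node.name] = sorted(callees)
    return graph


if __name__ == "__main__":
    print(node_type_counts(SAMPLE)["FunctionDef"])  # 3
    print(call_graph(SAMPLE))
    # {'square': [], 'sum_of_squares': ['square'], 'main': ['print', 'sum_of_squares']}
```

Output of this kind is deterministic and can serve as a reference when checking an LLM's answer to a syntax or call-graph question, which fits the abstract's point that LLM output needs verification. Dynamic behavior has no equally cheap reference, since it depends on actually executing the code.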