Abstract: Large language models (LLMs) demonstrate significant potential to revolutionize software engineering (SE) by exhibiting outstanding performance in SE tasks such as code and document generation. However, the high reliability and risk control requirements in software engineering raise concerns about the lack of interpretability of LLMs. To address this concern, we conducted a study to evaluate the capabilities of LLMs and their limitations for code analysis in SE. We break down the abilities needed for artificial intelligence (AI) models to address SE tasks related to code analysis into three categories: 1) syntax understanding, 2) static behavior understanding, and 3) dynamic behavior understanding. Our investigation focused on the ability of LLMs to comprehend code syntax and semantic structures, which include abstract syntax trees (AST), control flow graphs (CFG), and call graphs (CG). We employed four state-of-the-art foundational models, GPT4, GPT3.5, StarCoder and CodeLlama-13b-instruct. We assessed the performance of LLMs on cross-language tasks involving C, Java, Python, and Solidity. Our findings revealed that while LLMs have a talent for understanding code syntax, they struggle with comprehending code semantics, particularly dynamic semantics. We conclude that LLMs possess capabilities similar to an Abstract Syntax Tree (AST) parser, demonstrating initial competencies in static code analysis. Furthermore, our study highlights that LLMs are susceptible to hallucinations when interpreting code semantic structures and fabricating nonexistent facts. These results indicate the need to explore methods to verify the correctness of LLM output to ensure its dependability in SE. More importantly, our study provides an initial answer to why the codes generated by LLM are usually syntax-correct but vulnerable.
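To make the structures named in the abstract concrete, here is a minimal sketch of what a conventional tool extracts for the syntax level (AST) and one static-behavior level (call graph). It uses Python's standard `ast` module. The sample program and the helper names `node_types` and `call_graph` are illustrative assumptions, not the paper's benchmark or tooling.

```python
import ast

# Hypothetical sample program, used only for illustration.
SOURCE = '''
def helper(x):
    return x + 1

def main(values):
    total = 0
    for v in values:
        if v > 0:
            total += helper(v)
    return total
'''


def node_types(source):
    """Syntax level: list the AST node type of every node in the program."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def call_graph(source):
    """Static level: map each function to the plain names it calls.

    Simplification: only direct calls such as `helper(v)` are recorded;
    method calls and calls through variables are ignored.
    """
    tree = ast.parse(source)
    graph = {}
    for func in ast.walk(tree):
        if isinstance(func, ast.FunctionDef):
            callees = set()
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callees.add(node.func.id)
            graph[func.name] = sorted(callees)
    return graph


if __name__ == "__main__":
    print(sorted(set(node_types(SOURCE))))
    print(call_graph(SOURCE))
```

A parser produces these structures deterministically, which is what makes them usable as ground truth when checking whether a model's answer about the same code is correct or hallucinated.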

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores how well large language models (LLMs) understand the syntax and semantics of code in software engineering (SE). Despite the impressive performance of LLMs in tasks such as code generation and documentation, their lack of interpretability has raised concerns in the software engineering community. To address this, the authors conducted a study to evaluate the capabilities and limitations of LLMs in code analysis. Specifically, the authors categorized the capabilities required for code analysis into three types:

1. **Syntax Understanding**: Understanding of abstract syntax trees (AST).
2. **Static Behavior Understanding**: Understanding of control flow graphs (CFG) and call graphs (CG).
3. **Dynamic Behavior Understanding**: Understanding of how code behaves when executed.

Through this study, the authors aim to answer the following questions:

- **RQ1**: Can LLMs understand code syntax well?
- **RQ2**: Can LLMs understand the static behavior of code?
- **RQ3**: Can LLMs understand the dynamic behavior of code?

### Research Methodology

To evaluate the capabilities of LLMs, the authors selected four state-of-the-art large language models: GPT4, GPT3.5, StarCoder, and CodeLlama-13b-instruct. They designed 9 code-related tasks and ran them on 2,560 code samples. The tasks ranged from simple syntax understanding to complex dynamic behavior understanding.

### Key Findings

1. **Syntax Understanding**:
   - LLMs, especially GPT4, performed excellently in understanding code syntax. They were able to identify the syntactic roles of code elements and could act as AST parsers.
2. **Static Behavior Understanding**:
   - LLMs demonstrated some capability in analyzing the static behavior of code, making them suitable as beginner-level static analysis tools. However, their performance in tasks such as data dependency analysis, taint analysis, and pointer analysis still needs improvement.
3. **Dynamic Behavior Understanding**:
   - LLMs showed limitations in approximating the dynamic behavior of code, resulting in poor performance in tasks such as equivalent mutant detection and flaky test reasoning. The authors attribute this mainly to issues with the pre-training data.

### Conclusion

The study reveals the strengths and weaknesses of LLMs in code analysis. These findings can help software developers make better use of large language models in software development, particularly in code analysis tasks. Overall, LLMs excel in understanding code syntax and some aspects of static behavior but still require further improvement in understanding dynamic behavior.
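The equivalent-mutation task mentioned under dynamic behavior asks whether a small code change (a mutant) alters what the program computes. The sketch below shows why this requires reasoning about behavior, not only syntax. The functions and the brute-force checker `distinguishing_input` are hypothetical illustrations, assuming integer inputs, and are not taken from the paper's benchmark.

```python
from itertools import product


def max_of(a, b):
    """Original program: return the larger of two integers."""
    return a if a > b else b


def mutant_equivalent(a, b):
    """Mutant: `>` replaced by `>=`.

    For integers this is equivalent: the branches differ only when
    a == b, and then both branches return the same value.
    """
    return a if a >= b else b


def mutant_killable(a, b):
    """Mutant: `>` replaced by `<`. This one changes behavior."""
    return a if a < b else b


def distinguishing_input(original, mutant, domain):
    """Return the first (a, b) on which the two functions disagree, else None.

    Finding such an input proves non-equivalence. Finding none over a
    finite domain is only evidence of equivalence, not a proof.
    """
    for a, b in product(domain, repeat=2):
        if original(a, b) != mutant(a, b):
            return (a, b)
    return None


if __name__ == "__main__":
    domain = range(-3, 4)
    print(distinguishing_input(max_of, mutant_equivalent, domain))
    print(distinguishing_input(max_of, mutant_killable, domain))
```

Both mutants differ from the original by a single operator, so they look equally similar at the syntax level. Telling them apart depends on reasoning about what happens at run time, which is the capability the study found weakest.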

LLMs: Understanding Code Syntax and Semantics for Code Analysis

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

An Empirical Study on Capability of Large Language Models in Understanding Code Semantics

If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents

An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors

S3LLM: Large-Scale Scientific Software Understanding with LLMs using Source, Metadata, and Document

The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

Large Language Models as Code Executors: An Exploratory Study

A Survey on Large Language Models for Software Engineering

A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends

A Critical Study of What Code-LLMs (Do Not) Learn

Source Code Summarization in the Era of Large Language Models

Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey

SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

How Far Have We Gone in Binary Code Understanding Using Large Language Models

Can LLM Replace Stack Overflow? A Study on Robustness and Reliability of Large Language Model Code Generation

Comparing Code Explanations Created by Students and Large Language Models

AI-powered Code Review with LLMs: Early Results

Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar

A Survey on Large Language Models for Code Generation