The Emergence of Large Language Models in Static Analysis: A First Look through Micro-Benchmarks

Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Amir M. Mir, Sofia Reis, Eric Bodden
2024-02-28
Abstract: The application of Large Language Models (LLMs) in software engineering, particularly in static analysis tasks, represents a paradigm shift in the field. In this paper, we investigate the role that current LLMs can play in improving callgraph analysis and type inference for Python programs. Using the PyCG, HeaderGen, and TypeEvalPy micro-benchmarks, we evaluate 26 LLMs, including OpenAI's GPT series and open-source models such as LLaMA. Our study reveals that LLMs show promising results in type inference, demonstrating higher accuracy than traditional methods, yet they exhibit limitations in callgraph analysis. This contrast emphasizes the need for specialized fine-tuning of LLMs to better suit specific static analysis tasks. Our findings provide a foundation for further research towards integrating LLMs for static analysis tasks.
Software Engineering
What problem does this paper attempt to address?
This paper asks how effective large language models (LLMs) are at static analysis tasks in software engineering. It focuses on two tasks:

1. **Callgraph Analysis**: Callgraph analysis captures the calling relationships and interactions between the components of a program. The study uses micro-benchmarks to evaluate 26 LLMs, including OpenAI's GPT series and open-source models such as LLaMA, on callgraph analysis of Python programs.
2. **Type Inference**: Type inference helps identify potential type errors and improves code reliability. The study also uses micro-benchmarks to measure how accurately these LLMs infer types in Python programs.

Through these evaluations, the paper characterizes the strengths and limitations of current LLMs in static analysis tasks. The results show that LLMs perform well on type inference, with higher accuracy than traditional methods, but still fall short in callgraph analysis, which underlines the need for task-specific fine-tuning.
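To make the callgraph task concrete, here is a minimal sketch of what a micro-benchmark style check could look like. Everything in it is an assumption for illustration: the snippet, the `direct_calls` and `exact_match` helpers, and the scoring rule are hypothetical and are not taken from PyCG, HeaderGen, TypeEvalPy, or the paper's evaluation pipeline. It uses Python's standard `ast` module to collect the plain-name calls made inside each top-level function, then compares the result with a hand-written ground truth.

```python
import ast
from typing import Dict, Set

# Hypothetical test program, written for this example only.
SNIPPET = '''
def helper():
    return 1

def main():
    x = helper()
    return str(x)
'''

def direct_calls(source: str) -> Dict[str, Set[str]]:
    """Map each top-level function to the plain names it calls directly."""
    tree = ast.parse(source)
    graph: Dict[str, Set[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees: Set[str] = set()
            for sub in ast.walk(node):
                # Only calls of the form name(...) are handled; attribute
                # calls such as obj.method(...) are ignored in this sketch.
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)
            graph[node.name] = callees
    return graph

def exact_match(predicted: Dict[str, Set[str]],
                expected: Dict[str, Set[str]]) -> float:
    """Fraction of functions whose predicted callee set equals the expected one.

    This scoring rule is an assumption, not necessarily the paper's metric.
    """
    if not expected:
        return 0.0
    hits = sum(1 for name, callees in expected.items()
               if predicted.get(name) == callees)
    return hits / len(expected)

# Hand-written ground truth for the snippet above.
EXPECTED = {"helper": set(), "main": {"helper", "str"}}

graph = direct_calls(SNIPPET)
score = exact_match(graph, EXPECTED)
print(graph, score)
```

In an LLM evaluation, the `graph` produced here by a syntactic walk would instead come from parsing the model's answer, and the same comparison against the ground truth would apply. A purely syntactic approach like this one cannot resolve dynamic dispatch or higher-order calls, which is part of what makes callgraph analysis harder than it first appears.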