Can Large Language Models Transform Natural Language Intent into Formal Method Postconditions?

Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, Shuvendu K. Lahiri

2024-04-16

Abstract: Informal natural language that describes code functionality, such as code comments or function documentation, may contain substantial information about a program's intent. However, there is typically no guarantee that a program's implementation and natural language documentation are aligned. In the case of a conflict, leveraging information in code-adjacent natural language has the potential to enhance fault localization, debugging, and code trustworthiness. In practice, however, this information is often underutilized due to the inherent ambiguity of natural language, which makes natural language intent challenging to check programmatically. The emergent abilities of Large Language Models (LLMs) have the potential to facilitate the translation of natural language intent to programmatically checkable assertions. However, it is unclear if LLMs can correctly translate informal natural language specifications into formal specifications that match programmer intent. Additionally, it is unclear if such translation could be useful in practice. In this paper, we describe nl2postcond, the problem of leveraging LLMs for transforming informal natural language to formal method postconditions, expressed as program assertions. We introduce and validate metrics to measure and compare different nl2postcond approaches, using the correctness and discriminative power of generated postconditions. We then use qualitative and quantitative methods to assess the quality of nl2postcond postconditions, finding that they are generally correct and able to discriminate incorrect code. Finally, we find that nl2postcond via LLMs has the potential to be helpful in practice; nl2postcond-generated postconditions were able to catch 64 real-world historical bugs from Defects4J.
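To make "postconditions expressed as program assertions" concrete, here is a minimal sketch. It is not taken from the paper: the function `sorted_unique`, its docstring, and the `postcondition` helper are all hypothetical, chosen only to show how a natural language description might be turned into a checkable assertion over a method's input and return value.

```python
def sorted_unique(numbers):
    """Return the distinct elements of `numbers` in ascending order."""
    return sorted(set(numbers))


def postcondition(numbers, return_value):
    """Hypothetical formalization of the docstring above as assertions."""
    # The output is strictly ascending, which also rules out duplicates.
    assert all(a < b for a, b in zip(return_value, return_value[1:]))
    # The output contains exactly the values that appear in the input.
    assert set(return_value) == set(numbers)


# The assertions pass silently when the implementation matches the intent.
example_input = [3, 1, 3, 2]
result = sorted_unique(example_input)
postcondition(example_input, result)
```

An implementation that left duplicates in the output, or returned them in descending order, would trip the first assertion, which is the sense in which a postcondition can flag a mismatch between documentation and code.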

Software Engineering, Artificial Intelligence, Programming Languages

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores whether large language models (LLMs) can translate functional intent described in natural language into formal program postconditions (i.e., assertions). Specifically, the paper addresses the following questions:

1. **Can the postconditions generated by LLMs accurately formalize informal natural language intent?** This question is addressed by evaluating the quality of the generated postconditions, which should correctly reflect the programmer's intent.
2. **Can the postconditions generated by LLMs help detect real-world software defects?** This is evaluated through empirical studies on benchmarks in multiple programming languages, assessing whether these postconditions can catch real-world bugs.

The paper defines automated metrics to evaluate the correctness and completeness of LLM-generated postconditions, and it compares the performance of different LLMs and the effects of different prompt variants. It also uses LLMs to generate code mutants in order to assess the completeness of the specifications. Ultimately, the paper finds that, given suitable natural language descriptions, LLMs can generate correct postconditions with high discriminative power, and that these postconditions can catch historical defects in real-world projects.
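The two metrics mentioned above can be illustrated with a short sketch. This is one plausible reading, not the paper's exact definitions: here a postcondition counts as correct if it holds for a reference implementation on every test input, and its discriminative power is the fraction of buggy mutants it rejects on at least one input. The helper names, the `abs`-style example, and the hand-written mutants are all assumptions made for illustration.

```python
from typing import Callable, Sequence

Impl = Callable[[int], int]
Post = Callable[[int, int], bool]


def holds(post: Post, impl: Impl, inputs: Sequence[int]) -> bool:
    """True if the postcondition accepts impl's output on every input."""
    for x in inputs:
        try:
            if not post(x, impl(x)):
                return False
        except Exception:
            # A crash in the implementation or the check counts as a failure.
            return False
    return True


def is_correct(post: Post, reference: Impl, inputs: Sequence[int]) -> bool:
    """Assumed correctness check: the reference implementation is never rejected."""
    return holds(post, reference, inputs)


def discriminative_power(post: Post, mutants: Sequence[Impl],
                         inputs: Sequence[int]) -> float:
    """Assumed completeness proxy: fraction of mutants the postcondition rejects."""
    if not mutants:
        return 0.0
    killed = sum(1 for mutant in mutants if not holds(post, mutant, inputs))
    return killed / len(mutants)


# Hypothetical task: "return the absolute value of x".
reference = abs
mutants = [lambda x: x, lambda x: -x, lambda x: abs(x) + 1]
inputs = [-2, 0, 3]

# A weak postcondition only checks the sign of the result.
weak = lambda x, r: r >= 0
# A stronger one also ties the result back to the input.
strong = lambda x, r: r >= 0 and r in (x, -x)

weak_score = discriminative_power(weak, mutants, inputs)
strong_score = discriminative_power(strong, mutants, inputs)
```

Both postconditions are correct for the reference implementation, but the weak one misses the off-by-one mutant while the strong one rejects all three, so the strong postcondition scores higher on discriminative power.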

Large Language Models for Code Analysis: Do LLMs Really Do Their Job?

Learning from Failures: Translation of Natural Language Requirements into Linear Temporal Logic with Large Language Models

Towards Large Language Model Aided Program Refinement

Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages

nl2spec: Interactively Translating Unstructured Natural Language to Temporal Logics with Large Language Models

Beyond Code Generation: Assessing Code LLM Maturity with Postconditions

Exploring Automated Assertion Generation Via Large Language Models

Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis

Impact of Large Language Models on Generating Software Specifications

Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis

Automated Theorem Provers Help Improve Large Language Model Reasoning

LMs: Understanding Code Syntax and Semantics for Code Analysis

LLM4VV: Exploring LLM-as-a-Judge for Validation and Verification Testsuites

Are We Testing or Being Tested? Exploring the Practical Applications of Large Language Models in Software Testing

An Exploratory Study on Using Large Language Models for Mutation Testing

Logical Consistency of Large Language Models in Fact-checking

LLM2: Let Large Language Models Harness System 2 Reasoning

Large Language Models Should Ask Clarifying Questions to Increase Confidence in Generated Code

On the Effectiveness of LLMs for Manual Test Verifications

A Deep Dive into Large Language Model Code Generation Mistakes: What and Why?