Abstract: Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and evaluated on various benchmark datasets, demonstrating promising performance. However, there is still uncertainty about the reliability of these models, particularly their realistic ability to consistently transform code sequences. This raises the question: are these techniques sufficiently trustworthy for automated program generation? Consequently, further research is needed to understand model logic and assess reliability and explainability. To bridge these research gaps, we conduct a thorough empirical study of eight popular language models on five representative datasets to determine the capabilities and limitations of automated program generation approaches. We further employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation. We discover that state-of-the-art approaches suffer from inappropriate performance evaluation stemming from severe data duplication, causing over-optimistic results. Our explainability analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences. Overall, more rigorous evaluation approaches and benchmarks are critical to enhance the reliability and explainability of automated program generation moving forward. Our findings provide important guidelines for this goal.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper explores the reliability and interpretability of pre-trained language models for automated program generation. Although existing research indicates that these models perform well on tasks such as code generation, repair, and translation, their reliability and interpretability in practical applications remain uncertain. Specifically, the paper attempts to answer the following key questions:

1. **How do language models perform in program generation tasks?**
   - Many pre-trained language models have been proposed and have shown promising capabilities on individual tasks, but systematic, extensive performance evaluation is lacking. The paper evaluates state-of-the-art models on code repair, review, translation, and generation.
2. **Are automated program generation methods reliable?**
   - This research question critically analyzes the experimental methods used to evaluate automated program generation models, looking for experimental biases that may affect performance evaluation. Specifically, the study investigates the representativeness and diversity of the training and testing datasets. Performance depends heavily on dataset quality ("garbage in, garbage out"), so extensive data duplication or a lack of diversity may distort the results. The paper aims to reveal limitations that may lead to overestimating or underestimating actual capabilities.
3. **Can we explain why automated program generation methods can (or cannot) reliably generate code sequences?**
   - Analyzing the generated code sequences alone is not enough to determine why language models are effective or ineffective, mainly because the basis on which these pre-trained models predict new code sequences is largely unknown. The paper therefore employs explainable AI methods to identify which tokens contribute to the generated code sequences. The researchers hope these exploratory experiments will provide practical insights for future research.

### Main Contributions

- **First Comprehensive Benchmark Study**: The paper conducts the first comprehensive benchmark study on the reliability and interpretability of pre-trained language models in program generation tasks.
- **Revealing Experimental Biases**: The analysis shows significant biases in the experiments of previous work, including dataset duplication and overlapping inputs, which exaggerate performance claims. The explainability analysis indicates that models ignore critical tokens and lack robustness, highlighting key challenges for actual deployment.
- **Guidance for Future Research**: The results provide insights to guide future research towards more rigorous and reliable neural program generation language models.

Through these questions and contributions, the paper provides guidance and a foundation for further research and development in automated program generation.
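
To make the data-duplication concern from the second research question concrete, the following is a minimal sketch of how train/test leakage could be measured. It is not the paper's procedure: the function names `normalize` and `duplication_rate`, the whitespace-only normalization, and the toy samples are all assumptions chosen for illustration. A real study would likely use a stricter or fuzzier notion of duplicate.

```python
import re
from typing import Sequence


def normalize(code: str) -> str:
    """Collapse all whitespace runs so reformatted copies compare equal.

    Hypothetical normalization: it ignores layout only, not identifier
    renaming or comment changes.
    """
    return re.sub(r"\s+", " ", code).strip()


def duplication_rate(train: Sequence[str], test: Sequence[str]) -> float:
    """Return the fraction of test samples that also occur in the training set."""
    if not test:
        return 0.0
    seen = {normalize(sample) for sample in train}
    leaked = sum(1 for sample in test if normalize(sample) in seen)
    return leaked / len(test)


# Toy data (invented for this sketch): the first test sample is a
# reformatted copy of a training sample, the second is new.
train = [
    "int add(int a, int b) { return a + b; }",
    "void f() {}",
]
test = [
    "int add(int a, int b) {\n    return a + b;\n}",
    "int sub(int a, int b) { return a - b; }",
]

rate = duplication_rate(train, test)
print(f"{rate:.0%} of test samples also occur in the training data")
```

A high rate under a check like this would suggest that reported scores partly reflect memorized samples rather than generalization, which is the kind of over-optimistic result the summary describes. Evaluating separately on the non-duplicated subset is one way to see how much performance depends on the overlap.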

On the Reliability and Explainability of Language Models for Program Generation
