Enhancing Trust in LLM-Generated Code Summaries with Calibrated Confidence Scores

Yuvraj Virk, Premkumar Devanbu, Toufique Ahmed
2024-04-30
Abstract: A good summary can often be very useful during program comprehension. While a brief, fluent, and relevant summary can be helpful, it does require significant human effort to produce. Often, good summaries are unavailable in software projects, thus making maintenance more difficult. There has been a considerable body of research into automated AI-based methods, using Large Language Models (LLMs), to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.
Software Engineering, Computation and Language
What problem does this paper attempt to address?
This paper asks how to provide a reliable confidence score for code summaries generated by large language models (LLMs), so that the score indicates whether a summary is close enough to a human-written one. Specifically, the researchers are concerned with:

1. **Quality of generated code summaries**: Although LLMs perform well at generating code summaries, they sometimes produce summaries that are not similar to human-written ones. A method is therefore needed to evaluate whether an LLM-generated summary is close enough to a human-written summary.
2. **Calibration of confidence scores**: Given a code summary generated by an LLM, can a confidence score be calculated that accurately reflects the similarity between this summary and a human-written summary? The researchers hope that, through calibration techniques, LLMs can provide reliable confidence scores, thereby helping developers make better use of automatically generated summaries.

### Research background and motivation

- **Importance of code summaries**: Code summaries are important aids to program comprehension and help maintainers understand what code does more quickly. However, writing high-quality code summaries requires a great deal of human effort, so automated generation of code summaries has received extensive research attention.
- **Limitations of existing evaluation metrics**: Existing evaluation metrics (such as BLEU and ROUGE) mainly measure the lexical similarity between machine-generated and human-written summaries, but for code summaries these metrics do not always reflect human judgments of summary quality well. The researchers therefore turn to embedding-based similarity measures (such as SentenceBERT), which correlate more strongly with human evaluations.
- **Necessity of confidence scores**: Since the summaries generated by LLMs are not always accurate, a reliable confidence score can help developers decide when they can trust a summary and when it needs further inspection.

### Research methods

- **Dataset**: The researchers used the Java and Python code summarization datasets in the CodeXGLUE benchmark and randomly selected 5,000 samples from them as the test set.
- **Models**: Three models were used in the experiments: GPT-3.5-Turbo, Code-Llama-70b, and DeepSeek-Coder-33b Instruct.
- **Prompt methods**: To improve model performance, the researchers adopted two prompting methods:
  - **Retrieval-enhanced few-shot learning**: Relevant samples are retrieved from the dataset with the BM25 algorithm and used as the few-shot examples.
  - **Automatic Semantic Augmentation of Prompts (ASAP)**: Intermediate steps are extracted through static analysis to further enhance model performance.
- **Evaluation methods**: The researchers used multiple similarity measures (such as SentenceBERT-CS, InferSent-CS, and BERTScore-CS) to evaluate the similarity between generated and human-written summaries, and defined "correctness" by setting different thresholds on these measures.

### Research questions

1. **Can the confidence scores of LLMs effectively predict the quality of generated summaries?**
2. **Can rescaling techniques improve the calibration performance of LLMs?**
3. **Can the confidence scores of the reflexive method (i.e., model self-evaluation) effectively predict the correctness of generated summaries?**

Through the study of these questions, the authors hope to provide developers with a reliable method to evaluate and use code summaries generated by LLMs.
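
To make the calibration setup above more concrete, here is a minimal Python sketch of the three ingredients it names: thresholding a similarity score to define "correctness", measuring how well confidence tracks correctness, and rescaling the confidences. The specific metrics (Brier score, expected calibration error), the Platt-style logistic rescaling, and every function name, threshold, and number below are illustrative assumptions, not details confirmed by this summary.

```python
import math


def correctness_labels(similarities, threshold):
    """Label a summary 1 ("correct") when its similarity to the human
    reference meets the threshold, else 0. The threshold is an assumption."""
    return [1 if s >= threshold else 0 for s in similarities]


def brier_score(confidences, labels):
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


def expected_calibration_error(confidences, labels, n_bins=10):
    """Bin samples by confidence, then take the size-weighted average of
    |mean confidence - observed accuracy| over the non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += (len(b) / len(labels)) * abs(avg_conf - accuracy)
    return ece


def fit_platt_scaling(confidences, labels, lr=0.5, steps=2000):
    """Fit p = sigmoid(a * c + b) by plain gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(labels)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for c, y in zip(confidences, labels):
            p = 1.0 / (1.0 + math.exp(-(a * c + b)))
            grad_a += (p - y) * c
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt_scaling(confidences, a, b):
    """Map raw confidences through the fitted sigmoid."""
    return [1.0 / (1.0 + math.exp(-(a * c + b))) for c in confidences]


if __name__ == "__main__":
    # Toy, made-up numbers: an overconfident model on eight summaries.
    sims = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.85, 0.35]
    conf = [0.95, 0.9, 0.92, 0.85, 0.9, 0.8, 0.97, 0.88]
    y = correctness_labels(sims, threshold=0.6)
    a, b = fit_platt_scaling(conf, y)
    scaled = apply_platt_scaling(conf, a, b)
    print("raw Brier:", brier_score(conf, y), "raw ECE:", expected_calibration_error(conf, y))
    print("scaled Brier:", brier_score(scaled, y))
```

In practice the rescaling parameters would be fitted on a held-out split rather than on the data being evaluated; the demo fits in-sample only to stay short.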
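
The retrieval-enhanced few-shot step described under "Prompt methods" can be sketched in the same spirit. The snippet below implements a common BM25 variant (Okapi-style term weighting with a smoothed IDF) over a small pool of (code, summary) pairs and assembles the top matches into a prompt. The whitespace tokenizer, the `k1` and `b` defaults, the prompt template, and all names are hypothetical choices for illustration; the summary does not describe the authors' actual retrieval setup.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude whitespace tokenizer; a real setup would likely split code more carefully.
    return text.lower().split()


class BM25Index:
    """Minimal BM25 scorer over a list of documents (hypothetical helper)."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.docs = [tokenize(d) for d in documents]
        self.k1, self.b = k1, b
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        self.term_freqs = [Counter(d) for d in self.docs]
        n = len(self.docs)
        doc_freq = Counter(t for d in self.docs for t in set(d))
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                    for t, df in doc_freq.items()}

    def score(self, query, i):
        tf = self.term_freqs[i]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * len(self.docs[i]) / self.avg_len)
        total = 0.0
        for t in tokenize(query):
            if t in tf:
                total += self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
        return total

    def top_k(self, query, k):
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]


def build_few_shot_prompt(query_code, pool, k=2):
    """Retrieve the k most similar (code, summary) pairs and format a prompt."""
    index = BM25Index([code for code, _ in pool])
    shots = [pool[i] for i in index.top_k(query_code, k)]
    parts = [f"Code:\n{code}\nSummary: {summary}\n" for code, summary in shots]
    parts.append(f"Code:\n{query_code}\nSummary:")
    return "\n".join(parts)


if __name__ == "__main__":
    pool = [
        ("def add(a, b): return a + b", "Add two numbers."),
        ("def read_file(path): return open(path).read()", "Read a file into a string."),
        ("def sub(a, b): return a - b", "Subtract b from a."),
    ]
    print(build_few_shot_prompt("def add3(a, b, c): return a + b + c", pool))
```

Ranking by lexical overlap is what makes BM25 cheap enough to run over a whole training set per query; an embedding-based retriever would be a different technique with different trade-offs.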