Evaluating the Quality of Code Comments Generated by Large Language Models for Novice Programmers

Aysa Xuemo Fan, Arun Balajiee Lekshmi Narayanan, Mohammad Hassany, Jiaze Ke
2024-09-22
Abstract: Large Language Models (LLMs) show promise in generating code comments for novice programmers, but their educational effectiveness remains under-evaluated. This study assesses the instructional quality of code comments produced by GPT-4, GPT-3.5-Turbo, and Llama2, compared to expert-developed comments, focusing on their suitability for novices. Analyzing a dataset of "easy" level Java solutions from LeetCode, we find that GPT-4 exhibits comparable quality to expert comments in aspects critical for beginners, such as clarity, beginner-friendliness, concept elucidation, and step-by-step guidance. GPT-4 outperforms Llama2 in discussing complexity (chi-square = 11.40, p = 0.001) and is perceived as significantly more supportive for beginners than GPT-3.5 and Llama2 (Mann-Whitney U-statistics = 300.5 and 322.5, p = 0.0017 and 0.0003, respectively). This study highlights the potential of LLMs for generating code comments tailored to novice programmers.
Software Engineering, Artificial Intelligence, Human-Computer Interaction
What problem does this paper attempt to address?
This paper evaluates the instructional quality of code comments generated by large language models (LLMs) for beginner programmers. Specifically, the researchers focus on the following aspects:

1. **Teaching Effect Evaluation**: The research compares the instructional quality of code comments generated by GPT-4, GPT-3.5-Turbo, and Llama2 with that of comments written by experts. The focus is on whether the comments suit beginners, judged by their clarity, beginner-friendliness, concept explanation, and step-by-step guidance.
2. **Educational Significance and Effectiveness**: Previous research has demonstrated that LLMs can generate code comments, but the educational value of those comments for beginner programmers has not been fully evaluated. This study examines the educational applicability and advantages of LLM-generated comments by analyzing a dataset of "easy"-level Java solutions from LeetCode.
3. **Matching the Needs of Beginners**: The research also evaluates how well comments from different sources meet the learning needs of beginner programmers. By establishing and applying multiple evaluation criteria, the researchers aim to show which comments suit beginners best and to provide a basis for improving LLM-generated comments.

### Research Methods

To achieve these goals, the research adopts the following methods:

- **Dataset Selection and Prompt Design**: Thirty "easy"-level Java solutions from LeetCode were selected as the dataset. The LLMs were asked to generate comments for the given code snippets, simulating a beginner who seeks an explanation of a correct solution.
- **Model Selection**: Three advanced LLMs (GPT-4, Llama2, and GPT-3.5-Turbo) were selected for the experiment because they perform well on similar tasks and have strong language understanding abilities.
- **Evaluation Criteria Development**: A comprehensive evaluation manual based on eight key criteria was developed to assess the instructional quality of the comments systematically. The criteria include explaining the programming concepts used, helping to identify coding errors common among beginners, and providing enough detail to support understanding.
- **Expert Evaluation**: Four experts, each with at least three years of experience in programming and computer science education, conducted two rounds of blind evaluation of the comments to keep the evaluation process objective and reliable.

### Main Findings

Statistical analysis (including the Kruskal-Wallis H test and the chi-square test) led to the following conclusions:

- **GPT-4 Performs the Best**: On most evaluation criteria, GPT-4 performs close to or even better than human experts, especially in explaining complex concepts and supporting beginners. For example, GPT-4 is significantly better than Llama2 at discussing complexity (chi-square = 11.40, p = 0.001).
- **Llama2 Performs Weakly**: Llama2 performs poorly in many aspects, especially in providing detailed comments and using simple vocabulary, and lags significantly behind the other models and human experts.
- **Beginner-Friendliness**: GPT-4 is rated the most beginner-friendly model. Its comments are clearer and easier to understand, which can effectively support beginners' learning.

### Conclusion

The research shows that advanced LLMs such as GPT-4 have great potential for generating high-quality teaching content and can sometimes even surpass human experts. However, the performance differences between LLMs also indicate the need for customized improvements for specific educational scenarios. Future research should expand the dataset and incorporate user feedback to verify these findings and optimize the application of LLMs.
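The two statistics quoted in the findings can be illustrated with a minimal Python sketch. It rests on assumptions the summary does not confirm: that the chi-square comparison is a 2x2 table (two models, comment discusses complexity or not), giving 1 degree of freedom with no continuity correction, and that the Mann-Whitney U statistic counts tied ratings as 0.5. All function names are hypothetical, and the counts and ratings in the usage section are invented for illustration; they are not the paper's data.

```python
import math
from itertools import product


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def chi_square_p_value_df1(statistic):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom.

    With 1 degree of freedom the statistic is a squared standard normal,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: each pair with a > b adds 1, each tie adds 0.5."""
    u = 0.0
    for a, b in product(sample_a, sample_b):
        if a > b:
            u += 1.0
        elif a == b:
            u += 0.5
    return u


if __name__ == "__main__":
    # Invented counts: rows are two models, columns are
    # "discusses complexity" / "does not discuss complexity".
    stat = chi_square_2x2(20, 10, 8, 22)
    print("chi-square:", stat, "p:", chi_square_p_value_df1(stat))

    # p-value for the reported statistic, ASSUMING 1 degree of freedom.
    print("p for 11.40:", chi_square_p_value_df1(11.40))

    # Invented ordinal ratings (e.g. a 1-5 scale) for two comment sources.
    ratings_a = [5, 4, 4, 5, 3]
    ratings_b = [3, 4, 2, 3, 3]
    print("U:", mann_whitney_u(ratings_a, ratings_b))
```

Under the 1-degree-of-freedom assumption, a statistic of 11.40 gives a p-value a little below 0.001, which agrees with the reported p = 0.001 after rounding. Counting ties as 0.5 is also why U can take half-integer values such as the reported 300.5 and 322.5 when ratings are ordinal.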