Abstract: Large Language Models (LLMs) show promise in generating code comments for novice programmers, but their educational effectiveness remains under-evaluated. This study assesses the instructional quality of code comments produced by GPT-4, GPT-3.5-Turbo, and Llama2, compared to expert-developed comments, focusing on their suitability for novices. Analyzing a dataset of "easy" level Java solutions from LeetCode, we find that GPT-4 exhibits comparable quality to expert comments in aspects critical for beginners, such as clarity, beginner-friendliness, concept elucidation, and step-by-step guidance. GPT-4 outperforms Llama2 in discussing complexity (chi-square = 11.40, p = 0.001) and is perceived as significantly more supportive for beginners than GPT-3.5 and Llama2 (Mann-Whitney U-statistics = 300.5 and 322.5, p = 0.0017 and 0.0003, respectively). This study highlights the potential of LLMs for generating code comments tailored to novice programmers.
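The Mann-Whitney U statistics quoted in the abstract compare two independent sets of ordinal ratings. The sketch below shows one way such a statistic can be computed from rank sums. The ratings are invented for illustration and are not the paper's data, the function names are hypothetical, and the source does not say which of the two U values (or which software) the authors reported. No p-value is computed here.

```python
def average_ranks(values):
    """Return 1-based ranks for values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples using rank sums."""
    combined = list(sample_a) + list(sample_b)
    ranks = average_ranks(combined)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical 1-5 "supportive for beginners" ratings for two comment sources.
ratings_model_a = [5, 4, 4, 3]
ratings_model_b = [3, 2, 2, 1]
u_a, u_b = mann_whitney_u(ratings_model_a, ratings_model_b)
print(u_a, u_b)
```

Under this convention, `u_a` counts the pairs in which a rating from the first sample exceeds one from the second, with ties counted as one half, so a value near `n_a * n_b` means the first source is rated higher almost everywhere.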

What problem does this paper attempt to address?

This paper evaluates the instructional quality of code comments generated by large language models (LLMs) for beginner programmers. Specifically, the researchers focus on the following aspects:

1. **Teaching Effect Evaluation**: The research compares the instructional quality of code comments generated by LLMs (GPT-4, GPT-3.5-Turbo, and Llama2) with that of comments written by experts. The focus is on whether these comments suit beginners, judged by clarity, beginner-friendliness, concept explanation, and step-by-step guidance.
2. **Educational Significance and Effectiveness**: Although previous research has demonstrated the feasibility of using LLMs to generate code comments, the educational value of these comments for beginner programmers has not been fully evaluated. This study explores the educational applicability and advantages of LLM-generated comments by analyzing a dataset of "easy"-level Java solutions from LeetCode.
3. **Matching the Needs of Beginners**: The research also evaluates how well code comments from different sources meet the learning needs of beginner programmers. By establishing and applying multiple evaluation criteria, the researchers aim to reveal which types of comments are more suitable for beginners and to provide a basis for improving LLM-generated comments.

### Research Methods

To achieve these goals, the research adopts the following methods:

- **Dataset Selection and Prompt Design**: Thirty "easy"-level Java solutions from LeetCode were selected as the dataset. The LLMs were prompted to generate comments for the given code snippets, simulating a beginner seeking an explanation of a correct solution.
- **Model Selection**: Three advanced LLMs (GPT-4, Llama2, and GPT-3.5-Turbo) were selected for the experiment because they perform well on similar tasks and have strong language understanding abilities.
- **Evaluation Criteria Development**: A comprehensive evaluation manual was developed based on eight key criteria to systematically evaluate the instructional quality of LLM-generated comments. These criteria include explaining the programming concepts used, helping to identify common coding errors among beginners, and providing sufficient detail to promote understanding.
- **Expert Evaluation**: Four experts, each with at least three years of programming and computer science education experience, conducted two rounds of blind evaluation of the LLM-generated comments to ensure the objectivity and reliability of the evaluation process.

### Main Findings

Through statistical analysis (such as the Kruskal-Wallis H-test and the chi-square test), the research draws the following conclusions:

- **GPT-4 Performs the Best**: On most evaluation criteria, GPT-4 performs close to or even better than human experts, especially in explaining complex concepts and supporting beginners. For example, in discussing complexity, GPT-4 is significantly better than Llama2 (chi-square = 11.40, p = 0.001).
- **Llama2 Performs Weakly**: Llama2 performs poorly in many aspects, especially in providing detailed comments and using simple vocabulary, and falls significantly behind the other models and human experts.
- **Beginner-Friendliness**: GPT-4 is considered the most beginner-friendly model. Its comments are clearer and easier to understand, which can effectively support beginners' learning.

### Conclusion

The research shows that advanced LLMs such as GPT-4 have great potential for generating high-quality teaching content and can sometimes even surpass human experts. However, the performance differences between LLMs also indicate the need for customized improvements for specific educational scenarios. Future research should expand the dataset and incorporate user feedback to verify these findings and optimize the application of LLMs.
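The chi-square comparison mentioned in the findings (for example, whether comments from two sources discuss complexity) can be illustrated with a test of independence on a 2x2 contingency table. This is a minimal sketch under stated assumptions: the counts are invented and are not the paper's data, `chi_square_2x2` is a hypothetical helper, and no continuity correction is applied, which may differ from what the authors used.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("expected count is zero; the test is undefined")
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows are two comment sources, columns are
# (comment discusses complexity, comment does not).
invented_counts = [[20, 10],
                   [10, 20]]
statistic, p_value = chi_square_2x2(invented_counts)
print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
```

A larger statistic means the observed counts deviate more from what independence between source and outcome would predict. Small expected counts make the chi-square approximation unreliable, in which case an exact test is usually preferred.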

Evaluating the Quality of Code Comments Generated by Large Language Models for Novice Programmers
