Abstract: Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming. In this work, we evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems. Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem. Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.

### What problem does this paper attempt to address?

This paper addresses the time-consuming and complex task of creating test cases for automatic grading systems. Specifically, it examines how large language models (LLMs) can automatically generate test cases for programming assignments, reducing the workload of teachers who prepare these test cases and improving the efficiency and accuracy of automatic grading.

#### Background and Problem Description

Automatic grading systems provide students with immediate feedback and significantly reduce the time teachers spend grading manually. However, creating a comprehensive set of test cases is time-consuming and complex. Insufficient test cases may produce misleading feedback and affect students' learning outcomes. As a result, some teachers may be reluctant to create additional problems or test cases, which limits students' opportunities to obtain timely feedback.

#### Research Objectives

The objective of this research is to evaluate the effectiveness of using LLMs to automatically generate test cases. Specifically, the authors aim to:

1. **Reduce teachers' workload**: Use LLMs to generate test cases automatically, reducing the time and effort teachers spend preparing them.
2. **Improve the quality of test cases**: Check that the generated test cases correctly identify most valid student solutions and are at least as comprehensive as the test cases written manually by teachers.
3. **Discover potential problems**: Use the generated test cases to reveal ambiguities in certain problem statements, so as to improve automatic grading and instructional design.

#### Method Overview

The researchers used a sample of 26 programming problems and more than 25,000 student-submitted solutions. They provided each problem's statement and reference solution to GPT-4 to generate test cases, then compared these automatically generated test cases with those written manually by teachers to evaluate their correctness and comprehensiveness.

#### Main Research Questions

1. **RQ1**: To what extent can the test cases generated by LLMs correctly identify valid solutions at the CS1 level?
2. **RQ2**: How comprehensive are the test cases generated by LLMs compared with those written by teachers?
3. **RQ3**: What types of ambiguities in problem statements can the test cases generated by LLMs reveal?

By answering these questions, the researchers hope to demonstrate the potential of LLMs for automatically generating test cases, thereby improving the overall quality of automatic grading systems and programming education.

### Summary

The main problem this paper addresses is that creating test cases for automatic grading systems in programming education is time-consuming and complex. By introducing LLMs, the researchers hope to simplify this process, improve the quality of test cases, and open up further possibilities for programming education.
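The comparison between instructor-written and LLM-generated test suites described above can be illustrated with a small sketch. This is not the paper's pipeline: the problem ("sum the positive numbers in a list"), the three student submissions, both lists of test inputs, and every function name below are hypothetical and chosen only for illustration. The GPT-4 generation step is replaced by a hand-written list of inputs, and expected outputs are derived from the reference solution.

```python
from typing import Callable, Dict, List, Tuple

# A test case pairs an argument tuple with the expected return value.
TestCase = Tuple[tuple, object]
Suite = List[TestCase]


def reference_solution(values):
    """Hypothetical reference solution: sum of the positive numbers."""
    return sum(v for v in values if v > 0)


def make_suite(reference: Callable, inputs: List[tuple]) -> Suite:
    """Derive expected outputs by running the reference solution."""
    return [(args, reference(*args)) for args in inputs]


def passes(solution: Callable, suite: Suite) -> bool:
    """True if the solution returns the expected value on every test case."""
    for args, expected in suite:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def compare_suites(submissions: List[Callable], instructor: Suite, llm: Suite) -> Dict[str, int]:
    """Count how the two suites agree or disagree across submissions."""
    counts = {"both_pass": 0, "both_fail": 0,
              "only_instructor_pass": 0, "only_llm_pass": 0}
    for sub in submissions:
        by_instructor, by_llm = passes(sub, instructor), passes(sub, llm)
        if by_instructor and by_llm:
            counts["both_pass"] += 1
        elif by_instructor:
            counts["only_instructor_pass"] += 1
        elif by_llm:
            counts["only_llm_pass"] += 1
        else:
            counts["both_fail"] += 1
    return counts


# Three hypothetical student submissions.
def student_correct(values):
    total = 0
    for v in values:
        if v > 0:
            total += v
    return total


def student_ignores_sign(values):
    return sum(values)  # wrong whenever a negative number is present


def student_crashes_on_empty(values):
    total = values[0] if values[0] > 0 else 0  # IndexError on []
    for v in values[1:]:
        if v > 0:
            total += v
    return total


# Hypothetical inputs; the second list stands in for LLM-generated tests.
instructor_suite = make_suite(reference_solution, [([1, 2, 3],), ([-1, 2],)])
llm_suite = make_suite(reference_solution,
                       [([1, 2, 3],), ([-1, 2],), ([],), ([-5, -6],)])

submissions = [student_correct, student_ignores_sign, student_crashes_on_empty]
counts = compare_suites(submissions, instructor_suite, llm_suite)
print(counts)
```

In this toy setup, a submission counted under `only_instructor_pass` is one that the second suite rejects but the instructor suite accepts. Such a disagreement could mean the second suite is more comprehensive, that it contains an incorrect test, or that the problem statement is ambiguous (here, whether an empty list is a valid input), so each case would need manual inspection.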

Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming
