Abstract: There is a great need for data in computing education research. Data is needed to understand how students behave, to train models of student behavior to optimally support students, and to develop and validate new assessment tools and learning analytics techniques. However, relatively few computing education datasets are shared openly, often due to privacy regulations and issues in making sure the data is anonymous. Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data, which can be used to explore various aspects of student learning, develop and test educational technologies, and support research in areas where collecting real student data may be challenging or impractical. This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o. We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data. Our findings suggest that LLMs can be used to generate synthetic incorrect submissions that are not significantly different from real student data with regard to test case failure distributions. Our research contributes to the development of reliable synthetic datasets for computing education research and teaching, potentially accelerating progress in the field while preserving student privacy.

What problem does this paper attempt to address?

This paper addresses data scarcity in computing education research, specifically by generating synthetic buggy code submissions that can be used for research and teaching. In detail:

1. **Data requirements and privacy issues**:
   - Computing education research needs large amounts of data to understand student behavior, to train models of student behavior for optimal support, and to develop and validate new assessment tools and learning analytics techniques.
   - However, openly shared computing education datasets are relatively scarce, mainly due to privacy regulations and the difficulty of making sure the data is anonymous.
2. **Generating synthetic data using large language models (LLMs)**:
   - Large language models (such as GPT-4o) offer a promising approach to creating large-scale, privacy-preserving synthetic data for exploring various aspects of student learning, developing and testing educational technologies, and supporting research when collecting real student data is challenging or impractical.
3. **Generating synthetic buggy code submissions**:
   - This study uses GPT-4o to generate synthetic buggy code submissions for introductory programming exercises. It then compares the test case failure distributions of the synthetic data with those of real student data from two courses to analyze how accurately the synthetic data mimics real student data.
4. **Research contributions**:
   - This study contributes to the development of reliable synthetic datasets for computing education research and teaching, which may accelerate progress in the field while preserving student privacy.

### Specific research questions

The main research question in this paper is:

> To what extent can generative AI models be used to generate synthetic buggy code submissions for introductory programming exercises?

By answering this question, the study aims to show how well large language models can generate buggy code similar to that of real students and to explore the effectiveness of different prompting strategies.
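The comparison of test case failure distributions described above can be illustrated with a small sketch. This is not the paper's analysis code, and the summary does not say which statistical test the authors used. The function names, the choice of total variation distance and a Pearson chi-square statistic, and the toy data are all assumptions made for illustration. Each submission is represented by the list of test cases it fails.

```python
from collections import Counter


def signature_counts(submissions):
    """Count how often each failure signature (set of failed tests) occurs."""
    return Counter(frozenset(failed) for failed in submissions)


def total_variation_distance(real, synthetic):
    """Half the L1 distance between the two relative-frequency distributions."""
    real_counts = signature_counts(real)
    syn_counts = signature_counts(synthetic)
    n_real = sum(real_counts.values())
    n_syn = sum(syn_counts.values())
    signatures = set(real_counts) | set(syn_counts)
    return 0.5 * sum(
        abs(real_counts[s] / n_real - syn_counts[s] / n_syn) for s in signatures
    )


def chi_square_statistic(real, synthetic):
    """Pearson chi-square statistic for homogeneity of the two samples."""
    real_counts = signature_counts(real)
    syn_counts = signature_counts(synthetic)
    n_real = sum(real_counts.values())
    n_syn = sum(syn_counts.values())
    n_total = n_real + n_syn
    statistic = 0.0
    for s in set(real_counts) | set(syn_counts):
        column_total = real_counts[s] + syn_counts[s]
        for observed, row_total in ((real_counts[s], n_real), (syn_counts[s], n_syn)):
            expected = row_total * column_total / n_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical toy data: each inner list names the tests one submission failed.
real_failures = [["t1"], ["t1"], ["t1", "t2"], ["t3"]]
synthetic_failures = [["t1"], ["t1", "t2"], ["t1", "t2"], ["t3"]]

print(total_variation_distance(real_failures, synthetic_failures))
print(chi_square_statistic(real_failures, synthetic_failures))
```

A small distance or statistic suggests the synthetic submissions fail tests in roughly the same pattern as the real ones. For a p-value, the statistic would still need to be compared against a chi-square distribution with the appropriate degrees of freedom, and real samples would need to be far larger than this toy example.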

LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education
