Abstract: Large language models (LLMs) present an exciting opportunity for generating synthetic classroom data. Such data could include code containing a typical distribution of errors, simulated student behaviour to address the cold start problem when developing education tools, and synthetic user data when access to authentic data is restricted due to privacy reasons. In this research paper, we conduct a comparative study examining the distribution of bugs generated by LLMs in contrast to those produced by computing students. Leveraging data from two previous large-scale analyses of student-generated bugs, we investigate whether LLMs can be coaxed to exhibit bug patterns that are similar to authentic student bugs when prompted to inject errors into code. The results suggest that unguided, LLMs do not generate plausible error distributions, and many of the generated errors are unlikely to be generated by real students. However, with guidance including descriptions of common errors and typical frequencies, LLMs can be shepherded to generate realistic distributions of errors in synthetic code.
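As a rough illustration of the guidance the abstract mentions (descriptions of common errors, optionally with typical frequencies), the sketch below assembles a bug-injection prompt at three guidance levels. The function `build_prompt`, the prompt wording, and the error categories and percentages are all hypothetical; they are not the prompts or figures used in the paper.

```python
# Hypothetical error categories and shares, for illustration only.
# These are NOT the figures from the student datasets used in the paper.
EXAMPLE_ERRORS = {
    "mismatched brackets or parentheses": 0.40,
    "calling a method with the wrong argument types": 0.35,
    "missing return statement": 0.25,
}


def build_prompt(code: str, strategy: str = "open") -> str:
    """Return a bug-injection prompt for one of three guidance levels.

    strategy:
      "open"           - no guidance about which error to inject
      "classification" - a list of common error types is supplied
      "frequency"      - the list is supplied together with typical shares
    """
    if strategy not in ("open", "classification", "frequency"):
        raise ValueError(f"unknown strategy: {strategy}")

    lines = [
        "Rewrite the following program so that it contains one error "
        "a novice programmer might plausibly make."
    ]
    if strategy != "open":
        lines.append("Choose the error from this list of common student errors:")
        for name, share in EXAMPLE_ERRORS.items():
            if strategy == "frequency":
                lines.append(f"- {name} (about {share:.0%} of student errors)")
            else:
                lines.append(f"- {name}")
    lines.append("Program:")
    lines.append(code)
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("int x = 1;", strategy="frequency"))
```

Keeping the three levels in one function makes it easy to send the same program through each level and compare the resulting error distributions.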


### What problems does this paper attempt to solve?

This paper explores whether large language models (LLMs) can generate synthetic error-prone code with an error distribution similar to that in code written by real students. The main research questions are:

1. **RQ1: Can LLMs generate error-prone code upon request?**
   - The researchers want to understand whether LLMs can generate code containing errors when requested.
2. **RQ2: How much does prompt engineering affect the error distribution generated by LLMs?**
   - Using different prompt strategies (such as providing classification information or frequency information for error types), the researchers evaluate how these prompts affect the distribution of errors generated by LLMs.
3. **RQ3: What is the correlation or deviation between the error distribution generated by LLMs and that of human students?**
   - The researchers evaluate whether the errors generated by LLMs are realistic and representative by comparing the error distribution generated by LLMs with that in code written by actual students.

### Research background and motivation

Large language models have shown great potential for generating synthetic data. In programming education in particular, synthetic code containing typical student errors can serve several purposes, such as:

- Working around the unavailability of real student data due to privacy issues.
- Providing initial user data for developing educational tools (such as intelligent tutoring systems) to solve the "cold-start" problem.
- Providing teaching resources that help teachers show students common errors.

However, the error distribution generated by unguided LLMs may be unrealistic, and many of the generated error types are unlikely to be made by real students. The researchers therefore try to make LLMs generate a more realistic error distribution through appropriate prompts (such as providing common error types and their frequencies).

### Method overview

To answer the above research questions, the researchers conducted two studies:

1. **Study 1: Replicating the work of Altadmri and Brown**
   - Use GPT-4 to generate errors in Java programs and evaluate the generated error distribution under three different prompt strategies (open-ended prompt, classification prompt, frequency prompt).
   - Compare the generated error distribution with the error distribution in the original student data.
2. **Study 2: Replicating the work of Rigby et al. on off-by-one errors in C array iteration**
   - Generate off-by-one errors in C code using the same prompt strategies and compare the generated error distribution with the distribution in the original student data.

### Main findings

- **Unguided LLMs**: The generated error distribution is quite different from the real student data, and many of the generated error types are not ones students actually make.
- **Classification prompt**: After a list of common error types is provided, the error distribution generated by LLMs is closer to the real student data.
- **Frequency prompt**: After the specific frequencies of common errors are also provided, the error distribution generated by LLMs is the closest to the real student data, with no statistically significant difference.

### Conclusion

The research shows that, with appropriate prompt strategies, LLMs can generate synthetic code with an error distribution similar to that in code written by real students. This matters for programming education research and practice: where real student data cannot be obtained, synthetic data can be an effective alternative.
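
The comparison between a generated error distribution and the student distribution can be sketched as a chi-square goodness-of-fit check. This is only an illustration: the summary does not say which statistical test the authors used, so the choice of test is an assumption, and the category names and counts below are made up rather than taken from the paper.

```python
# Critical values of the chi-square distribution at alpha = 0.05,
# keyed by degrees of freedom (standard table values).
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(observed, expected_share):
    """Goodness-of-fit statistic for observed counts against expected shares.

    observed:       dict mapping error category -> count in generated code
    expected_share: dict mapping error category -> share in student data
                    (shares are assumed to sum to 1)
    Both dicts are assumed to use the same categories; a category missing
    from `observed` is treated as a count of zero.
    """
    total = sum(observed.values())
    stat = 0.0
    for category, share in expected_share.items():
        expected = total * share
        stat += (observed.get(category, 0) - expected) ** 2 / expected
    return stat


def differs_significantly(observed, expected_share):
    """True if the observed counts deviate from the shares at alpha = 0.05."""
    df = len(expected_share) - 1
    return chi_square_statistic(observed, expected_share) > CHI2_CRITICAL_05[df]


# Made-up example data: shares of three error categories in student code,
# and counts of each category in 100 generated programs per prompt strategy.
student_share = {"syntax": 0.5, "type": 0.3, "semantic": 0.2}
unguided_counts = {"syntax": 10, "type": 20, "semantic": 70}
frequency_counts = {"syntax": 52, "type": 29, "semantic": 19}

if __name__ == "__main__":
    print(differs_significantly(unguided_counts, student_share))
    print(differs_significantly(frequency_counts, student_share))
```

With these invented counts, the unguided sample deviates strongly from the student shares while the frequency-guided sample does not, which mirrors the qualitative pattern reported in the findings.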

Synthetic Students: A Comparative Study of Bug Distribution Between Large Language Models and Computing Students

LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education

Bugs in Large Language Models Generated Code: An Empirical Study

Large Language Models and Simple, Stupid Bugs

Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks

Large Language Models of Code Fail at Completing Code with Potential Bugs

Large Language Models for In-Context Student Modeling: Synthesizing Student's Behavior in Visual Programming

Automatically Generating CS Learning Materials with Large Language Models

An Exploratory Study on Using Large Language Models for Mutation Testing

BugSpotter: Automated Generation of Code Debugging Exercises

Student Data Paradox and Curious Case of Single Student-Tutor Model: Regressive Side Effects of Training LLMs for Personalized Learning

How to Teach Programming in the AI Era? Using LLMs as a Teachable Agent for Debugging

Insights from Social Shaping Theory: The Appropriation of Large Language Models in an Undergraduate Programming Course

Comparing Code Explanations Created by Students and Large Language Models

Are Large Language Models Memorizing Bug Benchmarks?

Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction

How Beginning Programmers and Code LLMs (Mis)read Each Other