Synthetic Students: A Comparative Study of Bug Distribution Between Large Language Models and Computing Students

Stephen MacNeil, Magdalena Rogalska, Juho Leinonen, Paul Denny, Arto Hellas, Xandria Crosland
DOI: https://doi.org/10.1145/3649165.3690100
2024-10-12
Abstract: Large language models (LLMs) present an exciting opportunity for generating synthetic classroom data. Such data could include code containing a typical distribution of errors, simulated student behaviour to address the cold start problem when developing education tools, and synthetic user data when access to authentic data is restricted due to privacy reasons. In this research paper, we conduct a comparative study examining the distribution of bugs generated by LLMs in contrast to those produced by computing students. Leveraging data from two previous large-scale analyses of student-generated bugs, we investigate whether LLMs can be coaxed to exhibit bug patterns that are similar to authentic student bugs when prompted to inject errors into code. The results suggest that unguided, LLMs do not generate plausible error distributions, and many of the generated errors are unlikely to be generated by real students. However, with guidance including descriptions of common errors and typical frequencies, LLMs can be shepherded to generate realistic distributions of errors in synthetic code.
Computers and Society, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper explores whether large language models (LLMs) can generate synthetic error-prone code whose error distribution resembles that of code written by real students. The main research questions are:

1. **RQ1: Can LLMs generate error-prone code upon request?**
   - The researchers ask whether LLMs can produce code containing errors when asked to.
2. **RQ2: How much does prompt engineering affect the error distribution generated by LLMs?**
   - Using different prompt strategies (such as supplying a classification of error types or their frequencies), the researchers evaluate how each prompt changes the distribution of errors the LLM generates.
3. **RQ3: What is the correlation or deviation between the error distribution generated by LLMs and that of human students?**
   - The researchers compare the error distribution generated by LLMs with that found in code written by actual students, to judge whether the generated errors are realistic and representative.

### Research background and motivation

Large language models have shown great potential for generating synthetic data. In programming education in particular, synthetic code containing typical student errors can serve several purposes:

- Standing in for real student data that cannot be obtained because of privacy restrictions.
- Providing initial user data for educational tools (such as intelligent tutoring systems), which addresses the "cold-start" problem.
- Giving teachers material that helps students understand common errors.

However, the error distribution generated by unguided LLMs may be unrealistic, and many of the generated error types are unlikely to be made by real students. The researchers therefore test whether appropriate prompts (such as lists of common error types and their frequencies) lead LLMs to generate a more realistic error distribution.

### Method overview

To answer these research questions, the researchers conducted two studies:

1. **Study 1: Replicating the work of Altadmri and Brown**
   - Use GPT-4 to generate errors in Java programs under three prompt strategies (open-ended prompt, classification prompt, frequency prompt).
   - Compare each generated error distribution with the error distribution in the original student data.
2. **Study 2: Replicating the work of Rigby et al. on off-by-one errors in C array iteration**
   - Generate off-by-one errors in C code using the same prompt strategies and compare the generated error distribution with the distribution in the original student data.

### Main findings

- **Unguided LLMs**: The generated error distribution differs substantially from the real student data, and many of the generated error types are ones real students are unlikely to produce.
- **Classification prompt**: When given a list of common error types, the LLM generates an error distribution closer to the real student data.
- **Frequency prompt**: When additionally given the frequencies of the common errors, the LLM generates the error distribution closest to the real student data, with no statistically significant difference.

### Conclusion

The research shows that, with appropriate prompt strategies, LLMs can generate synthetic code whose error distribution is similar to that of code written by real students. This matters for programming education research and practice: where real student data cannot be obtained, synthetic data can be an effective alternative.
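To make the method overview more concrete, the following is a minimal sketch of how the three prompt strategies (open-ended, classification, frequency) might be assembled, and of one way to measure how far a generated error distribution sits from a reference distribution. Everything here is an assumption for illustration: the prompt wording, the error category names, and the frequency values are hypothetical and are not taken from the paper or the original student datasets. The summary also does not name the statistical test the authors used, so a Pearson chi-square statistic is assumed here as one common choice.

```python
from typing import Dict

# Hypothetical error categories with made-up relative frequencies.
# These are NOT values from the paper or the original student data.
EXAMPLE_FREQUENCIES: Dict[str, float] = {
    "mismatched brackets": 0.5,
    "missing return statement": 0.3,
    "using = instead of ==": 0.2,
}


def build_prompt(strategy: str, code: str, frequencies: Dict[str, float]) -> str:
    """Return an error-injection prompt for one of the three strategies.

    The wording is illustrative only, not the prompts used in the study.
    """
    base = "Introduce one bug a novice programmer might make into this code:\n" + code
    if strategy == "open":
        # Open-ended: no guidance about which errors to produce.
        return base
    if strategy == "classification":
        # Classification: list the common error types, without frequencies.
        names = "\n".join(f"- {name}" for name in frequencies)
        return f"{base}\n\nChoose the bug from these common error types:\n{names}"
    if strategy == "frequency":
        # Frequency: list the error types together with how often each occurs.
        rows = "\n".join(f"- {name}: {share:.0%}" for name, share in frequencies.items())
        return f"{base}\n\nChoose the bug type with these probabilities:\n{rows}"
    raise ValueError(f"unknown strategy: {strategy}")


def chi_square_statistic(observed: Dict[str, int], expected_share: Dict[str, float]) -> float:
    """Pearson chi-square statistic of observed counts against expected shares."""
    total = sum(observed.values())
    stat = 0.0
    for name, share in expected_share.items():
        expected = total * share  # expected count for this category
        stat += (observed.get(name, 0) - expected) ** 2 / expected
    return stat
```

A statistic near zero means the generated errors follow the reference shares closely. Turning the statistic into a p-value requires the chi-square distribution with one fewer degree of freedom than the number of categories, which a statistics library can supply.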