LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education

Juho Leinonen, Paul Denny, Olli Kiljunen, Stephen MacNeil, Sami Sarsa, Arto Hellas
2024-11-01
Abstract: There is a great need for data in computing education research. Data is needed to understand how students behave, to train models of student behavior to optimally support students, and to develop and validate new assessment tools and learning analytics techniques. However, relatively few computing education datasets are shared openly, often due to privacy regulations and issues in making sure the data is anonymous. Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data, which can be used to explore various aspects of student learning, develop and test educational technologies, and support research in areas where collecting real student data may be challenging or impractical. This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o. We compare the distribution of test case failures between synthetic and real student data from two courses to analyze the accuracy of the synthetic data in mimicking real student data. Our findings suggest that LLMs can be used to generate synthetic incorrect submissions that are not significantly different from real student data with regard to test case failure distributions. Our research contributes to the development of reliable synthetic datasets for computing education research and teaching, potentially accelerating progress in the field while preserving student privacy.
Computers and Society, Software Engineering
What problem does this paper attempt to address?
This paper addresses data scarcity in computing education research, specifically by generating synthetic buggy code submissions that can be used for research and teaching.

1. **Data requirements and privacy issues**:
   - Computing education research needs large amounts of data to understand student behavior, to train models of student behavior that can optimally support students, and to develop and validate new assessment tools and learning analytics techniques.
   - However, openly shared computing education datasets are relatively scarce, mainly because of privacy regulations and the difficulty of ensuring that data is anonymous.

2. **Generating synthetic data using large language models (LLMs)**:
   - Large language models (such as GPT-4o) offer a promising approach to creating large-scale, privacy-preserving synthetic data for exploring various aspects of student learning, developing and testing educational technologies, and supporting research where collecting real student data is challenging or impractical.

3. **Generating synthetic buggy code submissions**:
   - This study uses GPT-4o to generate synthetic buggy code submissions for introductory programming exercises. It then compares the test case failure distributions of the synthetic data with those of real student data from two courses to assess how accurately the synthetic data mimics real student data.

4. **Research contributions**:
   - This study contributes to the development of reliable synthetic datasets for computing education research and teaching, which may accelerate progress in the field while preserving student privacy.

### Specific research questions:

The main research question in this paper is:

> To what extent can generative AI models be used to generate synthetic buggy code submissions for introductory programming exercises?
By answering this question, the study aims to show how well large language models can generate buggy code similar to that written by real students, and to explore the effectiveness of different prompting strategies.
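
To make the idea of comparing test case failure distributions concrete, here is a minimal Python sketch. It is not the authors' method: this summary does not say which statistical test the paper uses, so the sketch substitutes total variation distance as a simple illustrative measure. The function names, the test case names, and the toy data are all hypothetical.

```python
from collections import Counter


def failure_distribution(submissions):
    """Return each test case's share of all recorded failures.

    `submissions` is a list of sets; each set holds the names of the
    test cases that one incorrect submission failed (assumed format).
    """
    counts = Counter(test for failed in submissions for test in failed)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {test: n / total for test, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over test cases.

    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    tests = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tests)


# Hypothetical toy data, not taken from the paper.
real = [{"t1"}, {"t1", "t2"}, {"t3"}, {"t1"}]
synthetic = [{"t1"}, {"t2"}, {"t2", "t3"}, {"t1"}]

real_dist = failure_distribution(real)
synthetic_dist = failure_distribution(synthetic)
distance = total_variation_distance(real_dist, synthetic_dist)
print(f"Total variation distance: {distance:.2f}")
```

A small distance suggests that synthetic submissions fail the same test cases in roughly the same proportions as real ones. A formal significance test would be needed to support a claim such as "not significantly different".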