ACcoding: A graph-based dataset for online judge programming

Kairui Chen, Fuqun Huang, Zejing Liu, Haomiao Yu, Liuchang Meng, Shasha Mo, Li Zhang, You Song
DOI: https://doi.org/10.1038/s41597-024-03392-z
2024-05-30
Scientific Data
Abstract: A well-designed educational programming dataset is a valuable asset for students and educators. Such a dataset enables students to improve their programming performance continuously, and provides researchers with a significant data source for identifying students' learning behaviours and enhancing the quality of programming education. Existing datasets for programming education are limited either by a small number of participating students or by a short span of learning records, which makes it difficult to investigate students' learning patterns in programming. We present a large-scale, graph-based dataset specialized in programming learning on an Online Judge (OJ) platform. The dataset, named ACcoding, was built by a university teaching group. As of the submission date of the initial manuscript of this paper (May 6, 2022), the dataset contains 4,046,652 task-solving records submitted by 27,444 students on 4,559 programming tasks over a span of 6 years. The large size of the dataset, combined with rich functional features, empowers educators to trace students' programming progress and choose appropriate programming tasks for specific training purposes. We also present example applications of the dataset.
multidisciplinary sciences
What problem does this paper attempt to address?
The paper attempts to address the lack of high-quality, large-scale datasets in programming education. Specifically:

- **Limitations of existing datasets**: Existing programming education datasets often have a limited number of tasks or a short recording time span, making it difficult to comprehensively reveal students' programming learning patterns.
- **Uniqueness of programming education**: Compared to subjects like mathematics and English, programming education has unique characteristics: answers take the form of source code, a single problem may have multiple correct answers, online programming platforms provide various types of feedback (e.g., runtime status and memory consumption), and students can submit answers repeatedly.

The paper proposes a large-scale, graph-based programming learning dataset called ACcoding, aiming to overcome the above limitations and support various tasks in Educational Data Mining (EDM), such as knowledge tracing, learning path recommendation, and error analysis. The ACcoding dataset includes 4,046,652 programming task submission records from 27,444 students, covering 4,559 programming tasks over a span of 6 years. Additionally, the dataset provides a dynamic programming knowledge graph to better understand students' programming learning behaviors.
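
To make the idea of graph-structured submission records more concrete, here is a minimal sketch of how repeated submissions might be grouped into student-task edges. The field names (`student_id`, `task_id`, `verdict`, `runtime_ms`, `memory_kb`), the verdict codes, and the sample records are all assumptions made for illustration; they are not taken from the actual ACcoding schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    # Hypothetical fields; the real ACcoding schema may differ.
    student_id: str
    task_id: str
    verdict: str      # assumed codes: "AC" (accepted), "WA" (wrong answer)
    runtime_ms: int
    memory_kb: int


def build_graph(submissions):
    """Group submissions by (student, task) edge, preserving input order."""
    edges = defaultdict(list)
    for sub in submissions:
        edges[(sub.student_id, sub.task_id)].append(sub)
    return dict(edges)


def attempts_until_accepted(attempts):
    """Return the 1-based index of the first accepted attempt, or None."""
    for i, sub in enumerate(attempts, start=1):
        if sub.verdict == "AC":
            return i
    return None


# Made-up records, listed in submission order.
records = [
    Submission("s1", "t1", "WA", 12, 2048),
    Submission("s1", "t1", "AC", 10, 2048),
    Submission("s2", "t1", "AC", 15, 4096),
    Submission("s1", "t2", "WA", 30, 1024),
]

graph = build_graph(records)
first_ac = {edge: attempts_until_accepted(subs) for edge, subs in graph.items()}
```

Here each key of `graph` is one student-task edge and its value is the ordered attempt history. A per-edge statistic such as `first_ac` is the kind of signal that knowledge tracing or error analysis could start from.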