ACcoding: A graph-based dataset for online judge programming

Kairui Chen,Fuqun Huang,Zejing Liu,Haomiao Yu,Liuchang Meng,Shasha Mo,Li Zhang,You Song

DOI: https://doi.org/10.1038/s41597-024-03392-z

2024-05-30

Scientific Data

Abstract: A well-designed educational programming dataset is a valuable asset for students and educators. Such a dataset enables students to improve their programming performance continuously and provides researchers with significant data sources to identify students' learning behaviours and enhance the quality of programming education. Several existing datasets for programming education are limited by either a small number of participating students or a short span of learning records, which makes it challenging to investigate students' learning patterns in programming. We present a graph-based large-scale dataset specialized in programming learning on Online Judge (OJ) platforms. The dataset, named ACcoding, was built by a university teaching group. As of the submission date of the initial manuscript of this paper (May 6, 2022), the dataset contains 4,046,652 task-solving records submitted by 27,444 students on 4,559 programming tasks over a span of 6 years. The large size of the dataset, combined with rich functional features, empowers educators to trace students' programming progress and choose appropriate programming tasks for specific training purposes. We also present examples of applications that use the dataset.

multidisciplinary sciences

What problem does this paper attempt to address?

The paper attempts to address the lack of high-quality, large-scale datasets in programming education. Specifically:

- **Limitations of existing datasets**: Existing programming education datasets often have a limited number of tasks or a short recording time span, making it difficult to comprehensively reveal students' programming learning patterns.
- **Uniqueness of programming education**: Compared to subjects like mathematics and English, programming education has unique characteristics: answers take the form of source code, a single problem can have multiple correct answers, online programming platforms provide various types of feedback (e.g., runtime status and memory consumption), and students can submit answers repeatedly.

The paper proposes a large-scale, graph-based programming learning dataset called ACcoding, aiming to overcome the above limitations and support various tasks in Educational Data Mining (EDM), such as knowledge tracing, learning path recommendation, and error analysis. The ACcoding dataset includes 4,046,652 programming task submission records from 27,444 students, covering 4,559 programming tasks over a span of 6 years. Additionally, the dataset provides a dynamic programming knowledge graph to better understand students' programming learning behaviors.
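To make the idea of graph-structured submission records concrete, the following is a minimal sketch, not the dataset's actual schema. It assumes each record links one student to one task and carries judge feedback; the class `Submission`, its field names, the helper functions, and the status codes (`"AC"`, `"WA"`, `"TLE"`) are all hypothetical, chosen only for illustration. It groups repeated submissions onto student-task edges and counts how many attempts preceded the first accepted answer, the kind of signal a knowledge-tracing model might consume.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One task-solving record; every field name here is hypothetical."""
    student_id: str
    task_id: str
    order: int       # position in the student's sequence of attempts on the task
    status: str      # judge feedback, assumed codes: "AC", "WA", "TLE"
    memory_kb: int   # memory consumption reported by the judge


def build_graph(submissions):
    """Group records into edges of a bipartite student-task graph."""
    edges = defaultdict(list)
    for sub in submissions:
        edges[(sub.student_id, sub.task_id)].append(sub)
    # Keep each edge's attempts in submission order.
    for attempts in edges.values():
        attempts.sort(key=lambda s: s.order)
    return dict(edges)


def attempts_until_accepted(attempts):
    """Return the number of submissions up to the first "AC", or None."""
    for count, sub in enumerate(attempts, start=1):
        if sub.status == "AC":
            return count
    return None


# Toy records, invented for this example only.
records = [
    Submission("s1", "t1", 1, "WA", 1024),
    Submission("s1", "t1", 2, "AC", 1100),
    Submission("s2", "t1", 1, "AC", 980),
    Submission("s2", "t2", 1, "TLE", 2048),
]
graph = build_graph(records)
summary = {edge: attempts_until_accepted(a) for edge, a in graph.items()}
print(summary)
```

Keying edges by `(student_id, task_id)` keeps repeated submissions together, which reflects the point above that students can resubmit answers to the same task.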

Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant

Programming Online Judge System

VisOJ: real-time visual learning analytics dashboard for online programming judge

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Programming Knowledge Tracing: A Comprehensive Dataset and A New Model

How problem difficulty and order influence programming education outcomes in online judge systems

Programming grid: a computer-aided education system for programming courses based on online judge

TACO: Topics in Algorithmic COde Generation Dataset

Peking University Oneline Judge and Its Applications

Hybrid Estimation for Open-Ended Questions with Early-Age Students' Block-Based Programming Answers

Estimating Difficulty Levels of Programming Problems with Pre-trained Model

Educational Programming Systems for Learning at Scale

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search

PROGpedia: Collection of source-code submitted to introductory programming assignments

Automatically Learning Topics and Difficulty Levels of Problems in Online Judge Systems

InstructCoder: Instruction Tuning Large Language Models for Code Editing

Small Private Online Judge: A New Tool for Empirical Education Research

CodeQA: A Question Answering Dataset for Source Code Comprehension

QACP: An Annotated Question Answering Dataset for Assisting Chinese Python Programming Learners