A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education

Jacob Doughty, Zipiao Wan, Anishka Bompelli, Jubahed Qayum, Taozhi Wang, Juran Zhang, Yujia Zheng, Aidan Doyle, Pragnya Sridhar, Arav Agarwal, Christopher Bogart, Eric Keylor, Can Kultur, Jaromir Savelka, Majd Sakr
DOI: https://doi.org/10.1145/3636243.3636256
2023-12-06
Abstract: There is a constant need for educators to develop and maintain effective up-to-date assessments. While there is a growing body of research in computing education on utilizing large language models (LLMs) in generation and engagement with coding exercises, the use of LLMs for generating programming MCQs has not been extensively explored. We analyzed the capability of GPT-4 to produce multiple-choice questions (MCQs) aligned with specific learning objectives (LOs) from Python programming classes in higher education. Specifically, we developed an LLM-powered (GPT-4) system for generation of MCQs from high-level course context and module-level LOs. We evaluated 651 LLM-generated and 449 human-crafted MCQs aligned to 246 LOs from 6 Python courses. We found that GPT-4 was capable of producing MCQs with clear language, a single correct choice, and high-quality distractors. We also observed that the generated MCQs appeared to be well-aligned with the LOs. Our findings can be leveraged by educators wishing to take advantage of the state-of-the-art generative models to support MCQ authoring efforts.
Computers and Society, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper addresses the automatic generation of high-quality multiple-choice questions (MCQs) for programming education. The researchers developed a system based on a large language model (LLM), GPT-4, that generates MCQs aligned with specific learning objectives (LOs) drawn from university-level Python programming courses. The study evaluates the quality of the MCQs generated by GPT-4 and their alignment with the learning objectives, with the aim of reducing the time and effort educators spend developing assessments.

The paper explores this topic through the following research questions:

1. To what extent do the generated MCQs meet typical quality requirements? Specifically, do they:
   - (i) provide sufficient information in clear language;
   - (ii) have a correct answer;
   - (iii) have high-quality distractors;
   - (iv) contain syntactically and logically correct code?
2. How well are the generated MCQs aligned with the specified module-level learning objectives?

By answering these questions, the researchers hope to offer educators an effective way to use state-of-the-art generative models in MCQ authoring, thereby improving the efficiency and quality of programming education.
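
Some of the quality criteria in the first research question can be partly screened mechanically before any human review. The following is a minimal sketch of such a pre-check, not the authors' evaluation procedure: the `MCQ` class, its field names, and the `structural_issues` function are hypothetical names introduced here for illustration. It only tests whether exactly one choice is marked correct, whether choices are duplicated, and whether an attached Python snippet parses. Clarity of language, distractor quality, logical correctness of code, and alignment with a learning objective still require human judgment.

```python
import ast
from dataclasses import dataclass


@dataclass
class MCQ:
    """Hypothetical container for one multiple-choice question."""
    stem: str            # question text shown to the student
    choices: list[str]   # answer options, in display order
    correct: set[int]    # indices into `choices` marked as correct
    code: str = ""       # optional Python snippet shown with the stem


def structural_issues(q: MCQ) -> list[str]:
    """Return mechanical problems found in `q` (empty list if none)."""
    issues = []
    if not q.stem.strip():
        issues.append("empty stem")
    if len(q.correct) != 1:
        issues.append("expected exactly one correct choice")
    if any(i < 0 or i >= len(q.choices) for i in q.correct):
        issues.append("correct index out of range")
    if len({c.strip() for c in q.choices}) != len(q.choices):
        issues.append("duplicate choices")
    if q.code:
        try:
            # ast.parse checks syntax only; it does not run the snippet.
            ast.parse(q.code)
        except SyntaxError as exc:
            issues.append(f"code does not parse: {exc.msg}")
    return issues
```

A question that passes this check is only structurally well-formed; an empty result says nothing about whether the marked answer is actually right or whether the distractors are plausible.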