Abstract:Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies. State-of-the-art models like OpenAI’s ChatGPT [8] and GPT-4 [9] could enhance programming education in various roles, e.g., by acting as a personalized digital tutor for a student, a digital assistant for an educator, and a digital peer for collaborative learning [1, 2, 7]. In our work, we seek to comprehensively evaluate and benchmark state-of-the-art large language models for various scenarios in programming education. Recent works have evaluated several large language models in the context of programming education [4, 6, 10, 11, 12]. However, these works are limited for several reasons: they have typically focused on evaluating a specific model for a specific education scenario (e.g., generating explanations), or have considered models that are already outdated (e.g., OpenAI’s Codex [3] is no longer publicly available since March 2023). Consequently, there is a lack of systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios in programming education. These scenarios are designed to capture distinct roles these models could play, namely digital tutors, assistants, and peers, as discussed above. More concretely, we consider the following six scenarios: (1) program repair, i.e., fixing a student’s buggy program; (2) hint generation, i.e., providing a natural language hint to the student to help resolve current issues; (3) grading feedback, i.e., grading a student’s program w.r.t. a given rubric; (4) peer programming, i.e., completing a partially written program or generating a sketch for the solution program; (5) task creation, i.e., generating new tasks that exercise specific types of concepts or bugs; (6) contextualized explanation, i.e., explaining specific concepts or functions in the context of a given program. Our study uses a mix of quantitative and qualitative evaluation to compare the performance of these models with the performance of human tutors. We conduct our evaluation based on 5 introductory Python programming problems with a diverse set of input/output specifications. For each of these problems, we consider 5 buggy programs based on publicly accessible submissions from geeksforgeeks.org [5] (see Figure 1); these buggy programs are picked to capture different types of bugs for each problem. We will provide a detailed analysis of the data and results in a longer version of this poster. Our preliminary results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors’ performance for several scenarios.

Examining the Influence of Varied Levels of Domain Knowledge Base Inclusion in GPT-based Intelligent Tutors

How Good is ChatGPT in Giving Adaptive Guidance Using Knowledge Graphs in E-Learning Environments?

From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs

Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors

Exploring Knowledge Tracing in Tutor-Student Dialogues using LLMs

Investigating the Efficacy of ChatGPT-3.5 for Tutoring in Chinese Elementary Education Settings

Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family

Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT

Leveraging GenAI for an Intelligent Tutoring System for R: A Quantitative Evaluation of Large Language Models

KnowledGPT: Enhancing Large Language Models with Retrieval and Storage Access on Knowledge Bases

Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education

Large Language Model‐Based Chatbots in Higher Education

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Assessing the Efficacy of ChatGPT in Solving Questions Based on the Core Concepts in Physiology

Experiences from Integrating Large Language Model Chatbots into the Classroom

From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer Multiple-choice Questions for Programming Classes in Higher Education

The model student: GPT-4 performance on graduate biomedical science exams

Utilizing ChatGPT in a Data Structures and Algorithms Course: A Teaching Assistant's Perspective

Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

The AI Companion in Education: Analyzing the Pedagogical Potential of ChatGPT in Computer Science and Engineering