Abstract:Generative AI and large language models hold great promise in enhancing computing education by powering next-generation educational technologies. State-of-the-art models like OpenAI’s ChatGPT [8] and GPT-4 [9] could enhance programming education in various roles, e.g., by acting as a personalized digital tutor for a student, a digital assistant for an educator, and a digital peer for collaborative learning [1, 2, 7]. In our work, we seek to comprehensively evaluate and benchmark state-of-the-art large language models for various scenarios in programming education. Recent works have evaluated several large language models in the context of programming education [4, 6, 10, 11, 12]. However, these works are limited for several reasons: they have typically focused on evaluating a specific model for a specific education scenario (e.g., generating explanations), or have considered models that are already outdated (e.g., OpenAI’s Codex [3] is no longer publicly available since March 2023). Consequently, there is a lack of systematic study that benchmarks state-of-the-art models for a comprehensive set of programming education scenarios. In our work, we systematically evaluate two models, ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human tutors for a variety of scenarios in programming education. These scenarios are designed to capture distinct roles these models could play, namely digital tutors, assistants, and peers, as discussed above. More concretely, we consider the following six scenarios: (1) program repair, i.e., fixing a student’s buggy program; (2) hint generation, i.e., providing a natural language hint to the student to help resolve current issues; (3) grading feedback, i.e., grading a student’s program w.r.t. a given rubric; (4) peer programming, i.e., completing a partially written program or generating a sketch for the solution program; (5) task creation, i.e., generating new tasks that exercise specific types of concepts or bugs; (6) contextualized explanation, i.e., explaining specific concepts or functions in the context of a given program. Our study uses a mix of quantitative and qualitative evaluation to compare the performance of these models with the performance of human tutors. We conduct our evaluation based on 5 introductory Python programming problems with a diverse set of input/output specifications. For each of these problems, we consider 5 buggy programs based on publicly accessible submissions from geeksforgeeks.org [5] (see Figure 1); these buggy programs are picked to capture different types of bugs for each problem. We will provide a detailed analysis of the data and results in a longer version of this poster. Our preliminary results show that GPT-4 drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human tutors’ performance for several scenarios.

Kattis vs. ChatGPT: Assessment and Evaluation of Programming Tasks in the Age of Artificial Intelligence

How Novice Programmers Use and Experience ChatGPT when Solving Programming Exercises in an Introductory Course

ChatGPT, Can You Generate Solutions for my Coding Exercises? An Evaluation on its Effectiveness in an undergraduate Java Programming Course

Potentiality of generative AI tools in higher education: Evaluating ChatGPT's viability as a teaching assistant for introductory programming courses

Exploring the Use of ChatGPT as a Tool for Learning and Assessment in Undergraduate Computer Science Curriculum: Opportunities and Challenges

Assessing Learning of Computer Programming Skills in the Age of Generative Artificial Intelligence.

ChatGPT in the classroom. Exploring its potential and limitations in a Functional Programming course

Students' Experiences of Using ChatGPT in an Undergraduate Programming Course

The AI Companion in Education: Analyzing the Pedagogical Potential of ChatGPT in Computer Science and Engineering

Cheaters or AI-Enhanced Learners: Consequences of ChatGPT for Programming Education

Beyond the Hype: A Cautionary Tale of ChatGPT in the Programming Classroom

Can ChatGPT Play the Role of a Teaching Assistant in an Introductory Programming Course?

ChatGPT: Challenges and Benefits in Software Programming for Higher Education

"Will I be replaced?" Assessing ChatGPT's effect on software development and programmer perceptions of AI tools

Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust

Student-AI Interaction: A Case Study of CS1 students

Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation

Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors

What factors will affect the effectiveness of using ChatGPT to solve programming problems? A quasi-experimental study

ChatGPT versus engineering education assessment: a multidisciplinary and multi-institutional benchmarking and analysis of this generative artificial intelligence tool to investigate assessment integrity

Student Mastery or AI Deception? Analyzing ChatGPT's Assessment Proficiency and Evaluating Detection Strategies