Abstract: Evaluating the capabilities of Large Language Models (LLMs) to assist teachers and students in educational tasks is receiving increasing attention. In this paper, we assess ChatGPT's capacities to solve and grade real programming exams, from an accredited BSc degree in Computer Science, written in Spanish. Our findings suggest that this AI model is only effective for solving simple coding tasks. Its proficiency in tackling complex problems or evaluating solutions authored by others is far from effective. As part of this research, we also release a new corpus of programming tasks and the corresponding prompts for solving the problems or grading the solutions. This resource can be further exploited by other research teams.

What problem does this paper attempt to address?

This paper evaluates the capabilities of large language models (LLMs) in educational tasks, specifically ChatGPT's performance in solving and grading programming exams written in Spanish. The researchers focused on the following aspects:

1. **Evaluating ChatGPT's ability to solve programming and algorithm problems**: The researchers used a set of real programming exam questions from a Computer Science bachelor's degree curriculum. These questions range from basic coding exercises to complex reasoning tasks. They tested ChatGPT's performance under different prompt conditions.
2. **Evaluating the feasibility of ChatGPT as an automatic grading tool**: Besides its use as a problem-solving assistant, the researchers explored whether ChatGPT can effectively evaluate the quality of answers submitted by students. To this end, they selected a sample of student exams and compared the grades given by ChatGPT with those given by teachers.
3. **Providing new data resources**: To support future research, the authors released a new corpus containing programming tasks and their corresponding prompts. This will help other research teams further evaluate the ability of LLMs to solve programming problems.

### Main findings

- **Problem-solving ability**: ChatGPT can reach a passing grade on some simple programming tasks, but performs poorly on complex problems. Its performance is especially weak on advanced concepts such as the formal specification of abstract data types (ADTs) and computational complexity analysis.
- **Grading ability**: ChatGPT's quality assessments of human-written answers deviate significantly from the teachers' grades. It tends to overestimate the quality of solutions, even for low-scoring exams. It is therefore not currently suitable as a reliable grading tool.
- **The influence of prompt strategies**: Complex prompts did not significantly improve ChatGPT's performance and may instead have introduced confusion. This suggests that simple, direct questions may be a more effective way of interacting with the model.

### Conclusion

Although ChatGPT shows some potential in handling basic programming tasks, it cannot yet handle complex programming problems or accurately evaluate students' work. In addition, given its limited support for non-English languages, future research should continue to explore how to improve the performance of these models in multilingual settings.
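
The comparison between ChatGPT's grades and the teachers' grades mentioned above can be illustrated with a minimal Python sketch. This is not the authors' evaluation procedure: the function name `grading_bias`, the 0-10 scale, and the grade values are all hypothetical, invented only to show how a tendency to overestimate would appear as a positive mean signed difference.

```python
from statistics import mean


def grading_bias(teacher_grades, model_grades):
    """Return (mean signed difference, mean absolute difference).

    Differences are computed as model grade minus teacher grade, so a
    positive signed value means the model grades more generously on average.
    """
    if len(teacher_grades) != len(model_grades):
        raise ValueError("grade lists must have the same length")
    if not teacher_grades:
        raise ValueError("grade lists must not be empty")
    diffs = [m - t for t, m in zip(teacher_grades, model_grades)]
    return mean(diffs), mean([abs(d) for d in diffs])


# Hypothetical grades on a 0-10 scale, invented for illustration only;
# they are NOT results from the paper.
teacher = [3.0, 5.0, 8.0, 2.0]
model = [6.0, 7.0, 8.0, 6.0]

bias, mae = grading_bias(teacher, model)
print(f"mean signed difference: {bias:+.2f}")
print(f"mean absolute difference: {mae:.2f}")
```

Reporting the signed and the absolute difference together separates systematic over-grading from disagreement that cancels out on average.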

ChatGPT as a Solver and Grader of Programming Exams written in Spanish

The potential of large language models for improving probability learning: A study on ChatGPT3.5 and first-year computer engineering students

Can ChatGPT Play the Role of a Teaching Assistant in an Introductory Programming Course?

ChatGPT, Can You Generate Solutions for my Coding Exercises? An Evaluation on its Effectiveness in an undergraduate Java Programming Course

Kattis vs. ChatGPT: Assessment and Evaluation of Programming Tasks in the Age of Artificial Intelligence

Can ChatGPT Pass An Introductory Level Functional Language Programming Course?

Extending the Frontier of ChatGPT: Code Generation and Debugging

Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures

Can Chat GPT solve a Linguistics Exam?

ChatGPT-4 as a Tool for Reviewing Academic Books in Spanish

Analyzing Chat Protocols of Novice Programmers Solving Introductory Programming Tasks with ChatGPT

Analysis of ChatGPT on Source Code

ChatGPT in Linear Algebra: Strides Forward, Steps to Go

ChatGPT in the classroom. Exploring its potential and limitations in a Functional Programming course

Is ChatGPT a General-Purpose Natural Language Processing Task Solver?

"It's not like Jarvis, but it's pretty close!" -- Examining ChatGPT's Usage among Undergraduate Students in Computer Science

ChatGPT as an AI L2 teaching support: A case study of an EFL teacher

LLM examiner: automating assessment in informal self-directed e-learning using ChatGPT

Evaluating ChatGPT-Generated Linear Algebra Formative Assessments

Students' Experiences of Using ChatGPT in an Undergraduate Programming Course

Can ChatGPT pass Glycobiology?