ChatGPT as a Solver and Grader of Programming Exams written in Spanish

Pablo Fernández-Saborido, Marcos Fernández-Pichel, David E. Losada
2024-09-23
Abstract: Evaluating the capabilities of Large Language Models (LLMs) to assist teachers and students in educational tasks is receiving increasing attention. In this paper, we assess ChatGPT's capacities to solve and grade real programming exams, from an accredited BSc degree in Computer Science, written in Spanish. Our findings suggest that this AI model is only effective for solving simple coding tasks. Its proficiency in tackling complex problems or evaluating solutions authored by others is far from effective. As part of this research, we also release a new corpus of programming tasks and the corresponding prompts for solving the problems or grading the solutions. This resource can be further exploited by other research teams.
Artificial Intelligence
What problem does this paper attempt to address?
This paper evaluates the capabilities of large language models (LLMs) in educational tasks, focusing on ChatGPT's performance in solving and grading programming exams written in Spanish. Specifically, the researchers address the following aspects:

1. **Evaluating ChatGPT's ability to solve programming and algorithm problems**: The researchers used a set of real programming exam questions from a computer science bachelor's degree curriculum. These questions range from basic coding exercises to complex reasoning tasks. ChatGPT's performance was tested under different prompt conditions.
2. **Evaluating the feasibility of ChatGPT as an automatic grading tool**: Besides its role as a problem-solving assistant, the researchers also explored whether ChatGPT can effectively evaluate the quality of answers submitted by students. To this end, they selected a sample of student exam papers and compared the grades given by ChatGPT with those given by teachers.
3. **Providing new data resources**: To promote future research, the authors released a new corpus containing programming tasks and their corresponding prompts. This will help other research teams further evaluate the ability of LLMs to solve programming problems.

### Main findings

- **Problem-solving ability**: ChatGPT can reach a passing grade on some simple programming tasks, but performs poorly on complex problems. Its performance is especially weak on advanced concepts such as the formal specification of abstract data types (ADTs) and computational complexity analysis.
- **Grading ability**: ChatGPT's quality assessments of human-written answers deviate significantly from the teachers' grades. It tends to overestimate the quality of the solutions, even for low-scoring exam papers. It is therefore not currently suitable as a reliable grading tool.
- **The influence of prompt strategies**: Complex prompts did not significantly improve ChatGPT's performance and may instead have introduced confusion. This suggests that simple, direct questions may be a more effective way to interact with the model.

### Conclusion

Although ChatGPT shows some potential in handling basic programming tasks, it is not yet able to handle complex programming problems or to accurately evaluate students' work. In addition, because of its limited support for non-English languages, future research should continue to explore how to improve the performance of these models in multilingual settings.
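
The comparison between ChatGPT's grades and the teachers' grades mentioned above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `grading_deviation`, the choice of metrics, and the sample grades are all hypothetical and only show how an overestimation tendency could be quantified.

```python
from statistics import mean


def grading_deviation(teacher_grades, model_grades):
    """Summarize how model-assigned grades deviate from teacher grades.

    Hypothetical helper for illustration; the paper's actual metrics may differ.
    """
    if len(teacher_grades) != len(model_grades):
        raise ValueError("Both grade lists must have the same length")
    if not teacher_grades:
        raise ValueError("At least one graded answer is required")

    # Positive difference means the model graded higher than the teacher.
    diffs = [m - t for t, m in zip(teacher_grades, model_grades)]
    return {
        # Average signed difference: > 0 indicates systematic overestimation.
        "mean_signed_error": mean(diffs),
        # Average magnitude of disagreement, regardless of direction.
        "mean_absolute_error": mean(abs(d) for d in diffs),
        # Fraction of answers the model graded above the teacher.
        "overestimated_share": sum(d > 0 for d in diffs) / len(diffs),
    }


# Made-up grades for four answers (not data from the paper).
teacher = [2.0, 4.5, 6.0, 8.0]
model = [5.0, 6.5, 7.0, 8.5]
result = grading_deviation(teacher, model)
print(result)
```

A positive `mean_signed_error` combined with a high `overestimated_share` would correspond to the overestimation pattern described in the findings, with the largest gaps on the lowest teacher grades in this made-up sample.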