Can GPT-4 do L2 analytic assessment?

Stefano Bannò, Hari Krishna Vydana, Kate M. Knill, Mark J. F. Gales
2024-04-29
Abstract: Automated essay scoring (AES) to evaluate second language (L2) proficiency has been a firmly established technology used in educational contexts for decades. Although holistic scoring has seen advancements in AES that match or even exceed human performance, analytic scoring still encounters issues as it inherits flaws and shortcomings from the human scoring process. The recent introduction of large language models presents new opportunities for automating the evaluation of specific aspects of L2 writing proficiency. In this paper, we perform a series of experiments using GPT-4 in a zero-shot fashion on a publicly available dataset annotated with holistic scores based on the Common European Framework of Reference and aim to extract detailed information about their underlying analytic components. We observe significant correlations between the automatically predicted analytic scores and multiple features associated with the individual proficiency components.
Computation and Language
What problem does this paper attempt to address?
This paper asks whether GPT-4 can extract information about analytic aspects of writing from the essays of second language (L2) learners and the holistic scores assigned to them. Specifically, the authors run a series of experiments in which GPT-4, in a zero-shot setting, predicts analytic scores for specific proficiency components on a publicly available dataset annotated only with holistic scores. They then check whether these predicted analytic scores correlate significantly with features associated with the individual proficiency components.

The background is that automated essay scoring (AES) now matches or even exceeds human performance on holistic scoring, but analytic scoring remains difficult. Analytic scoring assesses specific aspects of writing, such as vocabulary, grammar, and coherence, and assigns a separate score or rating to each. This matters for giving specific feedback that highlights learners' strengths and weaknesses and so supports their progress. However, analytic scoring is more complex and time-consuming for human raters than holistic scoring, so the resulting real-world scores are "noisy" and harder for automated systems to learn and predict. The paper therefore explores whether a large language model such as GPT-4 can infer analytic scores from holistic scores when no direct analytic scores are available.
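The correlation check described above can be illustrated with a minimal sketch. It assumes a rank (Spearman) correlation, which is one common choice; the statistic the authors actually use is not stated here. The function names, the feature (`lexical_diversity`), and all numbers are hypothetical, made up purely for illustration, and are not data or results from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy, made-up values (assumption): one predicted analytic score per essay,
# paired with one hand-crafted feature value for the same essay.
predicted_vocabulary_scores = [2.0, 3.5, 3.0, 4.5, 5.0]
lexical_diversity = [0.41, 0.55, 0.48, 0.60, 0.72]

rho = spearman(predicted_vocabulary_scores, lexical_diversity)
print(f"Spearman rho = {rho:.3f}")
```

In a real analysis the first list would hold the model's predicted scores for one analytic component and the second a feature computed from the same essays, and a significance test would accompany the coefficient.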