Abstract: Large Language Models (LLMs) have been reported to outperform existing automatic evaluation metrics in some tasks, such as text summarization and machine translation. However, there has been a lack of research on LLMs as evaluators in grammatical error correction (GEC). In this study, we investigate the performance of LLMs in GEC evaluation by employing prompts designed to incorporate various evaluation criteria inspired by previous research. Our extensive experimental results demonstrate that GPT-4 achieved Kendall's rank correlation of 0.662 with human judgments, surpassing all existing methods. Furthermore, in recent GEC evaluations, we have underscored the significance of LLM scale and particularly emphasized the importance of fluency among evaluation criteria.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper explores how well large language models (LLMs) perform as evaluators of Grammatical Error Correction (GEC). Specifically, the authors measure LLM performance in GEC evaluation using prompts that incorporate various evaluation criteria proposed in previous studies.

#### Main problems

1. **Lack of research on LLMs as GEC evaluation tools**: Although LLMs have been shown to outperform existing automatic evaluation metrics in tasks such as text summarization, dialogue generation, and machine translation, they have not been studied as evaluators for GEC.
2. **The influence of evaluation criteria**: The authors pay special attention to how evaluation criteria (such as fluency, grammatical correctness, and meaning preservation) affect the evaluation performance of LLMs.
3. **The influence of model scale**: The study also compares LLMs of different scales (such as the smaller Llama 2 and the larger GPT-4) in GEC evaluation.

#### Research objectives

- **Verify the effectiveness of LLMs in GEC evaluation**: Test experimentally whether LLMs (especially GPT-4) can surpass existing evaluation methods and correlate highly with human judgment.
- **Evaluate the performance of LLMs of different scales**: Analyze how LLMs of different scales perform in GEC evaluation, especially on fluency-corrected sentences.
- **Optimize evaluation prompts**: Explore how prompt design, in particular the introduction of specific evaluation criteria (such as fluency and grammatical correctness), can improve LLM performance in GEC evaluation.

#### Experimental results

- **GPT-4 performs best**: GPT-4 shows a high correlation with human judgment in both system-level and sentence-level evaluations. With the prompt that considers fluency, its Kendall rank correlation coefficient reaches 0.662, surpassing all existing methods.
- **Model scale matters**: As the scale of the LLM decreases, its correlation with human judgment also decreases. Smaller models perform worse than larger ones in particular on fluency-corrected sentences.
- **Evaluation criteria matter**: Prompts that introduce specific evaluation criteria (such as fluency and grammatical correctness) improve the evaluation performance of LLMs, and fluency stands out as especially important among them.

In conclusion, through systematic experiments and analysis, this paper demonstrates the potential of LLMs (especially GPT-4) as GEC evaluators and highlights the influence of evaluation criteria and model scale on evaluation quality.
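To make the reported metric concrete, here is a minimal sketch of how a Kendall rank correlation between evaluator scores and human judgments can be computed. It implements the plain tau-a variant (concordant minus discordant pairs, divided by all pairs); the paper's exact formulation may differ, and the function name `kendall_tau` and the scores below are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Pairs tied in either ranking count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1  # both rankings order this pair the same way
        elif product < 0:
            discordant += 1  # the rankings disagree on this pair

    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores for five corrected sentences (not data from the paper).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.0]
llm_scores = [4.0, 3.5, 1.5, 4.8, 2.0]

print(kendall_tau(human_scores, llm_scores))  # 0.8
```

In this toy example the two rankings disagree on one of the ten sentence pairs, giving 9 concordant and 1 discordant pair and a correlation of 0.8. A value of 1.0 would mean the evaluator orders every pair the same way humans do, and -1.0 would mean the opposite order on every pair.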

Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction