Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task

Fanyi Qu, Yunfang Wu
2023-07-08
Abstract: Large-scale language models (LLMs) have shown remarkable capability in a variety of Natural Language Processing (NLP) tasks and have attracted considerable attention recently. However, some studies indicate that large language models fail to outperform state-of-the-art models on English grammatical error correction (GEC) tasks. In this report, we explore how large language models perform on Chinese grammatical error correction tasks and provide guidance for future work. We conduct experiments with 3 LLMs of different model scales on 4 Chinese GEC datasets. Our experimental results indicate that the performance of LLMs on automatic evaluation metrics falls short of the previous state-of-the-art models because of the problem of over-correction. Furthermore, we observe notable variations in the performance of LLMs when they are evaluated on different data distributions. Our findings demonstrate that further investigation is required for the application of LLMs to the Chinese GEC task.
Computation and Language
What problem does this paper attempt to address?
This paper explores the performance of large-scale language models (LLMs) on the Chinese grammatical error correction (Chinese GEC) task and provides guidance for future research. Specifically, the authors experimentally analyze LLMs of different scales on four Chinese GEC datasets, evaluate them on automatic evaluation metrics (such as the F0.5 score), and report the following main findings:

1. **Over-correction problem**: LLMs perform worse on the Chinese GEC task than the existing state-of-the-art models. The main reason is that LLMs tend to make unnecessary modifications to make the input sentences more fluent, which leads to over-correction and sometimes even changes the original meaning of the input sentences.
2. **Influence of data distribution**: LLM performance differs significantly across data distributions. For example, LLMs perform significantly better on datasets of Chinese learners than on datasets drawn from native-speaker exams. This is because the grammatical errors of Chinese learners are mainly concentrated on the misuse of similar words or phrases, while the errors in the native-speaker exam datasets are more complex and involve more structural errors.
3. **Influence of model scale**: Performance also differs between LLMs of different scales. The experimental results show that ChatGPT achieves significantly higher recall, while its precision is comparable to that of the smaller models. This indicates that models of different scales differ considerably in their error-detection capabilities.

In summary, the questions this paper attempts to answer are: What is the actual performance of large-scale language models on the Chinese grammatical error correction task? Are there specific problems that need further research and improvement?
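To make the link between over-correction and the F0.5 score concrete, here is a minimal Python sketch of the standard F-beta formula computed from edit counts. The function name `f_beta` and the counts in the usage lines are hypothetical illustrations, not figures or tooling from the paper; the paper's actual scores come from its own evaluation scripts.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Standard F-beta score from edit counts.

    tp: proposed edits that match a gold edit
    fp: proposed edits with no matching gold edit (e.g. over-corrections)
    fn: gold edits the system failed to propose
    beta < 1 weights precision more heavily than recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to illustrate the effect (100 gold edits each).
# A conservative system: few edits, most of them correct.
conservative = f_beta(tp=40, fp=10, fn=60)
# An over-correcting system: finds more errors but adds many unnecessary edits.
over_correcting = f_beta(tp=60, fp=90, fn=40)

print(f"conservative F0.5:    {conservative:.3f}")
print(f"over-correcting F0.5: {over_correcting:.3f}")
```

Because beta = 0.5 weights precision more heavily than recall, the over-correcting system in this toy example scores lower despite its higher recall, which is consistent with the pattern the summary describes for LLMs.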