Abstract: Large-scale language models (LLMs) have shown remarkable capability in a variety of Natural Language Processing (NLP) tasks and have attracted considerable attention recently. However, some studies indicate that large language models fail to outperform state-of-the-art models in English grammatical error correction (GEC) tasks. In this report, we explore how large language models perform on Chinese grammatical error correction tasks and provide guidance for future work. We conduct experiments with 3 LLMs of different model scales on 4 Chinese GEC datasets. Our experimental results indicate that the performance of LLMs on automatic evaluation metrics falls short of previous state-of-the-art models because of over-correction. Furthermore, we also observe notable variations in the performance of LLMs when evaluated on different data distributions. Our findings demonstrate that further investigation is required for the application of LLMs to the Chinese GEC task.

What problem does this paper attempt to address?

This paper explores the performance of large-scale language models (LLMs) on the Chinese grammatical error correction (Chinese GEC) task and aims to provide guidance for future research. Specifically, the authors experimentally analyzed LLMs of different scales on four Chinese GEC datasets, evaluated them on automatic evaluation metrics (such as the F0.5 score), and identified the following main issues:

1. **Over-correction problem**: LLMs underperform existing state-of-the-art models on the Chinese GEC task. The main reason is that LLMs tend to make unnecessary modifications to make input sentences more fluent, which leads to over-correction and sometimes even changes the original meaning of the input.
2. **Influence of data distribution**: LLM performance differs significantly across data distributions. For example, LLMs perform significantly better on datasets from Chinese learners than on datasets from native-speaker exams. This is because the grammatical errors of Chinese learners are mainly concentrated on the misuse of similar words or phrases, while the errors in the native-speaker exam datasets are more complex and involve more structural errors.
3. **Influence of model scale**: Performance also differs across LLMs of different scales. The experimental results show that ChatGPT achieves a significant improvement in recall, while its precision is comparable to that of the smaller models. This indicates that models of different scales differ substantially in their error-detection capabilities.

In summary, the questions this paper attempts to answer are: What is the actual performance of large-scale language models on the Chinese grammatical error correction task? Are there specific problems that need further research and improvement?
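To make the link between over-correction and the F0.5 score concrete, the following is a minimal sketch of how an edit-level F0.5 can be computed from true-positive, false-positive, and false-negative edit counts. The function name `f_beta` and all counts below are hypothetical illustrations, not values or tooling from the paper. Because beta = 0.5 weights precision more heavily than recall, a system that proposes many unnecessary edits can score lower even when it catches more real errors.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Edit-level F-beta score; beta < 1 weights precision above recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts (assumed for illustration only, 100 gold edits each).
# A conservative system: few edits proposed, most of them correct.
conservative = f_beta(tp=40, fp=10, fn=60)      # P = 0.80, R = 0.40
# An over-correcting system: more errors found, but many unnecessary edits.
over_correcting = f_beta(tp=60, fp=90, fn=40)   # P = 0.40, R = 0.60

print(f"conservative    F0.5 = {conservative:.4f}")
print(f"over-correcting F0.5 = {over_correcting:.4f}")
```

Under these assumed counts, the over-correcting system has higher recall but a lower F0.5, which matches the pattern the summary describes: gains in recall do not compensate for the precision lost to unnecessary modifications.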

Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task

On the (In)Effectiveness of Large Language Models for Chinese Text Correction

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Evaluating LLMs' grammatical error correction performance in learner Chinese

Evaluating Performance of LLaMA2 Large Language Model Enhanced by QLoRA Fine-Tuning for English Grammatical Error Correction

Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction

LLMCL-GEC: Advancing Grammatical Error Correction with LLM-Driven Curriculum Learning

GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

DSGram: Dynamic Weighting Sub-Metrics for Grammatical Error Correction in the Era of Large Language Models

A Chinese Grammatical Error Correction Model Based On Grammatical Generalization And Parameter Sharing

FCGEC: Fine-Grained Corpus for Chinese Grammatical Error Correction

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

CMMLU: Measuring massive multitask language understanding in Chinese

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

A Survey on Evaluation of Large Language Models

Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Are Large Language Models Good Fact Checkers: A Preliminary Study

ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction