Abstract: Although this is rarely stated, Grammatical Error Correction (GEC) in practice encompasses various models with distinct objectives, ranging from grammatical error detection to improving fluency. Traditional evaluation methods fail to capture the full range of system capabilities and objectives. Reference-based evaluations struggle to capture the wide variety of possible corrections, are affected by biases introduced during reference creation, and are prone to favoring fixes of local errors over overall text improvement. The emergence of large language models (LLMs) has further highlighted the shortcomings of these evaluation strategies, emphasizing the need for a paradigm shift in evaluation methodology. In the current study, we perform a comprehensive evaluation of various GEC systems using a recently published dataset of Swedish learner texts. The evaluation is performed using established evaluation metrics as well as human judges. We find that GPT-3 in a few-shot setting by far outperforms previous grammatical error correction systems for Swedish, a language comprising only 0.11% of its training data. We also find that current evaluation methods contain undesirable biases that a human evaluation is able to reveal. We suggest using human post-editing of GEC system outputs to analyze the amount of change required to reach native-level human performance on the task, and provide a dataset annotated with human post-edits and assessments of grammaticality, fluency and meaning preservation of GEC system outputs.

Do Grammatical Error Correction Models Realize Grammatical Generalization?

The Unbearable Weight of Generating Artificial Errors for Grammatical Error Correction

Comparison of Grammatical Error Correction Using Back-Translation Models

Grammatical Error Correction: A Survey of the State of the Art

Evaluation of large-scale synthetic data for Grammar Error Correction

Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule

Interpretability for Language Learners Using Example-Based Grammatical Error Correction

Adversarial Grammatical Error Correction

Is ChatGPT a Highly Fluent Grammatical Error Correction System? A Comprehensive Evaluation

Leveraging Denoised Abstract Meaning Representation for Grammatical Error Correction

Improving Grammatical Error Correction Models with Purpose-Built Adversarial Examples

Judge a Sentence by Its Content to Generate Grammatical Errors

Evaluation of really good grammatical error correction

Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection

A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction

Grammatical Error Correction as GAN-like Sequence Labeling

A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model

A Chinese Grammatical Error Correction Model Based On Grammatical Generalization And Parameter Sharing

Grammatical Error Correction via Mixed-Grained Weighted Training

Detection-Correction Structure via General Language Model for Grammatical Error Correction

A Simple but Effective Classification Model for Grammatical Error Correction