Prompting open-source and commercial language models for grammatical error correction of English learner text

Christopher Davis, Andrew Caines, Øistein Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, Paula Buttery

2024-01-15

Abstract: Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce texts which are fluent and grammatical. In addition, it has been shown that we can elicit attempts at grammatical error correction (GEC) from LLMs when prompted with ungrammatical input sentences. We evaluate how well LLMs can perform at GEC by measuring their performance on established benchmark datasets. We go beyond previous studies, which only examined GPT* models on a selection of English GEC datasets, by evaluating seven open-source and three commercial LLMs on four established GEC benchmarks. We investigate model performance and report results against individual error types. Our results indicate that LLMs do not always outperform supervised English GEC models except in specific contexts -- namely commercial LLMs on benchmarks annotated with fluency corrections as opposed to minimal edits. We find that several open-source models outperform commercial ones on minimal edit benchmarks, and that in some settings zero-shot prompting is just as competitive as few-shot prompting.

Computation and Language

What problem does this paper attempt to address?

The paper evaluates how well large language models (LLMs) perform English grammatical error correction (GEC) and compares them with existing supervised GEC models. Specifically, the researchers focus on the following points:

1. **Scope of Evaluation**: Previous studies examined only GPT models on a selection of English GEC datasets; this study evaluates seven open-source and three commercial LLMs on four established GEC benchmarks.
2. **Evaluation Method**: The study prompts LLMs to make minimal-edit corrections under zero-shot and few-shot settings. Minimal-edit correction keeps the original expression and word choice and fixes only grammatical errors, rather than rewriting the text for fluency.
3. **Performance Comparison**: The researchers report results by individual error type and assess whether LLMs surpass supervised GEC models in specific contexts.
4. **Educational Applications**: The paper also explores the value of these models in education, particularly how they can assist second-language learners of English through instant feedback, automatic grading, and personalized learning.

In summary, the core question is how effective and applicable LLMs are for English grammatical error correction, especially in educational technology.
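
The difference between zero-shot and few-shot prompting described in point 2 can be illustrated with a minimal sketch. The instruction wording, the `Input:`/`Output:` layout, and the `build_prompt` helper below are all assumptions made for illustration; they are not the prompts used in the paper, and no model API is called.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text asking for minimal edits rather than fluency
# rewrites. The paper's actual prompts are not reproduced here.
INSTRUCTION = (
    "Correct the grammatical errors in the following sentence. "
    "Make only the minimal changes needed and keep the original wording "
    "wherever possible. Return only the corrected sentence."
)


def build_prompt(sentence: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a GEC prompt as plain text.

    With no examples this is a zero-shot prompt; with one or more
    (erroneous, corrected) pairs it becomes a few-shot prompt.
    """
    parts = [INSTRUCTION, ""]
    # Each demonstration shows the model an input and its minimal correction.
    for source, target in examples:
        parts += [f"Input: {source}", f"Output: {target}", ""]
    # The sentence to correct comes last, with the output left open.
    parts += [f"Input: {sentence}", "Output:"]
    return "\n".join(parts)


if __name__ == "__main__":
    # Zero-shot: instruction plus the sentence to correct.
    print(build_prompt("She go to school every day."))
    print("-" * 40)
    # Few-shot: the same prompt preceded by one illustrative demonstration.
    demos = [("I has a dog.", "I have a dog.")]
    print(build_prompt("She go to school every day.", demos))
```

The resulting string would be sent to whichever open-source or commercial model is under test. Keeping the demonstrations as a plain list of pairs makes it easy to vary the number of shots while holding the instruction fixed.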

