Abstract: Large Language Models for Code (LLMs4Code) perform strongly in the software engineering domain, particularly on coding tasks. However, even the most advanced LLMs4Code inevitably contain incorrect or outdated code knowledge. Because training LLMs4Code is costly, re-training a model to fix such problematic knowledge is impractical. Model editing is an emerging field that aims to correct erroneous knowledge in LLMs effectively and efficiently, and various model editing techniques and benchmarks have been proposed recently. Nevertheless, no comprehensive study has compared and analyzed how state-of-the-art model editing techniques perform when adapting the knowledge within LLMs4Code across code-related tasks. To bridge this gap, we perform the first systematic study on applying state-of-the-art model editing approaches to repair inaccuracies in LLMs4Code. To that end, we introduce a benchmark named CLMEEval, which consists of two datasets: CoNaLa-Edit (CNLE), with 21K+ code generation samples, and CodeSearchNet-Edit (CSNE), with 16K+ code summarization samples. Using CLMEEval, we evaluate six advanced model editing techniques on three LLMs4Code: CodeLlama (7B), CodeQwen1.5 (7B), and Stable-Code (3B). We find that the external memorization-based GRACE approach achieves the best knowledge editing effectiveness and specificity (i.e., the edit does not influence untargeted knowledge), while generalization (i.e., whether the edit carries over to other semantically identical inputs) is a universal challenge for existing techniques. Furthermore, building on an in-depth case analysis, we introduce an enhanced version of GRACE called A-GRACE, which incorporates contrastive learning to better capture the semantics of the inputs.
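The three evaluation dimensions named in the abstract (effectiveness, generalization, specificity) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: exact-match scoring, the function names, and the toy prompts are hypothetical and are not taken from CLMEEval, and the lookup-table "edited model" is only a crude stand-in for an external-memory editor, not an implementation of GRACE or A-GRACE.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, generated text out


def exact_match_rate(model: Model, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (prompt, target) pairs for which the model output equals the target."""
    if not pairs:
        return 0.0
    return sum(model(prompt) == target for prompt, target in pairs) / len(pairs)


def evaluate_edit(
    edited: Model,
    original: Model,
    edit_pairs: List[Tuple[str, str]],
    paraphrase_pairs: List[Tuple[str, str]],
    unrelated_prompts: List[str],
) -> Dict[str, float]:
    """Score an edit along the three dimensions described in the abstract.

    Assumption: exact match is used as the scoring rule purely for simplicity.
    """
    # Specificity: outputs on untargeted prompts should be unchanged by the edit.
    unchanged = sum(edited(p) == original(p) for p in unrelated_prompts)
    return {
        # Effectiveness: the edited model produces the new target on the edited input.
        "effectiveness": exact_match_rate(edited, edit_pairs),
        # Generalization: the edit also holds on semantically identical inputs.
        "generalization": exact_match_rate(edited, paraphrase_pairs),
        "specificity": unchanged / len(unrelated_prompts) if unrelated_prompts else 0.0,
    }


def original_model(prompt: str) -> str:
    """Toy 'pre-edit' model with an outdated answer for directory-related prompts."""
    return "os.getcwd()" if "directory" in prompt else "len(x)"


def make_edited_model(base: Model, memory: Dict[str, str]) -> Model:
    """Toy external memory: return the stored answer only on an exact prompt match."""
    def edited(prompt: str) -> str:
        return memory.get(prompt, base(prompt))
    return edited


edit_pairs = [("list files in a directory", "os.listdir(path)")]
paraphrase_pairs = [("show the files inside a directory", "os.listdir(path)")]
unrelated_prompts = ["get the length of a list x"]

edited_model = make_edited_model(original_model, dict(edit_pairs))
scores = evaluate_edit(
    edited_model, original_model, edit_pairs, paraphrase_pairs, unrelated_prompts
)
print(scores)
```

In this toy setup the exact-match memory scores 1.0 on effectiveness and specificity but 0.0 on generalization, because the paraphrased prompt never hits the stored key. That mirrors, in a very simplified way, why generalization depends on how well the editor captures the semantics of its inputs.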

Evaluating Performance of LLaMA2 Large Language Model Enhanced by QLoRA Fine-Tuning for English Grammatical Error Correction

Large Language Models Are State-of-the-Art Evaluator for Grammatical Error Correction

Evaluating the Capability of Large-scale Language Models on Chinese Grammatical Error Correction Task

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Evaluating LLMs' grammatical error correction performance in learner Chinese

On the (In)Effectiveness of Large Language Models for Chinese Text Correction

Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems

Prompting open-source and commercial language models for grammatical error correction of English learner text

GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

Optimizing and Fine-tuning Large Language Model for Urban Renewal

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

LLaMA Beyond English: An Empirical Study on Language Capability Transfer

Investigating Automatic Scoring and Feedback using Large Language Models

Model Editing for LLMs4Code: How Far are We?

Evaluating LLMs at Detecting Errors in LLM Responses

Leveraging Denoised Abstract Meaning Representation for Grammatical Error Correction

A Simple Recipe for Multilingual Grammatical Error Correction

Guiding Large Language Models to Post-Edit Machine Translation with Error Annotations

ChatGPT for Arabic Grammatical Error Correction

Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT

Evaluating the Performance of Large Language Models on GAOKAO Benchmark