Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation

Soyoung Yoon, Sungjoon Park, Gyuwan Kim, Junhee Cho, Kihyo Park, Minjoon Seo, Alice Oh, Gyu Tae Kim
DOI: https://doi.org/10.48550/arXiv.2210.14389
2022-10-25
Computation and Language
Abstract: Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English and Chinese. We attribute this to the lack of a carefully designed evaluation benchmark for Korean. In this work, we therefore first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) to cover a wide range of error types, and annotate them using our newly proposed tool, the Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is a carefully designed edit alignment and classification tool that takes the nature of Korean into account when generating an alignment between a source sentence and a target sentence, and that identifies the error type of each aligned edit. We also present baseline models fine-tuned on our datasets. We show that a model trained with our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.
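To make the idea of edit alignment and classification concrete, the following is a minimal sketch, not the KAGAS implementation. It assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` in place of a Korean-aware aligner; the function names `align_edits` and `classify_edit` and the four labels are hypothetical and do not reproduce the paper's error taxonomy.

```python
from difflib import SequenceMatcher


def align_edits(source: str, target: str):
    """Align two sentences token by token and return the differing spans.

    Assumption: tokens are whitespace-separated. A Korean-aware aligner
    would work on morphemes; this sketch does not.
    """
    src = source.split()
    tgt = target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged spans are not edits
        edits.append((op, src[i1:i2], tgt[j1:j2]))
    return edits


def classify_edit(op, src_tokens, tgt_tokens):
    """Assign a toy label to one aligned edit (hypothetical labels)."""
    if op == "insert":
        return "INSERTION"
    if op == "delete":
        return "DELETION"
    # Same characters once spaces are ignored: only the spacing changed.
    if "".join(src_tokens) == "".join(tgt_tokens):
        return "WORD_SPACING"
    return "REPLACEMENT"


if __name__ == "__main__":
    for edit in align_edits("나는 학교 에 갔다", "나는 학교에 갔다"):
        print(edit, classify_edit(*edit))
```

In this example the only edit merges the two source tokens `학교` and `에` into `학교에`, so the toy classifier labels it as a spacing change. A real system would need morphological analysis to separate, for instance, particle errors from spelling errors inside a single word.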