An Alignment-Agnostic Model for Chinese Text Error Correction

Liying Zheng, Yue Deng, Weishun Song, Liang Xu, Jing Xiao
DOI: https://doi.org/10.48550/arXiv.2104.07190
2021-09-18
Abstract: This paper investigates how to correct Chinese text errors involving mistaken, missing, and redundant characters, which are common among native Chinese speakers. Most existing models based on the detect-correct framework can correct mistaken-character errors, but they cannot deal with missing or redundant characters. The reason is that the sentence lengths before and after correction differ, leading to an inconsistency between model inputs and outputs. Although Seq2Seq-based and sequence tagging methods provide solutions to this problem and have achieved relatively good results in the English context, they do not perform well in the Chinese context according to our experimental results. In our work, we propose a novel detect-correct framework that is alignment-agnostic, meaning that it can handle both text-aligned and non-aligned cases, and it can also serve as a cold-start model when no annotated data are provided. Experimental results on three datasets demonstrate that our method is effective and achieves the best performance among existing published models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses error correction in Chinese text, focusing on three error types that native Chinese speakers commonly make when writing: mistaken characters, missing characters, and redundant characters. Existing models based on the detect-correct framework handle mistaken characters relatively well, but they cannot handle missing or redundant characters. These two error types change the sentence length, so the model's input and output no longer line up position by position. In addition, although sequence-to-sequence (Seq2Seq) and sequence tagging methods perform relatively well on these error types in the English context, the authors' experiments show that they do not perform well in the Chinese context. For this reason, the authors propose a new alignment-agnostic detect-correct framework. This framework can handle both text-aligned and non-aligned cases, and it can also serve as a cold-start model when no annotated data are available. Experimental results on three datasets show that the method is effective and achieves the best performance among existing published models.
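The distinction between the three error types, and why two of them break position-by-position alignment, can be illustrated with a small sketch. This is not the authors' model: it only diffs an erroneous sentence against its correction using Python's standard `difflib`, and the function name `classify_edits` and the example sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Map difflib edit tags onto the three error types named in the paper.
# This mapping is an illustrative assumption, not the paper's method.
ERROR_TYPES = {
    "replace": "mistaken",   # wrong character(s) in the source
    "insert": "missing",     # character(s) absent from the source
    "delete": "redundant",   # extra character(s) in the source
}


def classify_edits(source: str, target: str) -> list[tuple[str, str, str]]:
    """Return (error_type, source_span, target_span) for each differing span."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Note: a "replace" span may cover unequal lengths in general.
        edits.append((ERROR_TYPES[tag], source[i1:i2], target[j1:j2]))
    return edits


if __name__ == "__main__":
    pairs = [
        ("我今天很高心", "我今天很高兴"),  # mistaken character, same length
        ("我喜欢吃苹", "我喜欢吃苹果"),    # missing character, target is longer
        ("我我喜欢", "我喜欢"),            # redundant character, target is shorter
    ]
    for source, target in pairs:
        aligned = len(source) == len(target)
        print(source, "->", target, "aligned:", aligned,
              classify_edits(source, target))
```

Only the first pair keeps the same length before and after correction. The other two are the non-aligned cases that a fixed-length detect-correct model cannot express directly.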