A Hybrid Approach Towards Chinese Spelling and Splitting Error Correction

Junhong Liang, Junnan Zhu, Feifei Zhai, Nanchang Cheng, Chengqing Zong, Yu Zhou
DOI: https://doi.org/10.3233/faia240952
2024-01-01
Abstract: Existing Chinese spelling check (CSC) methods require the input and output to be the same length, which limits their ability to correct variable-length character errors. They mainly focus on modelling Chinese characters’ phonetic information and generating candidates for each position. In contrast, few approaches delve into the intricacies of splitting Chinese characters to address glyph errors and variable-length splitting corrections. To address this issue, we define the Chinese Splitting Error Correction (CSEC) task and develop CSEC datasets in the news and social media domains. We then propose the Soft-Masked Multi-feature Error Correction (SoMu) model, which first generates semantic, phonetic, graphic, and unique Chinese Wubi embeddings, then integrates those features through selective gating fusion, applies a soft-mask strategy to filter incorrect tokens, and finally uses transformer layers to predict the correct ones. This model effectively addresses both spelling and splitting errors. Extensive analysis shows that our model significantly improves character-splitting information modelling for CSEC. Our dataset is available at https://github.com/Skywalker-Harrison/SoMu.
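The fusion and soft-mask steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact form of the selective gate, so a softmax over per-feature gate scores is assumed here, and the soft-mask step is assumed to follow the common formulation in which a token embedding is interpolated with a mask embedding according to a detected error probability. The function names, toy vectors, and gate scores are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_scores):
    """Fuse per-character feature vectors with gate weights.

    Assumption: the gate is a softmax over one score per feature type
    (semantic, phonetic, graphic, Wubi); the paper's gate may differ.
    """
    weights = softmax(gate_scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def soft_mask(embedding, mask_embedding, error_prob):
    """Interpolate between a token embedding and the mask embedding.

    Assumption: e' = p * e_mask + (1 - p) * e, where p is the detected
    probability that the token is erroneous.
    """
    return [error_prob * m + (1.0 - error_prob) * e
            for e, m in zip(embedding, mask_embedding)]


# Toy 2-dimensional embeddings for one character (hypothetical values).
semantic, phonetic = [1.0, 0.0], [0.0, 1.0]
graphic, wubi = [1.0, 1.0], [0.0, 0.0]

# Equal gate scores give each feature the same weight.
fused = gated_fusion([semantic, phonetic, graphic, wubi], [0.0, 0.0, 0.0, 0.0])

# A token judged 50% likely to be wrong is pulled halfway towards the mask.
masked = soft_mask(fused, [0.0, 0.0], 0.5)
```

In a full model the soft-masked embeddings would then be passed to transformer layers that predict the corrected characters; that stage is omitted here.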