Visual and Phonological Feature Enhanced Siamese BERT for Chinese Spelling Error Correction

Yujia Liu,Hongliang Guo,Shuai Wang,Tiejun Wang
DOI: https://doi.org/10.3390/app12094578
2022-01-01
Applied Sciences
Abstract:Chinese Spelling Check (CSC) aims to detect and correct spelling errors in Chinese. Most CSC models rely on human-defined confusion sets to narrow the search space, failing to resolve errors outside the confusion set. However, most spelling errors in current benchmark datasets are character pairs in similar pronunciations. Errors in similar shapes and errors which are visually and phonologically irrelevant are not considered. Furthermore, widely-used automatically generated training data in CSC tasks leads to label leakage and unfair comparison between different methods. In this work, we propose a feature (visual and phonological) enhanced siamese BERT to (1) correct spelling errors without using confusion sets; (2) integrate phonological and visual features for CSC by a glyph graph; (3) improve performance for unseen spelling errors. To evaluate CSC methods fairly and comprehensively, we build a large-scale CSC dataset in which the number of samples in different error types is the same. The experimental results show that the proposed approach achieves better performance compared with previous state-of-the-art methods on three benchmark datasets and the new error-type balanced dataset.
What problem does this paper attempt to address?