Building a Japanese Typo Dataset and Typo Correction System Based on Wikipedia’s Revision History

Yu Tanaka, Yugo Murawaki, Daisuke Kawahara, Sadao Kurohashi
DOI: https://doi.org/10.5715/jnlp.28.995
2021-01-01
Journal of Natural Language Processing
Abstract: Correcting typographical errors (typos) is important to mitigate errors in downstream natural language processing tasks. Although a large number of typo–correction pairs are required to develop typo correction systems, no such dataset is available for the Japanese language. Previous studies on building French and English typo datasets have exploited Wikipedia. To collect typos, these methods apply a spell checker to words changed during revisions. Because the lack of word delimiters in Japanese hinders the application of a spell checker, these methods cannot be applied directly to Japanese. In this study, we build a Japanese typo dataset from Wikipedia’s revision history. We address the word-delimiter problem by combining character-based extraction rules with various filtering methods. We evaluate our construction method, with which we obtain over 700K typo–correction sentence pairs. Using the new dataset, we also build typo correction systems with a sequence-to-sequence pretrained model. As an auxiliary task for fine-tuning, we train the model to predict the readings of kanji, which leads to higher accuracy in correcting erroneous kanji conversions. We also investigate the effect of pseudo training data. Finally, we demonstrate that our system achieves higher accuracy on the typo recognition task than other proofreading systems.
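The abstract does not spell out the character-based extraction rules or the filters, so the following is only a minimal sketch of the general idea, assuming a sentence pair taken from two consecutive revisions. It aligns the two sentences character by character (no word segmentation is needed) and keeps the pair as a typo candidate only when the revision amounts to a single short edit. The function names `char_edits` and `is_typo_candidate` and the `max_chars` threshold are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher


def char_edits(before: str, after: str):
    """Return (op, old, new) for every character span that differs.

    The alignment is purely character-based, so it works on text
    without word delimiters such as Japanese.
    """
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    return [
        (op, before[i1:i2], after[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


def is_typo_candidate(before: str, after: str, max_chars: int = 3) -> bool:
    """Hypothetical filter: accept only one short, local edit.

    Assumption for illustration: a revision that touches several spans,
    or rewrites a long span, is more likely a content change than a typo fix.
    """
    edits = char_edits(before, after)
    if len(edits) != 1:
        return False
    _, old, new = edits[0]
    return max(len(old), len(new)) <= max_chars


if __name__ == "__main__":
    # Erroneous kanji conversion: 言い corrected to 良い.
    print(char_edits("今日は天気が言いです。", "今日は天気が良いです。"))
    print(is_typo_candidate("今日は天気が言いです。", "今日は天気が良いです。"))
```

A real pipeline would need further filtering on top of this, since a single short edit can also be a legitimate change in meaning; the sketch only illustrates why character-level alignment sidesteps the missing word delimiters.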