Sub-Word Based Mongolian Offline Handwriting Recognition

Daoerji Fan, Guanglai Gao, Huijuan Wu
DOI: https://doi.org/10.1109/ICDAR.2019.00048
2019-01-01
Abstract: Mongolian is an agglutinative language, which results in a large number of words derived from the same stems combined with different suffixes. This morphological richness leads to high out-of-vocabulary (OOV) rates and causes data sparsity problems. The recognition system proposed in this paper is composed of three parts: handwritten image preprocessing, mapping of images to grapheme sequences, and sub-word-based language model (LM) decoding. We present a sub-word-based n-gram LM to solve the high OOV rate problem. According to the characteristics of Mongolian, we modified the traditional token passing algorithm to improve decoding speed and to make it easy to combine with any n-gram LM. We evaluated the performance of sub-words at different levels on the open Mongolian offline handwriting dataset (MHW). The bi-syllable 2-gram LM showed the best performance, with word error rates (WERs) of 18.32% and 23.22% on the two test sets. Our experiments show that this method can predict in-vocabulary words with a higher accuracy rate and can also predict OOV words with a certain accuracy rate.
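To illustrate why a sub-word n-gram LM helps with OOV words, the following is a minimal sketch of a 2-gram model over sub-word units. It is not the authors' implementation: the romanized units, the toy training list, the `<w>`/`</w>` boundary markers, and the add-one smoothing are all assumptions made for illustration, and the paper's bi-syllable segmentation and modified token passing decoder are not reproduced here.

```python
import math
from collections import Counter

# Hypothetical word-boundary markers (not taken from the paper).
BOS, EOS = "<w>", "</w>"


def train_bigram(segmented_words):
    """Collect context and bigram counts over sub-word units."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {EOS}
    for units in segmented_words:
        seq = [BOS] + list(units) + [EOS]
        vocab.update(units)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(units, model):
    """Add-one smoothed 2-gram log-probability of one segmented word."""
    context_counts, bigram_counts, vocab = model
    seq = [BOS] + list(units) + [EOS]
    v = len(vocab)
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + v
        total += math.log(numerator / denominator)
    return total


# Toy training "words", already split into made-up sub-word units.
training = [("ger", "t"), ("ger", "ees"), ("nom", "t")]
model = train_bigram(training)

seen_word = ("nom", "t")     # appears in the training list
oov_word = ("nom", "ees")    # unseen word built from seen units
scrambled = ("ees", "nom")   # same units in an implausible order

for word in (seen_word, oov_word, scrambled):
    print(word, round(log_prob(word, model), 3))
```

In this toy setup the unseen word built from known units scores below the seen word but above the scrambled sequence, which is the behavior a word-level vocabulary cannot provide, since it has no entry for the unseen word at all.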