AMRE: an Attention-Based CRNN for Manchu Word Recognition on a Woodblock-Printed Dataset
Zhiwei Wang, Siyang Lu, Mingquan Wang, Xiang Wei, Yingjun Qi
DOI: https://doi.org/10.1007/978-3-031-30108-7_23
2023-01-01
Abstract: Recognizing characters of ancient minority languages is challenging because documentation is limited, yet it is critical for understanding history and conducting social science research. Manchu, an important minority language, faces similar challenges due to the lack of systematic document studies. Recent research has approached this problem in different ways, such as document digitization and character image segmentation. However, some limitations remain. On the one hand, existing digitized Manchu documents are based on the machine-printed style, which is uncommon in real historical documents and can cause severe recognition bias. On the other hand, most Manchu identification methods rely on coarse image segmentation, which may cause recognition errors because it is difficult to cut words consistently and accurately. To tackle these two challenges, we propose a segmentation-free method for Manchu recognition together with a medium-scale dataset of Woodblock-printed Manchu Words (WMW). We first develop WMW from woodblock-printed Manchu words, which are more common in ancient documents. With this dataset, we conduct document mining and build a framework, named AMRE, based on an attention-based convolutional recurrent neural network. AMRE applies an attention mechanism that performs a weighted aggregation of the convolution results from kernels of different sizes, and thereby more effectively mines the valid information of morphed words during recognition. With AMRE, the digitized characters can be recognized more accurately. Experimental results show that the word recognition accuracy of AMRE exceeds the baseline by more than 5%.
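The idea of attention as a weighted aggregation over convolution results from differently sized kernels can be illustrated with a small sketch. This is not the authors' AMRE implementation: it is a one-dimensional, pure-Python toy in which the function names (`conv1d_same`, `attention_aggregate`), the use of global average pooling as the branch descriptor, and the per-branch `score_weights` are all assumptions made for illustration.

```python
import math


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation; output has the same length as the
    input when the kernel size is odd."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(signal, kernels, score_weights):
    """Fuse multi-scale convolution branches with attention weights.

    Assumption: each branch is summarized by its mean activation, scaled by a
    hypothetical per-branch parameter, and the softmax of those scores gives
    the mixing weights. A trained model would learn such parameters.
    """
    # One branch per kernel size (e.g. sizes 1, 3, 5).
    branches = [conv1d_same(signal, k) for k in kernels]
    # Global average pooling gives one descriptor per branch.
    descriptors = [sum(b) / len(b) for b in branches]
    scores = [w * d for w, d in zip(score_weights, descriptors)]
    weights = softmax(scores)
    # Weighted sum of the branches, position by position.
    fused = [
        sum(w * b[i] for w, b in zip(weights, branches))
        for i in range(len(signal))
    ]
    return fused, weights


if __name__ == "__main__":
    kernels = [[1.0], [0.25, 0.5, 0.25], [0.2] * 5]  # kernel sizes 1, 3, 5
    fused, weights = attention_aggregate(
        [1.0, 2.0, 3.0, 4.0], kernels, [1.0, 1.0, 1.0]
    )
    print("attention weights:", weights)
    print("fused features:", fused)
```

In a full recognizer, the same fusion would presumably act on two-dimensional feature maps inside the convolutional stage, before the recurrent layers; that placement is an assumption here, as the abstract does not specify it.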