Syllable Analysis Data Augmentation for Khmer Ancient Palm Leaf Recognition

Nimol Thuon, Jun Du, Jianshu Zhang
DOI: https://doi.org/10.23919/apsipaasc55919.2022.9980217
2022-01-01
Abstract: Khmer palm leaf manuscripts, with their unique forms and physical conditions, are receiving growing attention from text recognition researchers. State-of-the-art systems commonly rely on data augmentation for training; however, grammatical mistakes in the training data and limited data availability constrain the achievable accuracy. Two significant challenges are (1) grammatical complexity and (2) similarity between words. To address them, this paper presents the Syllable Analysis Data Augmentation (SADA) technique, which aims to boost the accuracy of text recognition for these historical Southeast Asian manuscripts from Cambodia. SADA comprises two fundamental modules: (1) formulating a collection of syllables/words to structure glyph patterns and (2) generating new patterns from existing data through augmentation techniques, using flexible geometric image transformations to increase the number of similar word/text images. First, image collections are established, and the datasets are interpreted according to reordered grammatical structures to construct multiple glyph images. Next, we run experiments with a text/word recognition system and then adjust an attention-based encoder-decoder to improve the transcription probability for both low- and high-resolution images. Finally, the experiments use datasets from several sources, including public datasets from the ICFHR 2018 contest and our new augmented datasets, to demonstrate and evaluate the accuracy of the approach.
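The two modules described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`compose_syllable`, `shear_horizontal`, `augment`), the list-of-rows image representation, and the choice of a horizontal shear as the geometric transformation are all assumptions made for illustration. Glyphs are simply concatenated left to right, which ignores the fact that Khmer glyphs can sit above, below, or before the base consonant.

```python
import random

# Assumption: a glyph image is a list of rows, each a list of grayscale
# values (0 = ink, 255 = background). Real data would be pixel arrays.

def compose_syllable(glyphs, background=255):
    """Module 1 (sketch): join glyph images left to right into one image.

    Shorter glyphs are padded at the bottom with background pixels.
    """
    height = max(len(g) for g in glyphs)
    rows = [[] for _ in range(height)]
    for g in glyphs:
        width = len(g[0])
        for y in range(height):
            rows[y].extend(g[y] if y < len(g) else [background] * width)
    return rows

def shear_horizontal(image, shear, background=255):
    """Module 2 (sketch): shift row y sideways by round(shear * y) pixels.

    The canvas is widened so that no ink pixel is cropped.
    """
    offsets = [round(shear * y) for y in range(len(image))]
    min_off = min(offsets)
    extra = max(offsets) - min_off
    out = []
    for y, row in enumerate(image):
        left = offsets[y] - min_off
        out.append([background] * left + row + [background] * (extra - left))
    return out

def augment(glyphs, n, max_shear=0.3, seed=0):
    """Generate n sheared variants of the composed syllable image."""
    rng = random.Random(seed)  # seeded for reproducibility
    base = compose_syllable(glyphs)
    return [shear_horizontal(base, rng.uniform(-max_shear, max_shear))
            for _ in range(n)]
```

Calling `augment([glyph_a, glyph_b], n=10)` would return ten variants of the same syllable image, each with the same ink pixels but a different slant. A fuller pipeline would presumably add rotation, scaling, and noise, and would place glyphs according to the script's layout rules.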