Multi-font Printed Mongolian Document Recognition System
Liangrui Peng, Changsong Liu, Xiaoqing Ding, Jianming Jin, Youshou Wu, Hua Wang, Yanhua Bao
DOI: https://doi.org/10.1007/s10032-009-0106-8
2009-01-01
Abstract: Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which make Mongolian Optical Character Recognition challenging. Because the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of printed font types, which can be categorized into two major groups: standard and handwritten-style. The segmentation parameters are adjusted for each group. Additionally, script identification and the relevant character recognition kernels are integrated to recognize Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on test samples from real documents with multiple font types and mixed scripts. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
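The abstract does not spell out how segmentation points are derived from projection profiles, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a binarized image of a single vertically written word (1 = ink), and treats rows where only a thin stroke is present as candidate cut rows. The function names and the `max_ink` threshold are hypothetical; `max_ink` stands in for the kind of parameter that might be tuned separately for the standard and handwritten-style font groups. The connected-component analysis mentioned in the abstract is not modelled here.

```python
from typing import List, Sequence


def row_profile(image: Sequence[Sequence[int]]) -> List[int]:
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def candidate_cuts(profile: Sequence[int], max_ink: int) -> List[int]:
    """Return the middle row of every run of 'thin' rows.

    A row is thin when it has at least one and at most max_ink ink pixels.
    max_ink is a hypothetical, font-group-dependent threshold.
    """
    cuts: List[int] = []
    run_start = None
    # A sentinel value above the threshold closes a run that reaches the last row.
    for y, ink in enumerate(list(profile) + [max_ink + 1]):
        thin = 0 < ink <= max_ink
        if thin and run_start is None:
            run_start = y
        elif not thin and run_start is not None:
            cuts.append((run_start + y - 1) // 2)  # midpoint of the thin run
            run_start = None
    return cuts


# Toy word: three wide components joined by a one-pixel-wide vertical stroke.
TOY_WORD = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]

profile = row_profile(TOY_WORD)
cuts = candidate_cuts(profile, max_ink=1)
print(profile)  # [3, 3, 1, 1, 1, 3, 3, 1, 3]
print(cuts)     # [3, 7]
```

In this toy case the two cut rows split the word into three components, which a classifier could then recognize individually before a rule-based step recombines them into characters, as the abstract describes.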