Character Segmentation for Classical Mongolian Words in Historical Documents.

Xiangdong Su, Guanglai Gao, Weihua Wang, Feilong Bao, Hongxi Wei
DOI: https://doi.org/10.1007/978-3-662-45643-9_49
2014-01-01
Abstract: Many classical Mongolian historical documents are preserved in image form, which makes it inconvenient to search them and mine the desired content. To facilitate word recognition in the document digitization procedure, this paper proposes a novel approach to segmenting historical words in which the characters are intrinsically connected and exhibit remarkable overlap and variation. The approach consists of three steps: (1) significant contour point (SCP) detection on the approximated polygon of the word's external contour, (2) baseline location based on a logistic regression model, and (3) segmentation path generation and validation based on heuristic rules and a neural network. The SCPs help in both baseline location and segmentation path generation. Experiments on the historical Mongolian Kanjur demonstrate that the approach can effectively locate the words' baselines and segment the words into characters.
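Step (1) can be illustrated with a minimal sketch. The abstract does not say which polygon approximation or which significance criterion the authors use, so everything below is an assumption made for illustration: the Ramer-Douglas-Peucker algorithm stands in for the approximation, a turning-angle threshold stands in for the SCP test, and the names approximate_polyline, candidate_scps, epsilon and min_angle_deg are hypothetical.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_polyline(points, epsilon):
    """Ramer-Douglas-Peucker simplification of an open polyline.

    Assumption: the paper's polygon approximation may differ.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= epsilon:
        return [first, last]
    # Keep the farthest point and simplify both halves recursively.
    left = approximate_polyline(points[:index + 1], epsilon)
    right = approximate_polyline(points[index:], epsilon)
    return left[:-1] + right


def turning_angle(prev, cur, nxt):
    """Absolute change of direction, in degrees, at vertex cur."""
    a1 = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
    a2 = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
    diff = abs(a2 - a1)
    if diff > math.pi:
        diff = 2 * math.pi - diff
    return math.degrees(diff)


def candidate_scps(polygon, min_angle_deg=30.0):
    """Vertices of a closed polygon that turn by at least min_angle_deg.

    Assumption: a turning-angle threshold is only a stand-in for the
    paper's own definition of a significant contour point.
    """
    n = len(polygon)
    return [
        polygon[i]
        for i in range(n)
        if turning_angle(polygon[i - 1], polygon[i], polygon[(i + 1) % n])
        >= min_angle_deg
    ]
```

In this sketch a contour traced from a binarized word image would first be passed through approximate_polyline to suppress pixel-level jitter, and candidate_scps would then keep only the vertices where the outline changes direction sharply. Both thresholds would need tuning on real data.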