Abstract: Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents remain to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which make Mongolian Optical Character Recognition challenging. Because the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another character, in this work we define a novel character set for recognition using segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of different printed font types that can be categorized into two major groups, namely, the standard and handwritten-style groups. The segmentation parameters are adjusted for each group. Additionally, script identification and the relevant character recognition kernels are integrated for the recognition of Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on test samples from real documents with multiple font types and mixed scripts. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historical Mongolian documents.
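The abstract names projection profiles as one cue for locating segmentation points but gives no algorithmic detail. The following is only a minimal sketch of the general idea, not the authors' method. It assumes a binarized image of one vertically written word, stored as rows of 0/1 pixels, and it assumes that rows whose ink count does not exceed the stem width are candidate cut positions. The names `row_profile`, `candidate_cuts`, `stem_width`, and the toy image `WORD` are hypothetical and do not come from the paper.

```python
from typing import List


def row_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def candidate_cuts(profile: List[int], stem_width: int) -> List[int]:
    """Return the middle row of each run of 'thin' rows.

    A row is thin when it contains ink but no more than stem_width pixels,
    i.e. (by assumption) only the connecting stem crosses that row.
    """
    cuts = []
    run_start = None
    # A trailing 0 acts as a sentinel that closes a run ending at the last row.
    for y, count in enumerate(profile + [0]):
        thin = 0 < count <= stem_width
        if thin and run_start is None:
            run_start = y
        elif not thin and run_start is not None:
            cuts.append((run_start + y - 1) // 2)
            run_start = None
    return cuts


# Toy word: three wider glyph bodies joined by a one-pixel stem in column 2.
WORD = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
]

if __name__ == "__main__":
    print(candidate_cuts(row_profile(WORD), stem_width=1))
```

In this sketch a per-font-group adjustment, as the abstract describes for the standard and handwritten-style groups, would amount to choosing a different `stem_width` for each group; how the actual system sets its parameters is not stated in the abstract.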

Multi-font Printed Mongolian Document Recognition System
