Abstract: Mongolian is one of the most common written languages in China, Mongolia, and Russia. Many printed Mongolian documents have yet to be digitized for digital library applications. The traditional Mongolian script has a unique vertical cursive writing style and multiple font variations, which makes Mongolian optical character recognition (OCR) challenging. Because the traditional Mongolian script has subcomponent characteristics, such that one character may be a constituent of another, in this work we define a novel character set for recognition based on segmented components. The components are combined into characters in a rule-based post-processing module. For overall character recognition, a method based on Visual Directional Features and multi-level classifiers is presented. For character segmentation, segmentation points are identified by analyzing the properties of projection profiles and connected components. Mongolian has dozens of printed font types, which can be categorized into two major groups: standard and handwritten-style. The segmentation parameters are adjusted for each group. Additionally, script identification and the relevant character recognition kernels are integrated to recognize Mongolian text mixed with Chinese and English. A novel multi-font printed Mongolian document recognition system based on the proposed methods is implemented. Experiments indicate a text recognition rate of 96.9% on test samples from real documents with multiple font types and mixed scripts. The proposed methods can also be applied to other scripts in the Mongolian script family, such as Todo and Sibe, with significant potential for extension to historic Mongolian documents.
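
The projection-profile idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's actual segmentation procedure, which also uses connected-component properties and font-group-specific parameters. The sketch assumes a binarized word image stored as a list of rows (1 = ink), with rows running along the vertical writing direction, and treats rows where only the connecting stroke is present as candidate cut positions. The function names and the `stem_width` threshold are hypothetical and introduced only for illustration.

```python
from typing import List


def row_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every row of a binary word image."""
    return [sum(row) for row in image]


def candidate_cuts(profile: List[int], stem_width: int) -> List[int]:
    """Return the middle row of each run of 'thin' rows.

    A row is thin when it contains ink but no more than `stem_width`
    pixels, i.e. (by assumption) only the connecting stroke is present.
    `stem_width` is a hypothetical tuning parameter; a real system would
    set it per font group and filter the candidates further.
    """
    cuts: List[int] = []
    run_start = None
    # A sentinel value above the threshold closes a run that reaches the last row.
    for y, count in enumerate(profile + [stem_width + 1]):
        thin = 0 < count <= stem_width
        if thin and run_start is None:
            run_start = y
        elif not thin and run_start is not None:
            cuts.append((run_start + y - 1) // 2)
            run_start = None
    return cuts


# Toy word image: two wide glyph parts joined by a one-pixel stem.
word = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]

print(candidate_cuts(row_profile(word), stem_width=1))
```

In this toy image only rows 2 to 4 are thin for `stem_width=1`, so the sketch proposes a single cut in the middle of that run. Raising the threshold yields more candidates, which is one way segmentation parameters could be tuned differently for standard and handwritten-style fonts.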

Multi-font Printed Mongolian Document Recognition System

Segmentation and Recognition for Historical Tibetan Document Images

A Character Recognition Scheme Based on Object Oriented Design for Tibetan Buddhist Texts.

Cross-Language Sensitive Words Distribution Map: A Novel Recognition-Based Document Understanding Method for Uighur and Tibetan

Word Level Script Recognition for Uighur Document Mixed with English Script.

Uyghur, Chinese and English Multilingual Document Recognition

Layout Analysis Algorithm Based on Probabilistic Graphical Model for Dunhuang Historical Documents

Chinese Documents Classification Based on N-Grams

Local Projection-Based Character Segmentation Method for Historical Chinese Documents.

Touching Character Segmentation Method For Chinese Historical Documents

Gaussian Process Style Transfer Mapping for Historical Chinese Character Recognition

Learn More Manchu Words with A New Visual-Language Framework

Multilingual document recognition research and its application in China

A Handwritten Character Extraction Algorithm for Multi-language Document Image

A Sequence Labeling Based Approach for Character Segmentation of Historical Documents

RNN Based Uyghur Text Line Recognition and Its Training Strategy

A human-inspired recognition system for premodern Japanese historical documents

Accurate Fine-grained Layout Analysis for the Historical Tibetan Document Based on the Instance Segmentation

A General Framework For Multi-Character Segmentation And Its Application In Recognizing Multilingual Asian Documents

Design And Development Of An Ancient Chinese Document Recognition System