Cross-Language Sensitive Words Distribution Map: A Novel Recognition-Based Document Understanding Method for Uighur and Tibetan

Bing Su, Xiaoqing Ding, Liangrui Peng, Changsong Liu
DOI: https://doi.org/10.1109/icdar.2013.58
2013-01-01
Abstract: Cross-language document recognition and understanding are in urgent practical demand and have broad application prospects. In this paper, we propose a novel recognition-based method for understanding Uighur and Tibetan documents, termed the "cross-language sensitive words distribution map" (CSWDM). In our unified recognition-understanding framework, digital Uighur/Tibetan document images are first recognized using OCR technology. CSWDM then labels the Chinese information of sensitive words either on the recognized transcriptions or directly on the original digital images, so that the spatial location and occurrence frequency of these sensitive words are represented intuitively. With this information, readers can roughly understand the theme and meaning of cross-language documents.
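The recognize-then-label flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `OcrToken` type, the `build_distribution_map` function, the bounding-box layout, and the placeholder words and glosses are all hypothetical, and the sketch assumes an OCR stage has already produced word tokens with positions. It only shows how a sensitive-word lexicon could yield both the locations and the occurrence counts that a distribution map would display.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrToken:
    """One recognized word plus its position on the page (hypothetical type)."""
    text: str
    box: tuple  # assumed layout: (x, y, width, height) in pixels


def build_distribution_map(tokens, lexicon):
    """Return (hits, freq) for sensitive words found in the OCR output.

    hits: list of (box, source_word, chinese_gloss), in reading order
    freq: Counter mapping each Chinese gloss to its occurrence count
    """
    hits = [
        (tok.box, tok.text, lexicon[tok.text])
        for tok in tokens
        if tok.text in lexicon
    ]
    freq = Counter(gloss for _, _, gloss in hits)
    return hits, freq


# Placeholder data: stand-ins for recognized Uighur/Tibetan words and
# their Chinese glosses. None of these values come from the paper.
tokens = [
    OcrToken("word_a", (10, 10, 40, 12)),
    OcrToken("word_b", (60, 10, 40, 12)),
    OcrToken("word_a", (10, 30, 40, 12)),
]
lexicon = {"word_a": "gloss_A"}

hits, freq = build_distribution_map(tokens, lexicon)
for box, word, gloss in hits:
    print(box, word, gloss)
print(dict(freq))
```

In a real system the `hits` list would drive the overlay drawn on the transcription or on the original image, while `freq` would give the reader a quick sense of which sensitive words dominate the document.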