TH-OCR: System for Multilingual Document Analysis, Recognition and Reconstruction

LR Peng, XQ Ding, CS Liu, M Chen, C Fang
2002-01-01
Abstract: This paper presents the framework and key technologies of the TH-OCR software system for multilingual (Chinese, English, Japanese, Korean) document analysis, recognition and reconstruction. The software converts scanned document images into machine-readable documents with a high recognition rate while preserving the original layout. The key technologies of TH-OCR include a high-performance multilingual character recognition kernel, a Chinese (Japanese/Korean)-English mixed-script character segmentation technique, and automatic layout analysis, understanding, and reconstruction. It is a useful tool for digitizing documents at large scale for applications such as digital libraries and electronic publication via the Internet or CD-ROM.