Touching Character Segmentation Method For Chinese Historical Documents

Xiaolu Sun, Liangrui Peng, Xiaoqing Ding
DOI: https://doi.org/10.1117/12.840251
2010-01-01
Abstract: OCR for Chinese historical documents is still an open problem. Because these documents are handwritten or hand-carved in various styles, overlapped and touching characters make character segmentation very difficult. This paper presents an over-segmentation-based method to handle overlapped and touching Chinese characters in historical documents. The segmentation process has two parts: over-segmentation and segmenting path optimization. In the first part, touching strokes are found and segmented by analyzing the geometric information of the white and black connected components. The segmentation cost of the touching strokes is estimated from the shape and location of the connected components, as well as the touching stroke width. The second part uses dynamic programming with local optimization to find the best segmenting path. A hidden Markov model (HMM) is used to express the multiple choices of segmenting paths, and the Viterbi algorithm is used to search for a locally optimal solution. Experimental results on practical Chinese documents show that the proposed method is effective.
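The path-optimization step can be illustrated with a generic minimum-cost Viterbi search. The sketch below is not the authors' implementation: the abstract does not give the HMM structure or the cost formulas, so the two states (keep or cut at each candidate point), the emission costs derived from a per-point segmentation cost, and the penalty on consecutive cuts are all assumptions made for illustration, as is the function name viterbi_min_cost.

```python
from typing import List, Sequence


def viterbi_min_cost(emission: Sequence[Sequence[float]],
                     transition: Sequence[Sequence[float]]) -> List[int]:
    """Return the minimum-cost state sequence.

    emission[t][s]   : cost of being in state s at step t
    transition[p][s] : cost of moving from state p to state s
    """
    n_steps = len(emission)
    if n_steps == 0:
        return []
    n_states = len(transition)

    # Best accumulated cost of any path ending in each state at step 0.
    cost = list(emission[0])
    back = []  # back[t-1][s] = best predecessor of state s at step t

    for t in range(1, n_steps):
        new_cost = []
        pointers = []
        for s in range(n_states):
            best_prev = min(range(n_states),
                            key=lambda p: cost[p] + transition[p][s])
            new_cost.append(cost[best_prev] + transition[best_prev][s]
                            + emission[t][s])
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final state.
    state = min(range(n_states), key=lambda s: cost[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical segmentation costs for four candidate cut points produced
# by over-segmentation (lower = more plausible cut). Values are made up.
cut_costs = [0.1, 0.8, 0.7, 0.2]

# Assumed states: 0 = keep the pieces merged, 1 = cut here.
emission = [[1.0 - c, c] for c in cut_costs]

# Assumed transition costs: a small penalty for two cuts in a row.
transition = [[0.0, 0.0],
              [0.0, 0.3]]

decisions = viterbi_min_cost(emission, transition)
print(decisions)  # [1, 0, 0, 1]: cut at the first and last candidates
```

In this toy setting the search keeps the two middle candidates merged and cuts at the outer ones. A real system would replace the made-up costs with the geometric estimates the abstract describes (component shape, location, and touching stroke width).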