Persian printed text line detection based on font size

Amirreza Fateh, Mohsen Rezvani, Alireza Tajary, Mansoor Fateh
DOI: https://doi.org/10.1007/s11042-022-13243-x
IF: 2.577
2022-06-24
Multimedia Tools and Applications
Abstract: Text line segmentation is an essential step in converting document images into text. In OCR systems, text line segmentation affects the character segmentation stage, which has a direct effect on the recognition rate of the system. In scanned images, some lines are skewed or curved, and recognizing these lines correctly is a further challenge in this field. In addition, in some languages such as Persian, some words carry diacritics. In this paper, we introduce a novel text line segmentation method based on the final font size for Persian printed document images to solve these problems. The method finds a specific size and uses it to glue together the connected components (CCs) of a line. To this end, the pre-processing step of the proposed method removes every small object from the input image using a de-noising method. In the next step, the method measures the diameter of each CC in the image to detect the final font size. In the last step, the method finds all CCs that lie in the same horizontal direction and connects them. Because no Persian OCR dataset was available, we created one. Experiments on this dataset show that the proposed method reaches 99.3% accuracy. Notably, the dataset contains some curved lines, which makes it more challenging.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
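
The three steps named in the abstract (de-noising, font-size detection from connected-component diameters, and joining components that share a horizontal band) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the `min_area` threshold, the use of the larger bounding-box side as the "diameter", the median as the font-size estimate, and the half-font-size grouping rule are all assumptions made for illustration. The sketch does not handle skewed or curved lines, and its area filter would also discard small diacritics.

```python
from collections import deque
from statistics import median


def connected_components(img):
    """Return (top, left, bottom, right, area) for each 8-connected foreground blob."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right, area = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right, area))
    return boxes


def segment_lines(img, min_area=2):
    """Return (font_size, lines); each line is a list of boxes sorted left to right."""
    # Step 1 (stand-in for de-noising): drop blobs smaller than min_area pixels.
    boxes = [b for b in connected_components(img) if b[4] >= min_area]
    if not boxes:
        return 0, []
    # Step 2 (assumption): "diameter" = larger bounding-box side; font size = median.
    font_size = median(max(b[2] - b[0] + 1, b[3] - b[1] + 1) for b in boxes)
    # Step 3 (assumption): a blob joins the current line when its vertical centre
    # is within half a font size of that line's mean vertical centre.
    lines = []
    for box in sorted(boxes, key=lambda b: (b[0] + b[2]) / 2):
        centre = (box[0] + box[2]) / 2
        if lines and abs(centre - lines[-1]["centre"]) <= font_size / 2:
            members = lines[-1]["boxes"]
            members.append(box)
            lines[-1]["centre"] = sum((b[0] + b[2]) / 2 for b in members) / len(members)
        else:
            lines.append({"centre": centre, "boxes": [box]})
    return font_size, [sorted(line["boxes"], key=lambda b: b[1]) for line in lines]
```

Here `img` is a list of rows of truthy/falsy pixels, with truthy meaning ink. A real pipeline would work on a binarized scan and would need a grouping rule that follows skewed or curved baselines, which this sketch deliberately leaves out.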