An Efficient Transformer–CNN Network for Document Image Binarization

Lina Zhang, Kaiyuan Wang, Yi Wan
DOI: https://doi.org/10.3390/electronics13122243
IF: 2.9
2024-06-08
Electronics
Abstract: Color image binarization plays a pivotal role in image preprocessing and significantly impacts subsequent tasks, particularly text recognition. This paper concentrates on document image binarization (DIB), which aims to separate an image into a foreground (text) and background (non-text content). We thoroughly analyze conventional and deep-learning-based approaches and conclude that prevailing DIB methods leverage deep learning technology. Furthermore, we explore the receptive fields of pre- and post-network training to underscore the Transformer model's advantages. Subsequently, we introduce a lightweight model based on the U-Net structure and enhanced with the MobileViT module to better capture global information features in document images. Given its adeptness at learning both local and global features, our proposed model demonstrates competitive performance on two standard datasets (DIBCO2012 and DIBCO2017) and good robustness on the DIBCO2019 dataset. Notably, our proposed method is a straightforward end-to-end model that requires no additional image preprocessing or post-processing and does not rely on ensemble models. Moreover, its parameter count is less than one-eighth that of the model that achieves the best results on most DIBCO datasets. Finally, two sets of ablation experiments are conducted to verify the effectiveness of the proposed binarization model.
engineering, electrical & electronic; computer science, information systems; physics, applied
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses document image binarization (DIB): separating the text (foreground) from the background (non-text content) in document images. The output is a "black text on white paper" image in which foreground pixels are set to 0 and background pixels to 255.

#### Research Background and Challenges

- **Degraded Document Processing**: Historical documents are often severely degraded, for example by yellowed paper and ink contamination. Manually processing large volumes of historical text is time-consuming, labor-intensive, and error-prone.
- **Limitations of Existing Methods**: Traditional binarization methods (such as the Otsu and Niblack methods) perform poorly on low-contrast or unevenly illuminated images. Deep-learning-based methods perform better but still struggle with complex background textures.

#### Proposed Method

- **Combining U-Net and Transformer**: A lightweight model based on the U-Net structure and incorporating the MobileViT module is proposed to better capture global information features in document images.
- **Model Characteristics**: The model's parameter count is less than one-eighth that of the model that achieves the best results on most DIBCO datasets, and it learns both local and global features.
- **Experimental Results**: The model achieves competitive performance on two standard datasets (DIBCO2012 and DIBCO2017) and shows good robustness on the DIBCO2019 dataset.

By introducing the MobileViT module, the model improves document image binarization performance while remaining efficient.
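
To make the output convention (foreground 0, background 255) and the classical Otsu baseline mentioned above concrete, here is a minimal sketch in Python with NumPy. It assumes an 8-bit grayscale input, and the names `otsu_threshold` and `binarize` are illustrative. It shows the traditional global-threshold approach only, not the paper's U-Net/MobileViT model.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold t for an 8-bit grayscale image.

    Pixels with value <= t form the dark class (assumed to be text).
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()

    omega = np.cumsum(prob)                # dark-class weight for each candidate t
    mu = np.cumsum(prob * np.arange(256))  # cumulative intensity mean up to t
    mu_total = mu[-1]

    # Between-class variance for every candidate threshold.
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0  # thresholds that leave one class empty

    return int(np.argmax(sigma_b))


def binarize(gray: np.ndarray) -> np.ndarray:
    """Map dark pixels (text) to 0 and bright pixels (background) to 255."""
    t = otsu_threshold(gray)
    return np.where(gray <= t, 0, 255).astype(np.uint8)


# Synthetic "document": a bright page with one dark vertical stroke.
page = np.full((8, 8), 200, dtype=np.uint8)
page[2:6, 3] = 30
binary = binarize(page)
```

A single global threshold like this works when ink and paper are well separated in intensity. On low-contrast or unevenly illuminated pages no single threshold fits the whole image, which is the limitation the summary attributes to traditional methods.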