A Fusion Framework of Whitespace Smear Cutting and Swin Transformer for Document Layout Analysis

Ran Chen, Jo-Ku Cheng, Jinwen Ma
DOI: https://doi.org/10.1007/978-981-97-5597-4_29
2024-01-01
Abstract: Document Layout Analysis (DLA), which aims to automatically recognize the layout structure of the basic or semantic elements within a document, is critical for understanding and reconstructing documents. However, DLA is challenging because document layouts are diverse and complex and vary across languages. In particular, the analysis of Chinese documents needs further theoretical and practical investigation. This paper proposes a fusion framework of Whitespace Smear Cutting (WSC) and Swin Transformer for layout analysis, mainly of Chinese documents. In the first phase, we perform a new kind of unsupervised segmentation of document images with our proposed WSC algorithm, which preserves the delicate edges of the connected blocks of a document. In the second phase, we use a novel semantic segmentation network based on the Swin Transformer for pixel-level classification. We design a continuous training paradigm for the Swin Transformer, consisting of pre-training on large-scale data followed by fine-tuning on specific datasets to adapt the model to their particular data distributions. In the fusion process, we use the pixel-level semantic information to guide the integration of connected blocks with the same semantics obtained from the WSC algorithm and the semantic segmentation, applying rules based on confidence levels and block distributions. This effectively alleviates the challenging problem of bounding box overlap and thus improves the accuracy of semantic classification. Finally, experimental results on a collected dataset of Chinese documents and on the POD dataset demonstrate that the proposed fusion framework is feasible and effective for DLA.
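The fusion idea described in the abstract, using pixel-level class predictions to label connected blocks, can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the fusion rules, so the confidence-weighted vote below is an assumption, and the function name `label_blocks` and its three inputs (a block-ID map, a per-pixel class map, and a per-pixel confidence map) are hypothetical.

```python
from collections import defaultdict


def label_blocks(block_ids, pixel_classes, pixel_conf):
    """Assign each block the class with the largest summed pixel confidence.

    block_ids:     2D list of ints, 0 = whitespace/background (assumed convention)
    pixel_classes: 2D list of class names predicted per pixel
    pixel_conf:    2D list of confidences for those predictions
    """
    # votes[block][cls] accumulates the confidence mass for each class
    votes = defaultdict(lambda: defaultdict(float))
    for id_row, cls_row, conf_row in zip(block_ids, pixel_classes, pixel_conf):
        for block, cls, conf in zip(id_row, cls_row, conf_row):
            if block != 0:  # skip background pixels
                votes[block][cls] += conf
    # Pick the highest-scoring class for every block
    return {block: max(scores, key=scores.get) for block, scores in votes.items()}


# Toy 2x4 "image": block 1 on the left, block 2 on the right
block_ids = [[1, 1, 0, 2],
             [1, 1, 0, 2]]
pixel_classes = [["text", "table", "text", "figure"],
                 ["text", "text", "text", "figure"]]
pixel_conf = [[0.9, 0.6, 0.5, 0.8],
              [0.8, 0.7, 0.5, 0.9]]

print(label_blocks(block_ids, pixel_classes, pixel_conf))
```

In this toy input, the single "table" pixel inside block 1 is outvoted by the surrounding "text" pixels, so the block keeps one label. The rules in the paper also take block distributions into account, which this sketch does not attempt to model.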