ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

Zhouqiang Jiang, Bowen Wang, Junhao Chen, Yuta Nakashima
2024-10-16
Abstract: Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to automatically identify such grouping, we argue that current VrDU approaches are unrealistic. We thus introduce a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), that does not allow for using manually annotated semantic groups. We also propose a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping through arranging words and bringing the representations of words that belong to the potential same semantic group closer together. Our experimental results demonstrate that the performance of existing methods deteriorates on the ReVrDU task, while ReLayout shows superior performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to perform visually-rich document understanding (VrDU) without manually annotated semantic groups. Current VrDU methods rely on such groups to capture the semantic structure of a document, which is unrealistic in practice because automated tools such as OCR cannot identify them. The paper therefore proposes a new VrDU variant, real-world visually-rich document understanding (ReVrDU), and a new method named ReLayout. ReLayout is a layout-enhanced pre-training model that learns to capture semantic grouping by arranging words and by bringing the representations of words that potentially belong to the same semantic group closer together, which makes it suitable for the ReVrDU scenario. The main contributions of the paper are as follows:

1. **Proposing the ReVrDU task**: Unlike existing VrDU tasks, ReVrDU does not allow the use of manually annotated semantic groups. Only information provided by OCR tools may be used, such as words, global 1D positions, word-level 2D bounding boxes, and text paragraphs, which brings the task closer to practical application scenarios.
2. **Designing the ReLayout model**: ReLayout strengthens the understanding of local layout structures and relationships by introducing two strategies, 1D local order prediction (1-LOP) and 2D text-paragraph clustering (2-TSC), and learns potential semantic group information in a self-supervised manner.
3. **Experimental verification**: Experiments on multiple datasets show that ReLayout outperforms existing methods on downstream tasks in both the ideal and the real-world scenario, and that it is more robust when the input is extracted by different OCR tools.
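To make the self-supervised signals above more concrete, the following is a minimal sketch of how training targets could be derived from OCR output alone. It is not the authors' implementation: the `OcrWord` type, the function names, and the sample words are all hypothetical, and it assumes that "local order" means a word's position inside its own OCR text paragraph and that words sharing a paragraph are the ones whose representations would be pulled together.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OcrWord:
    """Hypothetical container for the OCR-only inputs allowed in ReVrDU."""
    text: str
    global_pos: int                    # global 1D position from the OCR tool
    bbox: Tuple[int, int, int, int]    # word-level 2D box (x0, y0, x1, y1)
    paragraph_id: int                  # text paragraph assigned by the OCR tool


def local_order_targets(words: List[OcrWord]) -> List[int]:
    """Assumed 1-LOP-style target: index of each word within its paragraph.

    The result is aligned with the input list; counting restarts at 0
    for every paragraph and follows the global 1D position.
    """
    targets = [0] * len(words)
    counters: Dict[int, int] = {}
    order = sorted(range(len(words)), key=lambda i: words[i].global_pos)
    for i in order:
        pid = words[i].paragraph_id
        targets[i] = counters.get(pid, 0)
        counters[pid] = targets[i] + 1
    return targets


def same_paragraph_pairs(words: List[OcrWord]) -> List[Tuple[int, int]]:
    """Assumed 2-TSC-style positives: index pairs that share a paragraph."""
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if words[i].paragraph_id == words[j].paragraph_id:
                pairs.append((i, j))
    return pairs


# Made-up sample: two OCR paragraphs on one page.
words = [
    OcrWord("Invoice", 0, (10, 10, 80, 25), 0),
    OcrWord("No.", 1, (85, 10, 110, 25), 0),
    OcrWord("Total", 2, (10, 40, 60, 55), 1),
    OcrWord("Due", 3, (65, 40, 95, 55), 1),
    OcrWord("$42", 4, (100, 40, 130, 55), 1),
]

order_targets = local_order_targets(words)
positive_pairs = same_paragraph_pairs(words)
```

Neither target requires a human annotator: both come from the paragraph assignment and reading order that an OCR tool already provides, which is the constraint the ReVrDU task imposes.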
In conclusion, this paper aims to make VrDU research more practical and automated, reduce its dependence on manually annotated data, and improve the performance and reliability of models in practical applications.