Reading order detection in visually-rich documents with multi-modal layout-aware relation prediction
Liang Qiao, Can Li, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Xi Li
DOI: https://doi.org/10.1016/j.patcog.2024.110314
IF: 8
2024-02-04
Pattern Recognition
Abstract: Reading order detection aims to arrange text in a logical order, which is essential for understanding visually-rich documents. Current methods mostly model the problem as a sequence generation task; they use insufficient modality information and ignore the different reading habits that apply under different document layouts, which leads to a lack of robustness in complex scenarios. To address these challenges, we present a novel approach, Multi-Modal Layout-Aware Relation Prediction. It employs a straightforward yet highly effective task formulation: predicting the order relation between text instances. Our model leverages visual, semantic, and positional features, with the positional features generated adaptively by a layout-aware position embedding module. The features from the different modalities are then enhanced by a two-stage position-guided multi-modal fusion module. Additionally, we introduce two novel loss functions, Degree Loss and Cycle Loss, to impose network constraints at multiple levels. Experimental results on three real-world datasets demonstrate that the proposed method achieves new state-of-the-art performance.
computer science, artificial intelligence; engineering, electrical & electronic
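
To make the relation-prediction formulation in the abstract concrete, the following is a minimal sketch of how a reading order could be recovered from pairwise successor scores. It is not the authors' method: the abstract does not say how an order is decoded from the predicted relations, nor how Degree Loss and Cycle Loss are defined. The function name `decode_reading_order`, the `scores` matrix layout (`scores[i][j]` is the assumed confidence that text instance `j` directly follows instance `i`), and the greedy decoding strategy are all illustrative assumptions. The sketch only enforces, at decoding time, that each instance has at most one successor and one predecessor and that no cycle is formed.

```python
from typing import List, Sequence


def decode_reading_order(scores: Sequence[Sequence[float]]) -> List[int]:
    """Greedily decode one reading order from pairwise successor scores.

    Assumption (not from the paper): scores[i][j] is the confidence that
    text instance j is read immediately after text instance i.
    """
    n = len(scores)
    # All candidate relations i -> j, highest confidence first.
    edges = sorted(
        ((scores[i][j], i, j) for i in range(n) for j in range(n) if i != j),
        key=lambda e: -e[0],
    )

    successor = [-1] * n          # successor[i] = instance read after i
    has_pred = [False] * n        # whether some instance already precedes j
    parent = list(range(n))       # union-find forest for cycle detection

    def find(x: int) -> int:
        # Find the component root with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = 0
    for _, i, j in edges:
        if accepted == n - 1:
            break                 # a full chain needs exactly n - 1 relations
        if successor[i] != -1 or has_pred[j]:
            continue              # degree check: one successor, one predecessor
        ri, rj = find(i), find(j)
        if ri == rj:
            continue              # cycle check: i and j are already chained
        successor[i] = j
        has_pred[j] = True
        parent[ri] = rj
        accepted += 1

    if n == 0:
        return []
    # The chain starts at the only instance without a predecessor.
    node = has_pred.index(False)
    order = []
    while node != -1:
        order.append(node)
        node = successor[node]
    return order


# Toy example: three text instances whose intended order is 2, 0, 1.
toy_scores = [
    [0.0, 0.8, 0.1],
    [0.1, 0.0, 0.7],   # 1 -> 2 scores high but would close a cycle
    [0.9, 0.2, 0.0],
]
print(decode_reading_order(toy_scores))  # [2, 0, 1]
```

Because every pair of instances is a candidate relation, this greedy procedure always ends with a single chain covering all instances. A learned model would supply `scores`; here they are hand-written toy values.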