HiM: hierarchical multimodal network for document layout analysis

Xu Canhui, Li Yuteng, Shi Cao, Zhang Honghong, Bi Hengyue, Chen Yinong
DOI: https://doi.org/10.1007/s10489-023-04782-3
IF: 5.3
2023-07-23
Applied Intelligence
Abstract: In document layout analysis, computer-vision-based and natural-language-processing-based methods are employed individually or in combination to enrich the available feature information and to strengthen object detection. To leverage the visual and textual modalities simultaneously, this paper proposes a hierarchical multimodal (HiM) network that aggregates representative features from multi-source inputs, introducing complementary semantics and non-local context dependencies across multiple granularity scales. Different channel and spatial attention mechanisms are adapted to the different modalities. The visual modality is based on a conventional convolutional network, while the textual modality focuses on embedding hierarchical textual vectors and positioning. The feature representations from the modalities are then integrated adaptively in a feature pyramid network for subsequent region proposal processing. The authors also adapt the PubLayNet dataset by inserting semi-structured elements and extending the ground-truth annotations through parsing PDF pages. Extensive experiments are carried out on three popular benchmarks, Article Regions, PubLayNet, and DocBank, to verify the effectiveness and adaptability of HiM.
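To make the idea of adaptive multimodal integration concrete, the following is a minimal sketch of one common way to blend a visual and a textual feature map with a per-channel gate. It is an illustration only and not the HiM architecture: the function names (`channel_means`, `gated_fusion`), the nested-list `[C][H][W]` layout, and the parameter-free sigmoid gate are all assumptions, since the abstract does not specify how the attention or fusion is computed.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_means(fmap):
    """Global average pooling: one scalar per channel of a [C][H][W] nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def gated_fusion(visual, textual):
    """Blend two same-shaped [C][H][W] maps with a per-channel gate.

    Assumption: the gate is a sigmoid of the difference between pooled
    channel responses. A trained model would learn this gate instead.
    """
    v_mean = channel_means(visual)
    t_mean = channel_means(textual)
    gates = [sigmoid(v - t) for v, t in zip(v_mean, t_mean)]
    fused = [
        [[g * v + (1.0 - g) * t for v, t in zip(v_row, t_row)]
         for v_row, t_row in zip(v_ch, t_ch)]
        for g, v_ch, t_ch in zip(gates, visual, textual)
    ]
    return fused, gates


# Two channels, 2x2 spatial grid (toy values chosen for illustration).
visual = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
textual = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
fused, gates = gated_fusion(visual, textual)
```

In this toy input the first channel has equal responses in both modalities, so its gate is 0.5 and the two maps are averaged; in the second channel the textual response is stronger, so the gate falls below 0.5 and the fused values lean toward the textual map.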
computer science, artificial intelligence