Visual and textual based multimodal document object detection

Li Yuteng, Shi Cao, Xu Canhui, Cheng Yuanzhi
DOI: https://doi.org/10.19734/j.issn.1001-3695.2022.08.0425
2023-01-01
Abstract: The layout of document images is complex and the distribution of object sizes is uneven, yet most current detection methods ignore multimodal information and global dependencies. Therefore, this paper proposes a multimodal document object detection method based on vision and text. First, the method explores fusion strategies for multimodal features. To exploit textual features, it converts the text sequence information of the image into a two-dimensional representation. After an initial fusion of textual and visual features, the fused features are fed into the backbone network to extract multiscale features, and textual features are repeatedly integrated during extraction, which achieves deep fusion of the multimodal features. Next, to ensure detection accuracy for both small and large objects, the paper designs a pyramid network. Its lateral connections concatenate, along the channel dimension, feature maps of the same spatial size from the bottom-up pathway and the top-down pathway, so that high-level semantic information and low-level feature information propagate to each other. Experimental results on the large public dataset PubLayNet show that the method reaches a detection accuracy of 95.86%, higher than that of other methods. The method thus achieves deep fusion of multimodal features, enriches the fused multimodal feature information, and delivers good detection performance.
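The two mechanisms named in the abstract, a two-dimensional text representation and channel-wise lateral concatenation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (text_to_grid, upsample2x, lateral_concat), the toy word embeddings, the bounding-box format, and the nearest-neighbour upsampling are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def text_to_grid(words, boxes, embed, height, width):
    """Hypothetical helper: paint each word's embedding vector into the
    pixels covered by its bounding box (x0, y0, x1, y1), giving a
    (dim, height, width) map aligned with the page image."""
    dim = len(next(iter(embed.values())))
    grid = np.zeros((dim, height, width), dtype=np.float32)
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        vec = embed.get(word)
        if vec is None:  # unknown words leave the grid at zero
            continue
        grid[:, y0:y1, x0:x1] = np.asarray(vec, dtype=np.float32)[:, None, None]
    return grid

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map (an assumption)."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def lateral_concat(bottom_up, top_down):
    """Concatenate two maps of equal spatial size along the channel axis."""
    if bottom_up.shape[1:] != top_down.shape[1:]:
        raise ValueError("feature maps must share the same spatial size")
    return np.concatenate([bottom_up, top_down], axis=0)

# Toy 4x4 page with two words and 2-dimensional embeddings (made-up values).
embed = {"Table": [1.0, 0.0], "1": [0.0, 1.0]}
grid = text_to_grid(["Table", "1"], [(0, 0, 2, 1), (2, 0, 3, 1)], embed, 4, 4)

# Initial fusion: stack the text grid onto a 3-channel image.
image = np.zeros((3, 4, 4), dtype=np.float32)
fused = np.concatenate([image, grid], axis=0)

# Lateral connection: a low-level 4x4 map meets an upsampled 2x2 map.
c_low = np.ones((8, 4, 4), dtype=np.float32)
p_high = np.ones((8, 2, 2), dtype=np.float32)
merged = lateral_concat(c_low, upsample2x(p_high))
```

Concatenation, unlike the element-wise addition used in a standard feature pyramid, keeps both sets of channels intact. The output therefore has the sum of the two channel counts, and a later layer would have to reduce it.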