Dual visual align-cross attention-based image captioning transformer
Yonggong Ren, Jinghan Zhang, Wenqiang Xu, Yuzhu Lin, Bo Fu, Dang N. H. Thanh
DOI: https://doi.org/10.1007/s11042-024-19315-4
IF: 2.577
2024-05-18
Multimedia Tools and Applications
Abstract: Region-based features, widely used in image captioning, are typically extracted with object detectors such as Faster R-CNN. However, this approach captures only region-level information and does not consider the holistic global information of the entire image. This limitation hinders the development of complex multi-modal reasoning capabilities in image captioning and leads to issues such as a lack of contextual information, inaccurate object detection, and high computational costs. To address these limitations and build on the success of transformer-based architectures in image captioning, a transformer-based neural structure called DVAT (Dual Visual Attention-based Image Captioning Transformer) is proposed. DVAT combines two kinds of visual features, region features and grid features, to generate more accurate captions. It divides region features into semi-region feature self-attention operations, which compute hidden features of the image, and semi-region feature convolutional operations, which capture background and contextual information. This approach enhances the receptive field of grid features while accelerating computation. Moreover, DVAT incorporates aligned-cross attention between region features and grid features to better integrate the dual visual features. This design and the fusion of dual visual features yield notable performance improvements. Experimental results on multiple image captioning benchmarks demonstrate that DVAT outperforms previous methods in both inference accuracy and speed. Extensive experiments on the MS COCO dataset further validate that DVAT surpasses many state-of-the-art techniques.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
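
The abstract mentions cross attention between region features and grid features. As a rough illustration only, the following is a minimal sketch of generic scaled dot-product cross attention in which region features act as queries and grid features act as keys and values. It is not the authors' DVAT implementation: the alignment step, learned projections, and multi-head structure are omitted, and the function names, toy inputs, and feature dimension are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross attention.

    Each query row attends over all key rows and returns a weighted
    average of the value rows. No learned projections are applied
    (an assumption made to keep the sketch short).
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy inputs: 2 region features and 3 grid features, dimension 2.
regions = [[1.0, 0.0], [0.0, 1.0]]
grids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Region features query the grid features; output has one row per region.
fused = cross_attention(regions, grids, grids)
```

Here each row of `fused` is a convex combination of the grid features, weighted toward the grid cells most similar to the corresponding region. A real model would add learned query, key, and value projections and whatever alignment mechanism the paper defines.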