CSDNet: Cross-Sketch with Dual Gated Attention for Fine-Grained Image Captioning Network

Md. Shamim Hossain, Shamima Aktar, Md. Bipul Hossen, Mohammad Alamgir Hossain, Naijie Gu, Zhangjin Huang
DOI: https://doi.org/10.1007/s11042-024-20220-z
IF: 2.577
2024-01-01
Multimedia Tools and Applications
Abstract: Contemporary models that extract inter- and intra-modal interactions often suffer from reduced computational efficiency, particularly when dealing with lengthy visual sequences. To address this, this study introduces the Cross-Sketch with Dual Gated Attention Network (CSDNet), which handles second-order intra- and inter-modal interactions by integrating two attention modules. Capturing these second-order interactions with bilinear pooling typically requires substantial computational resources because it processes high-dimensional tensors. To reduce these demands, the first module, Cross-Sketch Attention (CSA), applies Cross-Tensor Sketch Pooling to attention features, reducing dimensionality while preserving crucial information without sacrificing caption quality. The second module, Dual Gated Attention (DGA), contributes additional spatial and channel-wise attention distributions to improve caption generation performance. Our method demonstrates significant computational efficiency improvements, reducing computation time per epoch by an average of 13.54
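The abstract does not spell out how Cross-Tensor Sketch Pooling is computed, so the following is only a minimal sketch of the generic Tensor Sketch idea that such pooling is usually built on: two feature vectors are each compressed with a count sketch, and the sketch of their outer product (the bilinear, second-order term) is obtained by circular convolution via the FFT, without ever forming the full outer product. The function names, the output dimension `D`, and the use of NumPy are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def make_count_sketch(d, D, rng):
    """Draw a random hash (bucket index) and a random sign for each of d inputs."""
    h = rng.integers(0, D, size=d)        # bucket in [0, D) for each coordinate
    s = rng.choice([-1.0, 1.0], size=d)   # random sign for each coordinate
    return h, s


def count_sketch(x, h, s, D):
    """Project x (length d) to length D by signed accumulation into buckets."""
    out = np.zeros(D)
    np.add.at(out, h, s * x)              # unbuffered add handles repeated buckets
    return out


def tensor_sketch(x, y, params_x, params_y, D):
    """Sketch of the outer product of x and y, computed in O(d + D log D).

    Multiplying the FFTs of the two count sketches is a circular convolution,
    which equals a count sketch of the outer product under the combined hash
    (h_x[i] + h_y[j]) mod D and the combined sign s_x[i] * s_y[j].
    """
    cx = count_sketch(x, *params_x, D)
    cy = count_sketch(y, *params_y, D)
    return np.fft.irfft(np.fft.rfft(cx) * np.fft.rfft(cy), n=D)
```

With, say, a hypothetical 2048-dimensional visual feature and a 1024-dimensional query, the explicit outer product would have about two million entries, whereas `tensor_sketch` returns a vector of length `D` chosen by the user. Inner products between such sketches approximate inner products between the corresponding outer products, which is why this kind of compression can stand in for full bilinear pooling.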