DCMSTRD: End-to-end Dense Captioning via Multi-Scale Transformer Decoding

Zhuang Shao, Jungong Han, Kurt Debattista, Yanwei Pang
DOI: https://doi.org/10.1109/tmm.2024.3369863
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Dense captioning generates descriptions for diverse Regions of Interest (RoIs) in complex visual scenes. While promising results have been obtained, several issues persist. In particular: 1) it is hard to find the optimal parameters for artificially designed modules (e.g., non-maximum suppression (NMS)), which causes redundancies and limits the interactions that could benefit the two sub-tasks of RoI detection and RoI captioning; 2) the absence of a multi-scale decoder in current methods hinders the acquisition of scale-invariant features, leading to poor performance. To tackle these limitations, we bypass the artificially designed modules and present an end-to-end dense captioning framework via multi-scale transformer decoding (DCMSTRD). DCMSTRD instead solves dense captioning by set matching and prediction. To further enhance the discriminative quality of the multi-scale representations during caption generation, we introduce a multi-scale module, termed the multi-scale language decoder (MSLD). Tested on standard datasets, our proposed method achieves a mean Average Precision (mAP) of 16.7% on the challenging VG-COCO dataset, demonstrating its effectiveness against current methods.
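The abstract does not specify how the set matching is carried out. As a rough illustration of the general idea only, the sketch below assigns each ground-truth region to exactly one predicted region by minimising a total cost of 1 - IoU, so that duplicate predictions are left unmatched rather than filtered by NMS. The function names `iou` and `match_predictions`, the IoU-only cost, and the brute-force search are all assumptions made for illustration; they are not taken from the paper, and practical systems typically use the Hungarian algorithm with a cost that also includes classification or caption terms.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_predictions(pred_boxes, gt_boxes):
    """One-to-one assignment {gt index: pred index} minimising sum(1 - IoU).

    Hypothetical helper: brute force over all assignments, so it is only
    suitable for very small sets. Assumes len(pred_boxes) >= len(gt_boxes).
    """
    if len(pred_boxes) < len(gt_boxes):
        raise ValueError("need at least as many predictions as ground truths")
    best_cost, best_perm = float("inf"), ()
    # Each permutation picks a distinct prediction for every ground truth.
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(1.0 - iou(pred_boxes[p], gt_boxes[g])
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {g: p for g, p in enumerate(best_perm)}


preds = [(0, 0, 10, 10), (50, 50, 60, 60), (1, 1, 11, 11)]
gts = [(0, 0, 10, 10), (49, 49, 59, 59)]
matches = match_predictions(preds, gts)
unmatched = [i for i in range(len(preds)) if i not in matches.values()]
```

In this toy input, the third prediction nearly duplicates the first. Under one-to-one matching it stays unmatched and would be trained as "no object", which is how set prediction avoids a separate NMS step with hand-tuned thresholds.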
computer science, information systems,telecommunications, software engineering