CDKM: Common and Distinct Knowledge Mining Network with Content Interaction for Dense Captioning

Hongyu Deng, Yushan Xie, Qi Wang, Jianjun Wang, Weijian Ruan, Wu Liu, Yong-Jin Liu
DOI: https://doi.org/10.1109/TMM.2024.3407695
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: The dense captioning task aims to detect multiple salient regions of an image and describe each of them in natural language. Although dense captioning has advanced significantly in recent years, existing methods still have limitations. On the one hand, most dense captioning methods lack strong target detection capabilities and struggle to cover all relevant content in target-intensive images. On the other hand, current transformer-based methods are powerful but neglect the acquisition and use of contextual information, which hinders the visual understanding of local areas. To address these issues, we propose a common and distinct knowledge mining network with content interaction for dense captioning. The network has a knowledge mining mechanism that improves the detection of salient targets by capturing common and distinct knowledge from multi-scale features. We further propose a content interaction module that combines region features into a unique context based on their correlation. Experiments on various benchmarks show that the proposed method outperforms current state-of-the-art methods.