Action-Driven Semantic Representation and Aggregation for Video Captioning

Tingting Han, Yaochen Xu, Jun Yu, Zhou Yu, Sicheng Zhao
DOI: https://doi.org/10.1109/tcsvt.2024.3502736
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Video captioning is a challenging task that entails generating natural language descriptions of visual content, and existing methods often fail to capture action semantics effectively. To harness action detection for a deeper understanding of video content, we propose an action-driven method named the Hierarchical Semantic Representation and Aggregation (HSRA) network. This method explicitly exploits action clues with a hierarchical semantic representation module, which models visual semantics in a three-level structure: “object-action-event”. By employing learnable action queries, our approach injects extensive action semantics into the model, thereby enabling more accurate and context-rich captions. To further enhance semantic alignment and understanding, we introduce a semantic aggregation component composed of a semantic interaction module and a semantic refinement module. This component aligns semantics across the different levels and emphasizes key information, leading to significant improvements in semantic consistency between the video and the generated captions. We performed extensive evaluations on two well-established public datasets, MSVD and MSR-VTT, and the findings consistently demonstrate that our proposed HSRA network outperforms contemporary state-of-the-art methods.
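The abstract does not spell out how the learnable action queries operate. As a rough illustration only, the sketch below assumes a common pattern in which a small set of query vectors attends over per-frame features through scaled dot-product cross-attention, yielding one action-level token per query. The function names, sizes, and random values are hypothetical and are not taken from the paper; in a trained model the queries would be learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Let each query attend over all features.

    Assumption: keys and values are both the raw frame features,
    with no learned projections, to keep the sketch short.
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in queries:
        # Attention weights of this query over every frame.
        w = softmax([dot(q, f) * scale for f in features])
        # Weighted sum of frame features gives one action-level token.
        token = [sum(w_t * f[d] for w_t, f in zip(w, features))
                 for d in range(dim)]
        outputs.append(token)
        weights.append(w)
    return outputs, weights


# Hypothetical sizes chosen only for illustration.
NUM_FRAMES, NUM_QUERIES, DIM = 6, 3, 4
rng = random.Random(0)

# Stand-in for per-frame visual features from a video encoder.
frame_features = [[rng.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_FRAMES)]
# Stand-in for learnable action queries (random here, learned in practice).
action_queries = [[rng.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(NUM_QUERIES)]

action_tokens, attn = cross_attend(action_queries, frame_features)
print(len(action_tokens), len(action_tokens[0]))
```

Each row of `attn` sums to one, so every action token is a convex combination of frame features. How the actual HSRA network combines such tokens with object-level and event-level semantics is described in the paper itself and is not reproduced here.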