A Video Captioning Method Based on Visual-Text Semantic Association

Yan Fu, Xinli Wei
DOI: https://doi.org/10.1109/ICSP58490.2023.10248222
2023-01-01
Abstract: Current video captioning methods based on encoder-decoder frameworks often rely too heavily on a single visual modality, which makes it difficult for the model to understand video content accurately. To address this problem, this paper proposes a video captioning method based on visual-text semantic association (VC-VTSA) from the perspective of multimodal association. In the encoding stage, the method extracts 2D static features, 3D motion features, and object-level regional features of the video and integrates them into global visual features. In the semantic association stage, the words generated so far are combined into phrases with contextual semantic dependencies using a self-attention mechanism, and these phrases are associated with the visual features extracted in the encoding stage to create a bimodal semantic region containing both visual content and textual information. Exploiting the complementary relationships between the modalities in this semantic region allows the video content to be characterized better. In addition, a visual noise filtering strategy (VNFS) is designed to help the phrases in the semantic region be accurately associated with the corresponding visual content. Finally, the constructed semantic regions are fed into an LSTM decoder to predict the next word, and this repeats until the complete video caption is generated.
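The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`fuse_visual`, `self_attention`, `associate`), the concatenate-and-project fusion, the attention without learned projections, and the cosine-similarity threshold standing in for VNFS are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_visual(feat_2d, feat_3d, feat_obj, w_proj):
    # Assumed fusion: concatenate per-frame 2D, 3D, and object features,
    # then project to a shared dimension. Shapes: (T, d2), (T, d3), (T, do).
    fused = np.concatenate([feat_2d, feat_3d, feat_obj], axis=-1)
    return np.tanh(fused @ w_proj)  # (T, d)

def self_attention(words):
    # Scaled dot-product self-attention over the words generated so far.
    # Learned query/key/value projections are omitted for brevity.
    d = words.shape[-1]
    scores = words @ words.T / np.sqrt(d)
    return softmax(scores) @ words  # (N, d) context-aware "phrases"

def associate(phrases, visual, threshold=0.0):
    # Hypothetical stand-in for the noise filter: frames whose cosine
    # similarity to a phrase falls below `threshold` are masked out.
    p = phrases / (np.linalg.norm(phrases, axis=-1, keepdims=True) + 1e-8)
    v = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + 1e-8)
    sim = p @ v.T  # (N, T)
    keep = sim >= threshold
    # Always keep the best-matching frame so no phrase is left empty.
    keep[np.arange(sim.shape[0]), sim.argmax(axis=-1)] = True
    weights = softmax(np.where(keep, sim, -np.inf), axis=-1)
    attended = weights @ visual  # (N, d) visual content per phrase
    # The "semantic region" pairs each phrase with its visual content.
    return np.concatenate([phrases, attended], axis=-1), weights

rng = np.random.default_rng(0)
T, N, d = 4, 3, 8  # frames, generated words, shared feature size
visual = fuse_visual(rng.normal(size=(T, 5)), rng.normal(size=(T, 6)),
                     rng.normal(size=(T, 7)), rng.normal(size=(18, d)))
phrases = self_attention(rng.normal(size=(N, d)))
region, weights = associate(phrases, visual, threshold=0.1)
```

In a full model, `region` would be passed to an LSTM decoder at each step to predict the next word; that stage is left out here because it requires trained parameters.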