Abstract: In computer vision, generating natural language captions from video input is a challenging task. Videos are usually treated as feature sequences and fed into a Long Short-Term Memory (LSTM) network to generate natural language. To obtain a richer and more detailed representation of video content, we develop a Multimodal Interaction Video Captioning Network based on Semantic Association Graph (MIVCN). The network consists of two modules: a Semantic Association Graph Module (SAGM) and a Multimodal Attention Constraint Module (MACM). First, because existing methods lack semantic interdependence, they often produce illogical sentence structures. We therefore propose an SAGM based on information association, which enables the network to strengthen the connections between logically related language elements and weaken those between logically unrelated ones. Second, the features of each modality need to attend to different information, and the captured multimodal features are highly informative but also redundant. Based on this observation, we propose an LSTM-based MACM, which captures complementary visual features and filters out redundant ones. The MACM integrates multimodal features into the LSTM and lets the network screen for and focus on informative features. Through the association of semantic attributes and the interaction of multimodal features, the network captures semantically interdependent context and visually complementary information, and makes better use of the informative representations in videos for caption generation. The proposed MIVCN achieves the best caption generation performance on MSVD: 56.8%, 36.4%, and 79.1% on the BLEU@4, METEOR, and ROUGE-L metrics, respectively. It also outperforms state-of-the-art methods on MSR-VTT in terms of BLEU@4, METEOR, and ROUGE-L.
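
The idea of weighting multimodal features so that informative ones dominate and redundant ones are down-weighted can be illustrated with plain dot-product soft attention. The sketch below is not the MACM described in the abstract, whose exact formulation is not given here; it is a generic attention step, and the names `softmax`, `attend`, `query`, and `features`, as well as the toy values, are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Generic dot-product soft attention (illustrative, not the paper's MACM).

    query:    list of floats, e.g. a decoder hidden state (hypothetical).
    features: list of equally sized float lists, one per modality or frame.
    Returns the attention weights and the weighted sum of the features.
    """
    # Relevance of each feature vector to the query.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum: features with low relevance contribute little.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, fused


# Toy example with made-up 2-D features for three hypothetical modalities.
query = [1.0, 0.0]
features = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
weights, fused = attend(query, features)
print(weights, fused)
```

In this toy setup, the first and third feature vectors align with the query and receive equal, larger weights, while the second is down-weighted. A learned model would replace the raw dot product with trained projections and feed the fused vector into the LSTM decoder.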

CDKM: Common and Distinct Knowledge Mining Network with Content Interaction for Dense Captioning

DCMSTRD: End-to-end Dense Captioning via Multi-Scale Transformer Decoding

Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Collaborative Detection and Caption Network

Region-Focused Network for Dense Captioning

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

Context and Attribute Grounded Dense Captioning

Complete 3D Relationships Extraction Modality Alignment Network for 3D Dense Captioning

Bi-Directional Co-Attention Network for Image Captioning

X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

Weakly Supervised Dense Video Captioning

CMGNet: Collaborative multi-modal graph network for video captioning

Fusion of Multi-Modal Features to Enhance Dense Video Caption

MIVCN: Multimodal interaction video captioning network based on semantic association graph

Dual-level Collaborative Transformer for Image Captioning

End-to-End 3D Dense Captioning with Vote2Cap-DETR

Exploring Discrete Diffusion Models for Image Captioning

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

Multi-Keys Attention Network for Image Captioning

Divided Caption Model with Global Attention