Abstract: Generating natural language captions from video input is a challenging task in computer vision. Videos are usually treated as feature sequences and fed into a Long Short-Term Memory (LSTM) network to generate natural language. To obtain a richer and more detailed representation of video content, we develop a Multimodal Interaction Video Captioning Network based on a Semantic Association Graph (MIVCN). The network consists of two modules: a Semantic Association Graph Module (SAGM) and a Multimodal Attention Constraint Module (MACM). First, because existing methods lack semantic interdependence, they often produce illogical sentence structures. We therefore propose an SAGM based on information association, which enables the network to strengthen the connections between logically related language terms and weaken those between logically unrelated terms. Second, the features of each modality need to attend to different information, and the captured multimodal features are highly informative but also redundant. Based on this observation, we propose an LSTM-based MACM, which captures complementary visual features and filters out redundant ones. The MACM integrates multimodal features into the LSTM and lets the network screen and focus on informative features. Through the association of semantic attributes and the interaction of multimodal features, the network captures semantically interdependent context and visually complementary information, and makes better use of the informative representations in videos for caption generation. The proposed MIVCN achieves the best caption generation performance on MSVD: 56.8%, 36.4%, and 79.1% on the BLEU@4, METEOR, and ROUGE-L metrics, respectively. It also reports superior BLEU@4, METEOR, and ROUGE-L results on MSR-VTT compared with state-of-the-art methods.
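
The idea of screening multimodal features so that informative ones receive more weight can be illustrated with a small attention sketch. This is not the MACM from the paper, whose details the abstract does not give; it is a generic dot-product soft attention over per-modality feature vectors, and it leaves out the LSTM entirely. The function names (`softmax`, `dot`, `attend`), the toy feature values, and the choice of dot-product scoring are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(hidden, features):
    """Weight each modality feature by its relevance to a decoder state.

    hidden:   hypothetical decoder state vector
    features: one feature vector per modality, same length as hidden
    Returns the attention weights and the weighted sum of the features.
    """
    weights = softmax([dot(hidden, f) for f in features])
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return weights, fused


# Toy inputs (assumed): three modality features and one decoder state.
hidden = [1.0, 0.0]
features = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, fused = attend(hidden, features)
```

In this toy setup the first feature aligns with the decoder state and receives the largest weight, while the two features that carry no aligned signal are down-weighted equally. A learned model would replace the raw dot product with trained projections.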

A Dataset with Multi-Modal Information and Multi-Granularity Descriptions for Video Captioning

Multimodality-guided Visual-Caption Semantic Enhancement

Multi-Modal interpretable automatic video captioning

Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Multi-scale features with temporal information guidance for video captioning

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

Multi-attention mechanism for Chinese description of videos

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Multimodal-enhanced hierarchical attention network for video captioning

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

Visual Commonsense-Aware Representation Network for Video Captioning

Multi-modal Summarization for Video-containing Documents

CMGNet: Collaborative multi-modal graph network for video captioning

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Fusion of Multi-Modal Features to Enhance Dense Video Caption

MIVCN: Multimodal interaction video captioning network based on semantic association graph

Multimodal Semantic Attention Network for Video Captioning

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

An Attempt towards Interpretable Audio-Visual Video Captioning