Abstract: In computer vision, generating natural language captions from video input is a challenging task. Existing approaches usually treat a video as a sequence of features and feed it into a Long Short-Term Memory (LSTM) network to generate natural language. To obtain a richer and more detailed representation of video content, we develop a Multimodal Interaction Video Captioning Network based on Semantic Association Graph (MIVCN). The network consists of two modules: a Semantic Association Graph Module (SAGM) and a Multimodal Attention Constraint Module (MACM). First, because existing methods do not model semantic interdependence, they often produce illogical sentence structures. We therefore propose the SAGM, based on information association, which enables the network to strengthen the connections between logically related language elements and weaken those between logically unrelated ones. Second, the features of each modality need to attend to different information, and the captured multimodal features are highly informative but also redundant. Based on this observation, we propose the LSTM-based MACM, which captures complementary visual features and filters out redundant ones. The MACM integrates multimodal features into the LSTM and lets the network screen for and focus on informative features. Through the association of semantic attributes and the interaction of multimodal features, the network captures semantically interdependent context and visually complementary information, and makes better use of the informative representations in videos for caption generation. The proposed MIVCN achieves the best caption generation performance on MSVD: 56.8%, 36.4%, and 79.1% on the BLEU@4, METEOR, and ROUGE-L metrics, respectively. It also outperforms state-of-the-art methods on MSR-VTT in terms of BLEU@4, METEOR, and ROUGE-L.
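
For readers unfamiliar with the BLEU@4 metric quoted in the abstract, the following is a minimal sketch of how such a score can be computed for one generated caption. It is a simplified, sentence-level variant against a single reference, with no smoothing and whitespace tokenization; the function names `ngrams` and `bleu4` are illustrative assumptions. It is not the evaluation code used for the reported numbers, which are normally computed at corpus level with multiple references per video.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu4(candidate, reference):
    """Simplified sentence-level BLEU@4 against a single reference.

    Assumptions (not taken from the paper): whitespace tokenization,
    uniform weights over 1- to 4-grams, and no smoothing, so any
    n-gram order with zero matches yields a score of 0.
    """
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: penalize candidates no longer than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(ref) / len(cand))

    # Geometric mean of the four clipped precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    ref = "a man is playing a guitar"
    print(bleu4("a man is playing a piano", ref))
```

A caption identical to its reference scores 1.0 under this sketch, and a caption that shares no 4-gram with the reference scores 0.0. Reported percentages such as 56.8% correspond to this kind of score multiplied by 100 and aggregated over a test set.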

A Multimodal Aggregation Network with Serial Self-Attention Mechanism for Micro-Video Multi-Label Classification

Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Self-supervised Deep Partial Adversarial Network for Micro-Video Multimodal Classification

Multimodal Progressive Modulation Network for Micro-video Multi-label Classification

Context-aware focal alignment network for micro-video multi-label classification

Attention-enhanced Joint Learning Network for Micro-Video Venue Classification

Multimodal Deep Hierarchical Semantic-Aligned Matrix Factorization Method for Micro-Video Multi-Label Classification

Learning Dual Low-Rank Representation for Multi-Label Micro-Video Classification

SADCMF: Self-Attentive Deep Consistent Matrix Factorization for Micro-Video Multi-Label Classification

Multimodal Semantic Attention Network for Video Captioning

Multivariate Attention Network For Image Captioning

Neural Multimodal Cooperative Learning Toward Micro-Video Understanding

An Efficient Multimodal Aggregation Network for Video-Text Retrieval

CMGNet: Collaborative multi-modal graph network for video captioning

Deep Reinforcement Learning Visual-Text Attention for Multimodal Video Classification

Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification

Multi-View Attention Network for Remote Sensing Image Captioning

Deep Matrix Factorization with Complementary Semantic Aggregation for Micro-Video Multi-Label Classification

Multimodal-enhanced hierarchical attention network for video captioning

MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video

MIVCN: Multimodal interaction video captioning network based on semantic association graph