Abstract: Video captioning aims to automatically generate descriptive sentences for a given video, establishing an association between visual content and natural language. It has attracted great attention and plays a significant role in many practical applications. Previous research focuses mainly on caption generation, neglecting the alignment of multimodal features and simply concatenating them. In addition, video feature extraction is usually performed offline, so the extracted features may not be adapted to the subsequent caption generation task. To make the extracted features better suited to downstream caption generation and to address multimodal semantic alignment and fusion, we propose an end-to-end center-enhanced video captioning model with multimodal semantic alignment, which integrates feature extraction and caption generation into a unified framework. To enhance the completeness of semantic features, we design a center enhancement strategy in which visual-textual deep joint semantic features are captured via incremental clustering; the cluster centers then serve as guidance for caption generation. Moreover, we promote visual-textual multimodal alignment and fusion by learning the visual and textual representations in a shared latent semantic space, thereby alleviating the multimodal misalignment problem. Experimental results on two popular datasets, MSVD and MSR-VTT, demonstrate that the proposed model outperforms state-of-the-art methods and produces higher-quality captions.
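
The abstract does not specify how the incremental clustering is carried out, so the following is only a minimal sketch of one generic way to maintain cluster centers online: each incoming joint feature is assigned to its most similar center by cosine similarity, and that center is moved toward it with a running mean. The class name IncrementalCenters, the threshold parameter, and the rule for opening a new cluster are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class IncrementalCenters:
    """Running-mean cluster centers, updated one feature at a time.

    Hypothetical helper: the name, the threshold, and the update rule are
    illustrative assumptions, not taken from the paper.
    """

    def __init__(self, threshold=0.8):
        # Assumed similarity cutoff below which a new cluster is opened.
        self.threshold = threshold
        self.centers = []
        self.counts = []

    def update(self, feature):
        """Assign `feature` to a center (or open a new one); return its index."""
        best, best_sim = -1, -1.0
        for i, center in enumerate(self.centers):
            sim = cosine(feature, center)
            if sim > best_sim:
                best, best_sim = i, sim

        # No sufficiently similar center: start a new cluster at this feature.
        if best < 0 or best_sim < self.threshold:
            self.centers.append(list(feature))
            self.counts.append(1)
            return len(self.centers) - 1

        # Otherwise move the matched center toward the feature (running mean).
        self.counts[best] += 1
        n = self.counts[best]
        self.centers[best] = [
            c + (x - c) / n for c, x in zip(self.centers[best], feature)
        ]
        return best
```

In a captioning model of the kind described, the resulting centers could be passed to the decoder as additional guidance vectors; how the paper actually does this is not stated in the abstract.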

Hierarchical Attention-Based Multimodal Fusion for Video Captioning

Multimodal-enhanced hierarchical attention network for video captioning

Attention-based Visual-Audio Fusion for Video Caption Generation

Attention-Based Multimodal Fusion for Video Description

Multimodal Semantic Attention Network for Video Captioning

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Learning Multimodal Attention LSTM Networks for Video Captioning

Event-centric multi-modal fusion method for dense video captioning

Integrating both Visual and Audio Cues for Enhanced Video Caption

Research on Feature Extraction and Multimodal Fusion of Video Caption Based on Deep Learning

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Center-enhanced video captioning model with multimodal semantic alignment

Multimodal Sentiment Analysis Based on Composite Hierarchical Fusion

Modality attention fusion model with hybrid multi-head self-attention for video understanding

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Multimodal feature fusion based on object relation for video captioning

Multimodal Sentiment Analysis Based on Cross-Modal Attention and Gated Cyclic Hierarchical Fusion Networks

Hierarchical Attention Networks for Multimodal Machine Learning

A cross modal hierarchical fusion multimodal sentiment analysis method based on multi-task learning

Multimodal Memory Modelling for Video Captioning