Abstract: Automatically generating video captions is an active and flourishing research topic that involves complex interactions between visual features and natural language generation. The attention mechanism obtains the key visual information corresponding to each word by removing redundant information. However, existing visual attention methods are only indirectly guided by the hidden state of the language model and ignore the interactions between the visual features obtained by the attention mechanisms. Because of incomplete objects or interfering noise, an attention mechanism operating on frame features struggles to find the correct regions of interest that are closely related to the motion state. Worse still, at each time step the hidden states have no access to the posterior decoding states, so the future predicted information is not fully utilized, which leads to a lack of detailed context-aware information. In this paper, we propose a novel video captioning framework, the Memory-attended Semantic Context-aware Network (MaSCN), to capture the adjacent sequential dependencies across multiple time stamps between the different visual-feature outputs. To exploit pivotal features from coarse-grained to fine-grained, we introduce an attention module in MaSCN that uses correspondingly tailored Visual Semantic LSTM (VSLSTM) layers to map visual relationship information more precisely through a multi-level attention mechanism. In addition, we integrate the visual features obtained through the attention mechanism by late fusion. A visual semantic loss is used to explicitly memorize contextual information and capture fine-grained detailed cues. Extensive experiments on the MSVD and MSR-VTT datasets demonstrate the effectiveness of our method compared with state-of-the-art approaches.

Attention-based LSTM with Semantic Consistency for Videos Captioning

Video Captioning With Attention-Based LSTM and Semantic Consistency

Richer Semantic Visual and Language Representation for Video Captioning

Video Captioning with Transferred Semantic Attributes

Rich Visual and Language Representation with Complementary Semantics for Video Captioning

Learning Multimodal Attention LSTM Networks for Video Captioning

Residual Attention-Based LSTM for Video Captioning

Describing Video with Attention-Based Bidirectional LSTM

Attention based CNN-LSTM network for video caption

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

Fused GRU with Semantic-Temporal Attention for Video Captioning

Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning

Exploiting long-term temporal dynamics for video captioning

Hierarchical LSTMs with Adaptive Attention for Visual Captioning

Memory-attended semantic context-aware network for video captioning

Multi-guiding Long Short-Term Memory for Video Captioning

A novel Multi-Layer Attention Framework for visual description prediction using bidirectional LSTM

Bidirectional Long-Short Term Memory for Video Description

Video Captioning with Semantic Guiding

Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning