Abstract:

Overview. In this paper, we propose a novel Two-Stream Fusion (denoted as “TSF”) video captioning model for the 2nd Microsoft Research Video to Text (MSR-VTT) Challenge [3, 4]. The core contribution of this model is to jointly discover and integrate the dynamics of both the visual stream (i.e., Resnet101 features [2]) and the semantic stream (i.e., high-level attributes) for video captioning.

TSF. We present an overview of the proposed TSF model in Figure 1. Intuitively, a video can be represented by complementary visual and semantic descriptors. The visual descriptor encodes the appearance information depicted in each frame, while the semantic descriptor encodes each video frame with a high-level representation of semantic attributes (i.e., objects (nouns), motions (verbs), and properties (adjectives)). Given these two complementary sources of information, our model first treats each modality as a separate stream and uses a two-stream network to encode the dynamics of each modality individually. In particular, we use the attention-LSTM unit [5] to enhance the feature learning of each stream. Then, a “combine unit” linearly fuses the dynamics of the two streams for sentence generation. A softmax layer produces the probability distribution over the words.

Feature Extraction. We selected 50 equally spaced frames from each video. For the visual features, we used a pretrained Resnet101 model [2] to obtain 2,048-dimensional frame-wise visual features, extracted from the ‘pool5’ layer. For the semantic features, we used a set of attributes to represent the visual content in each frame. Specifically, we used MSCOCO as an additional dataset and selected the 256 most frequent words from the training captions as the high-level attributes. Then, we associated each MSCOCO image with a set of attributes according to its captions. The attribute detectors were trained with binary SVMs [1].
Finally, the SVM [1] predictions were aggregated into a 256-way vector and used as the frame-wise semantic representation.

Implementation. We randomly selected 9,000 videos as the training set and 1,000 videos as the validation set. Each word in the sentence was represented as a “one-hot” vector. The word
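The frame sampling and attribute-vocabulary steps described under Feature Extraction can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the segment-centre sampling rule is one possible reading of "equally spaced", and the whitespace tokenization without stop-word filtering is an assumption, since the text does not specify how words were counted.

```python
from collections import Counter


def sample_frame_indices(num_frames: int, num_samples: int = 50) -> list[int]:
    """Pick num_samples equally spaced frame indices (assumed rule: segment centres)."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Split the video into num_samples equal-width segments and take each centre.
    # Videos shorter than num_samples frames repeat indices, so the length is fixed.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def build_attribute_vocab(captions: list[str], size: int = 256) -> list[str]:
    """Return the `size` most frequent words in the captions (assumed tokenization)."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return [word for word, _ in counts.most_common(size)]


def attribute_labels(image_captions: list[str], vocab: list[str]) -> list[int]:
    """Multi-hot target: 1 if the attribute word occurs in any caption of the image."""
    present = {word for caption in image_captions for word in caption.lower().split()}
    return [1 if word in present else 0 for word in vocab]


if __name__ == "__main__":
    indices = sample_frame_indices(300)
    print(len(indices), indices[:3], indices[-1])

    vocab = build_attribute_vocab(["a dog runs", "a dog sits", "a cat"], size=2)
    print(vocab, attribute_labels(["the dog barks"], vocab))
```

The multi-hot targets produced by `attribute_labels` would serve as per-attribute binary labels when training one detector per vocabulary word. Stacking the detector outputs for a frame then yields a vector with one entry per attribute, which matches the 256-way semantic representation described above.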
