Abstract:

Overview. In this paper, we propose a novel Two-Stream Fusion (denoted as “TSF”) video captioning model for the 2nd Microsoft Research Video to Text (MSR-VTT) Challenge [3, 4]. The core contribution of this model is to jointly discover and integrate the dynamics of both the visual stream (i.e., ResNet-101 features [2]) and the semantic stream (i.e., high-level attributes) for video captioning.

TSF. We present an overview of the proposed TSF model in Figure 1. Intuitively, a video can be represented by two complementary descriptors, one visual and one semantic. The visual descriptor encodes the appearance information depicted in each frame, while the semantic descriptor encodes each video frame with a high-level representation of semantic attributes (i.e., objects (nouns), motions (verbs) and properties (adjectives)). Given these two complementary sources of information, our model first treats each modality as a separate stream and uses a two-stream network to encode the dynamics of each modality individually. In particular, we use the attention-LSTM unit [5] to enhance the feature learning within each stream. Then, a “combine unit” linearly fuses the dynamics of the two streams for sentence generation. A softmax layer produces the probability distribution over the words.

Feature Extraction. We selected 50 equally spaced frames from each video. For the visual features, we used a pretrained ResNet-101 model [2] to obtain 2,048-dimensional frame-wise visual features, which were extracted from the ‘pool5’ layer. For the semantic features, we used a set of attributes to represent the visual content in each frame. In particular, we used MSCOCO as an additional dataset and selected the 256 most frequent words from its training captions as the high-level attributes. Then, we associated each MSCOCO image with a set of attributes according to its captions. The attribute detectors were trained as binary SVMs [1]. Finally, the SVM [1] predictions were aggregated into a 256-way vector and used as the frame-wise semantic representation.

Implementation. We randomly selected 9,000 videos as the training set and 1,000 videos as the validation set. Each word in the sentence was represented as a “one-hot” vector. The word
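
The abstract says only that the “combine unit” fuses the two streams linearly and that a softmax layer yields the word distribution; it does not give the equations. The following is a minimal sketch under that description, not the authors' implementation. It assumes the fusion is a learned weighted sum of the two stream states at one decoding step. All names (`combine_unit`, `W_v`, `W_s`, `W_out`, the toy sizes) are hypothetical, and the random weights stand in for trained parameters.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def combine_unit(h_visual, h_semantic, W_v, W_s, b):
    """Assumed linear fusion: weighted sum of the two stream states plus a bias."""
    return W_v @ h_visual + W_s @ h_semantic + b


def word_distribution(h_visual, h_semantic, p):
    """Fuse the two stream states, then map the result to word probabilities."""
    fused = combine_unit(h_visual, h_semantic, p["W_v"], p["W_s"], p["b"])
    return softmax(p["W_out"] @ fused + p["b_out"])


rng = np.random.default_rng(0)
hidden_size, vocab_size = 8, 20  # toy sizes chosen for illustration only

# Random placeholders for parameters that would normally be learned.
params = {
    "W_v": rng.normal(size=(hidden_size, hidden_size)),
    "W_s": rng.normal(size=(hidden_size, hidden_size)),
    "b": np.zeros(hidden_size),
    "W_out": rng.normal(size=(vocab_size, hidden_size)),
    "b_out": np.zeros(vocab_size),
}

h_visual = rng.normal(size=hidden_size)    # state of the visual stream
h_semantic = rng.normal(size=hidden_size)  # state of the semantic stream
probs = word_distribution(h_visual, h_semantic, params)
```

Keeping the fusion linear means each stream's contribution to the word logits stays additive, which is one plausible reading of “linearly perform two-stream dynamic fusion”.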

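The frame sampling and attribute selection steps in the Feature Extraction paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: captions are tokenized by whitespace, no part-of-speech or stop-word filtering is applied, and the function names are hypothetical. Training the binary SVM detectors on these labels is not shown.

```python
from collections import Counter


def equally_spaced_indices(num_frames, num_samples=50):
    """Return `num_samples` frame indices spread evenly over a video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # The step spans from the first to the last frame; videos shorter than
    # num_samples will repeat some indices.
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def build_attribute_vocab(captions, vocab_size=256):
    """Pick the `vocab_size` most frequent words across all captions."""
    counts = Counter(word for cap in captions for word in cap.lower().split())
    return [word for word, _ in counts.most_common(vocab_size)]


def attribute_labels(image_captions, vocab):
    """Multi-hot list: 1 where an attribute word occurs in any caption of the image."""
    words = {word for cap in image_captions for word in cap.lower().split()}
    return [1 if attr in words else 0 for attr in vocab]
```

For a 300-frame video, `equally_spaced_indices(300)` returns 50 indices that start at frame 0 and end at frame 299. Each label list from `attribute_labels` would serve as the per-attribute binary targets when training one detector per attribute.
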
VideoTRM: Pre-training for Video Captioning Challenge 2020

TransVOS: Video Object Segmentation with Transformers

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

TEVL: Trilinear Encoder for Video-language Representation Learning

CLIP4Caption: CLIP for Video Caption

Memory-enhanced Hierarchical Transformer for Video Paragraph Captioning

TVT: Two-View Transformer Network for Video Captioning

Early Embedding and Late Reranking for Video Captioning

MSR Video to Language Challenge

TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning

AutoTVG: A New Vision-language Pre-training Paradigm for Temporal Video Grounding

Video Captioning Using Global-Local Representation

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training

Text-Conditioned Resampler For Long Form Video Understanding

Multi-scale features with temporal information guidance for video captioning

Video captioning based on vision transformer and reinforcement learning

Visual-Text Reference Pretraining Model for Image Captioning

Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

RESTHT: relation-enhanced spatial–temporal hierarchical transformer for video captioning

Multirate Multimodal Video Captioning
