MSR Video to Language Challenge
Ning Xu,Junnan Li,Yang Li,An-An Liu,Yongkang Wong,Weizhi Nie,Yuting Su,Mohan S. Kankanhalli
2016-01-01
Abstract: Overview. In this paper, we propose a novel Two-Stream Fusion (denoted as “TSF”) video captioning model for the 2nd Microsoft Research Video to Text (MSR-VTT) Challenge [3, 4]. The core contribution of this model is to jointly discover and integrate the dynamics of both the visual stream (i.e., ResNet-101 features [2]) and the semantic stream (i.e., high-level attributes) for video captioning.
TSF. We present an overview of the proposed TSF model in Figure 1. Intuitively, a video can be represented by complementary visual and semantic descriptors. The visual descriptor encodes the appearance information depicted in each frame, while the semantic descriptor encodes each video frame with a high-level representation of semantic attributes (i.e., objects (nouns), motions (verbs) and properties (adjectives)). Given these two complementary sources of information, our model first treats each modality as a separate stream and uses a two-stream network to encode the dynamics of each modality individually. In particular, we use the attention-LSTM unit [5] to enhance the feature learning within each stream. A “combine unit” then linearly fuses the dynamics of the two streams for sentence generation, and a softmax layer yields the probability distribution over the words.
Feature Extraction. We selected 50 equally spaced frames from each video. For the visual features, we used a pretrained ResNet-101 model [2] to obtain 2,048-dimensional frame-wise visual features, which were extracted from the ‘pool5’ layer. For the semantic features, we used a set of attributes to represent the visual content in each frame. In particular, we used MSCOCO as an auxiliary dataset and selected the 256 most frequent words from its training captions as the high-level attributes. Then, we associated each MSCOCO image with a set of attributes according to its captions. The attribute detectors were trained as binary SVMs [1].
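The frame sampling and attribute labelling steps described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the rounding rule for frame indices, the whitespace tokenization, and the optional stopword filter are all assumptions made for illustration, since the text does not specify them.

```python
from collections import Counter


def sample_frame_indices(num_frames, num_samples=50):
    """Return num_samples equally spaced frame indices in [0, num_frames - 1].

    Assumption: indices are rounded to the nearest integer; short videos
    (num_frames < num_samples) simply yield repeated indices.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def build_attribute_vocab(captions, vocab_size=256, stopwords=frozenset()):
    """Pick the vocab_size most frequent caption words as attributes.

    Assumption: captions are lowercased and split on whitespace; the
    stopword filter is hypothetical and not described in the text.
    """
    counts = Counter(
        word
        for caption in captions
        for word in caption.lower().split()
        if word not in stopwords
    )
    return [word for word, _ in counts.most_common(vocab_size)]


def attribute_labels(image_captions, vocab):
    """Multi-hot label vector: 1 if an attribute occurs in any caption."""
    present = {w for c in image_captions for w in c.lower().split()}
    return [1 if word in present else 0 for word in vocab]


# Toy usage with made-up captions (not MSCOCO data).
toy_captions = ["a dog runs", "a dog sleeps", "a cat sleeps"]
toy_vocab = build_attribute_vocab(toy_captions, vocab_size=2, stopwords={"a"})
toy_labels = attribute_labels(["a cat sleeps"], toy_vocab)
toy_indices = sample_frame_indices(99, 50)
```

Under these assumptions, each column of the label matrix would serve as the binary target for one attribute detector, so a 256-word vocabulary gives 256 independent binary classification problems.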
Finally, the SVM [1] predictions were aggregated into a 256-way vector and used as the frame-wise semantic representation.
Implementation. We randomly selected 9,000 videos as the training set and 1,000 videos as the validation set. Each word in the sentence was represented as a “one-hot” vector. The word