Exploiting Images for Video Recognition: Heterogeneous Feature Augmentation Via Symmetric Adversarial Learning.

Feiwu Yu,Xinxiao Wu,Jialu Chen,Lixin Duan
DOI: https://doi.org/10.1109/tip.2019.2917867
IF: 10.6
2019-01-01
IEEE Transactions on Image Processing
Abstract:Training deep models of video recognition usually requires sufficient labeled videos in order to achieve good performance without over-fitting. However, it is quite labor-intensive and time-consuming to collect and annotate a large amount of videos. Moreover, training deep neural networks on large-scale video datasets always demands huge computational resources which further hold back many researchers and practitioners. To resolve that, collecting and training on annotated images are much easier. However, thoughtlessly applying images to help recognize videos may result in noticeable performance degeneration due to the well-known domain shift and feature heterogeneity. This proposes a novel symmetric adversarial learning approach for heterogeneous image-to-video adaptation, which augments deep image and video features by learning domain-invariant representations of source images and target videos. Primarily focusing on an unsupervised scenario where the labeled source images are accompanied by unlabeled target videos in the training phrase, we present a data-driven approach to respectively learn the augmented features of images and videos with superior transformability and distinguishability. Starting with learning a common feature space (called image-frame feature space) between images and video frames, we then build new symmetric generative adversarial networks (Sym-GANs) where one GAN maps image-frame features to video features and the other maps video features to image-frame features. Using the Sym-GANs, the source image feature is augmented with the generated video-specific representation to capture the motion dynamics while the target video feature is augmented with the image-specific representation to take the static appearance information. Finally, the augmented features from the source domain are fed into a network with fully connected layers for classification. Thanks to an end-to-end training procedure of the Sym-GANs and the classification network, our approach achieves better results than other state-of-the-arts, which is clearly validated by experiments on two video datasets, i.e., the UCF101 and HMDB51 datasets.
What problem does this paper attempt to address?