Synthesizing Videos from Images for Image-to-Video Adaptation

Junbao Zhuo, Xingyu Zhao, Shuhui Wang, Huimin Ma, Qingming Huang
DOI: https://doi.org/10.1145/3581783.3611897
2023-01-01
Abstract: We address the image-to-video adaptation task, which aims to leverage labeled images and unlabeled videos for video recognition. This task poses two major challenges: the domain discrepancy between the source and target domains, and the modality gap between images and videos. Existing methods mainly follow a two-stage paradigm, first adopting frame-level adaptation to reduce the domain discrepancy and then learning a spatio-temporal model to bridge the modality gap. In this paper, we provide a new perspective and propose a single-stage method that synthesizes videos from static source images, converting the image-to-video adaptation problem into a video-to-video adaptation problem. With the synthesized videos, we present a simple baseline in which a spatio-temporal model is trained with a cross-entropy loss on source labels and the Batch Nuclear-norm Maximization loss, which encourages the classification responses of target videos to maintain discriminability and diversity. We further propose a new pseudo-label generation method that inherits the robustness of class prototypes and the effectiveness of the small-loss criterion. Based on the constructed baseline and the proposed pseudo-label generation method, we train a model that achieves state-of-the-art or comparable performance on three standard benchmarks. Our code is publicly available at https://github.com/junbaoZHUO/ST-I2V.
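The baseline objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository: the function names and the `bnm_weight` trade-off parameter are hypothetical, and the Batch Nuclear-norm Maximization term is written in its commonly used form, the negative nuclear norm of the batch prediction matrix divided by the batch size.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(source_logits, labels):
    """Mean cross-entropy of source predictions against integer labels."""
    probs = softmax(source_logits)
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked + 1e-12).mean()


def bnm_loss(target_logits):
    """Negative nuclear norm of the (batch x classes) prediction matrix,
    divided by the batch size. Minimizing it rewards predictions that are
    both confident (discriminable) and spread over classes (diverse)."""
    probs = softmax(target_logits)
    nuclear_norm = np.linalg.norm(probs, ord="nuc")  # sum of singular values
    return -nuclear_norm / probs.shape[0]


def baseline_objective(source_logits, labels, target_logits, bnm_weight=1.0):
    """Hypothetical combined objective; bnm_weight is an assumed parameter."""
    return cross_entropy(source_logits, labels) + bnm_weight * bnm_loss(target_logits)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(8, 4))      # logits for synthesized source videos
    tgt = rng.normal(size=(8, 4))      # logits for unlabeled target videos
    y = rng.integers(0, 4, size=8)     # source labels
    print(baseline_objective(src, y, tgt))
```

In this sketch, a target batch whose predictions are confident and cover all classes gets a lower loss than a batch that collapses onto one class, which in turn scores lower than a batch of uniform predictions.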