DistInit: Learning Video Representations Without a Single Labeled Video

Rohit Girdhar, Du Tran, Lorenzo Torresani, Deva Ramanan
DOI: https://doi.org/10.48550/arXiv.1901.09244
2019-08-20
Abstract: Video recognition models have progressed significantly over the past few years, evolving from shallow classifiers trained on hand-crafted features to deep spatiotemporal networks. However, labeled video data required to train such models have not been able to keep up with the ever-increasing depth and sophistication of these networks. In this work, we propose an alternative approach to learning video representations that requires no semantically labeled videos and instead leverages the years of effort in collecting and labeling large and clean still-image datasets. We do so by using state-of-the-art models pre-trained on image datasets as "teachers" to train video models in a distillation framework. We demonstrate that our method learns truly spatiotemporal features, despite being trained only using supervision from still-image networks. Moreover, it learns good representations across different input modalities, using completely uncurated raw video data sources and with different 2D teacher models. Our method obtains strong transfer performance, outperforming standard techniques for bootstrapping video architectures with image-based models by 16%. We believe that our approach opens up new approaches for learning spatiotemporal representations from unlabeled video data.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the shortage of labeled data in video understanding. The authors propose DistInit, a method that trains video models using large-scale, well-annotated image datasets and no semantically annotated video data. The core idea is to use a pre-trained image model as a "teacher" that guides the video model, so that the video model learns spatiotemporal features and transfers well to video understanding tasks even without explicit action labels.

### Main Contributions

1. **No Annotated Videos Required**: By leveraging existing large-scale annotated image datasets (such as ImageNet), the method trains video models without using any annotated video data.
2. **Cross-Modal Knowledge Transfer**: The method transfers knowledge learned by the image model to the video model, applies to different input modalities, and can use completely uncurated raw video data sources.
3. **Performance Improvement**: According to the abstract, DistInit outperforms standard techniques for bootstrapping video architectures with image-based models by 16%.

### Method Overview

- **Framework Design**: DistInit uses a pre-trained 2D image model as a "teacher" to guide the learning of a 3D video model. Frames are randomly selected from the video, the 2D model generates soft labels for them, and the video model is trained to match these soft labels.
- **Loss Function**: A cross-entropy loss minimizes the difference between the distribution predicted by the student model and the distribution generated by the teacher model.
- **Multi-Source Knowledge Distillation**: Performance can be further improved by combining teacher models from multiple different tasks (such as object classification and scene classification).

### Experimental Results

- **HMDB-51 Dataset**: On HMDB-51, DistInit outperforms training from scratch and initializing from inflated 2D models, with reported improvements of 15% and 4%, respectively.
- **UCF-101 Dataset**: Similar performance improvements were achieved on UCF-101.
- **Visualization Analysis**: Visualizations of the learned filters and feature representations indicate that DistInit learns genuine temporal dynamics rather than simply replicating static image features.

### Conclusion

This paper proposes a method that trains video models using large-scale annotated image datasets, addressing the shortage of labeled data in video understanding. Experimental results show performance improvements on multiple benchmark datasets, suggesting a new direction for learning video representations from unlabeled video.
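
As a rough illustration of the distillation step described under Method Overview, the following sketch computes teacher soft labels from sampled frames and the cross-entropy loss a student would minimize. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names (`teacher_soft_labels`, `distillation_loss`, `multi_teacher_loss`) are hypothetical, the teacher is represented only by precomputed per-frame logits, averaging the per-frame distributions is an assumption, and summing one loss per teacher is just one plausible way to combine several teachers.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]


def teacher_soft_labels(frame_logits, num_frames, rng):
    """Build soft labels for one clip from randomly sampled frames.

    `frame_logits` stands in for the output of a pre-trained 2D teacher:
    one list of class logits per video frame. Averaging the per-frame
    distributions is an assumption of this sketch.
    """
    chosen = rng.sample(range(len(frame_logits)), num_frames)
    probs = [softmax(frame_logits[i]) for i in chosen]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_frames for c in range(num_classes)]


def distillation_loss(student_logits, soft_labels):
    """Cross-entropy between teacher soft labels and the student prediction."""
    log_probs = log_softmax(student_logits)
    return -sum(t * lp for t, lp in zip(soft_labels, log_probs))


def multi_teacher_loss(student_logits_per_head, soft_labels_per_teacher):
    """Sum of per-teacher losses, one student head per teacher (assumed)."""
    return sum(
        distillation_loss(logits, labels)
        for logits, labels in zip(student_logits_per_head, soft_labels_per_teacher)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy teacher outputs: 4 frames, 3 classes.
    frame_logits = [
        [2.0, 0.5, -1.0],
        [1.5, 1.0, -0.5],
        [2.5, 0.0, -1.5],
        [1.0, 1.0, 0.0],
    ]
    targets = teacher_soft_labels(frame_logits, num_frames=2, rng=rng)
    print("soft labels:", targets)
    print("loss:", distillation_loss([0.1, 0.2, 0.3], targets))
```

In a real training loop the student logits would come from a 3D video network applied to the whole clip, and the loss would be backpropagated through the student only, with the teacher kept fixed. The cross-entropy is smallest when the student's predicted distribution equals the teacher's soft labels, which is what drives the student to reproduce the teacher's outputs.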