Video-to-Image Casting: A Flatting Method for Video Analysis.

Xu Chen, Chenqiang Gao, Feng Yang, Xiaohan Wang, Yi Yang, Yahong Han
DOI: https://doi.org/10.1145/3474085.3475424
2021-01-01
Abstract: Previous mainstream video analysis methods, especially 3D CNN-based models, mainly aim to transfer frameworks from the image domain to the video domain, and they follow the regime that has succeeded in image processing, i.e., large-scale benchmarks and deep networks. However, processing videos is still time-consuming due to the increased computational cost. In this paper, we propose to flatten the video and construct a Spatio-temporal Image (STI), i.e., to squeeze the temporal dimension into a spatial plane. To pursue video-level modeling with an efficient architecture, we devise a Collective Convolution (CoConv) operation to replace the 2D convolution. With a holistic sampling strategy, this novel operation can extract a video-level spatio-temporal representation. Moreover, we ensure that each CoConv operation has the same number of parameters as the original 2D filter, so a 2D network equipped with CoConv can analyze videos without additional computation. To verify the effectiveness of our method for general video analysis, we evaluate it on three typical tasks, i.e., supervised action recognition, self-supervised action recognition, and dynamic texture recognition. Extensive experimental results show that our method achieves comparable or state-of-the-art performance on these benchmarks while using far less computation than its 3D counterpart.
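To make the idea of squeezing the temporal dimension into a spatial plane concrete, the following is a minimal sketch, not the authors' implementation. It assumes the frames are tiled in row-major order into a grid; the abstract does not specify the STI layout, so the function name `flatten_video` and the parameters `grid_rows` and `grid_cols` are hypothetical. CoConv itself is not reproduced here.

```python
import numpy as np


def flatten_video(video, grid_rows, grid_cols):
    """Tile T frames of shape (H, W, C) into one 2D image.

    Assumption (not from the paper): frames are laid out in row-major
    order on a grid_rows x grid_cols grid, giving an image of shape
    (grid_rows * H, grid_cols * W, C).
    """
    t, h, w, c = video.shape
    if t != grid_rows * grid_cols:
        raise ValueError("grid_rows * grid_cols must equal the frame count")
    # (T, H, W, C) -> (rows, cols, H, W, C)
    grid = video.reshape(grid_rows, grid_cols, h, w, c)
    # Interleave grid and pixel axes: (rows, H, cols, W, C)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Merge each (grid, pixel) axis pair into one spatial axis
    return grid.reshape(grid_rows * h, grid_cols * w, c)


# Four tiny frames of size 2x3 with one channel, tiled on a 2x2 grid.
video = np.arange(4 * 2 * 3 * 1).reshape(4, 2, 3, 1)
sti = flatten_video(video, grid_rows=2, grid_cols=2)
```

Under this assumed layout, the single image `sti` can be passed to an ordinary 2D network, which is the setting in which the abstract says CoConv replaces the 2D convolution.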