Abstract: In recent years, video action recognition, a fundamental task in video understanding, has been explored in depth by numerous researchers. Most traditional video action recognition methods convert videos into three-dimensional data that encapsulates both spatial and temporal information, and then use prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. First, image understanding models often need to be adapted in both architecture and preprocessing before they can handle these spatiotemporal tasks. Second, high-dimensional data often poses greater challenges and incurs higher time costs than its lower-dimensional counterparts. To bridge the gap between image-understanding and video-understanding tasks while reducing the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling. Specifically, by applying specific flattening operations (e.g., row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and ordinary image understanding models are then used to capture temporal dynamics and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet) demonstrate that embedding Flatten yields significant performance improvements over the original models.
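The flattening step described in the abstract can be illustrated with a minimal sketch. It assumes that the row-major transform tiles the T frames of a clip into a grid, left to right and then top to bottom, so that frame t lands in grid cell (t // cols, t % cols). The function name `flatten_row_major`, the `cols` parameter, and the nested-list frame format are hypothetical choices made for this illustration and are not taken from the paper.

```python
from typing import List

Frame = List[List[int]]  # one H x W single-channel frame (hypothetical format)


def flatten_row_major(frames: List[Frame], cols: int) -> Frame:
    """Tile T frames into one (rows*H) x (cols*W) image in row-major order.

    Assumption for illustration: frame t is placed at grid cell
    (t // cols, t % cols), and all frames share the same height and width.
    """
    if cols <= 0 or not frames or len(frames) % cols != 0:
        raise ValueError("number of frames must be a positive multiple of cols")
    height = len(frames[0])
    image: Frame = []
    # Walk over the clip one grid row (cols consecutive frames) at a time.
    for start in range(0, len(frames), cols):
        grid_row = frames[start:start + cols]
        # Concatenate the y-th pixel row of every frame in this grid row.
        for y in range(height):
            line: List[int] = []
            for frame in grid_row:
                line.extend(frame[y])
            image.append(line)
    return image


# Four tiny 2x2 frames, each filled with its own time index.
video = [[[t] * 2 for _ in range(2)] for t in range(4)]
image = flatten_row_major(video, cols=2)
for line in image:
    print(line)
```

The resulting 2D image keeps each frame spatially intact while encoding time as grid position, so under this assumption an unmodified image classifier can receive it as an ordinary input.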

Video-to-Image Casting: A Flatting Method for Video Analysis.

Spatio-Temporal Deformable Convolution for Compressed Video Quality Enhancement

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Dynamic Spatio-Temporal Feature Learning via Graph Convolution in 3D Convolutional Networks

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation.

Spatio-Temporal Collaborative Module for Efficient Action Recognition

Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?

Dynamic information enhancement for video classification

CoSTA: End-to-End Comprehensive Space-Time Entanglement for Spatio-Temporal Video Grounding

No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Flatten: Video Action Recognition is an Image Classification task

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition.

Exploiting Temporal Consistency for Real-Time Video Depth Estimation

Dynamic and Compressive Adaptation of Transformers From Images to Videos

Efficient Video Transformers with Spatial-Temporal Token Selection

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

CAST: Cross-Attention in Space and Time for Video Action Recognition

A Real-Time Action Representation With Temporal Encoding and Deep Compression

Temporal-attentive Covariance Pooling Networks for Video Recognition