Abstract: Videos naturally contain dynamic variation along the temporal axis, which causes the same visual clues (e.g., semantics, objects) to change in scale, position, and perspective between adjacent frames. A primary trend in video CNNs is to adopt spatial 2D convolution for spatial semantics and temporal 1D convolution for temporal dynamics. Though this direction achieves a favorable balance between efficiency and efficacy, it suffers from misalignment of visual clues with large displacements. In particular, rigid temporal convolution fails to capture correct motion when a specific target moves out of the receptive field of the temporal convolution between adjacent frames. To tackle large visual displacements between temporal neighbors, we propose a new temporal convolution named Hourglass Convolution (HgC). The temporal receptive field of HgC has an hourglass shape: the spatial receptive field is enlarged in the prior and post temporal frames, which enables HgC to capture large displacements. Moreover, since videos contain both long-term and short-term movements when viewed at multiple temporal interval levels, we organize the HgC network hierarchically to capture temporal dynamics at both the frame (short-term) and clip (long-term) levels. For efficiency, we also adopt strategies such as low resolution for short-term modeling and channel reduction for long-term modeling. With HgC, our H²CN equips off-the-shelf CNNs with a strong ability to capture spatio-temporal dynamics at negligible computation overhead. We validate the efficiency and efficacy of HgC on standard action recognition benchmarks, including Something-Something V1&V2, Diving48, and EGTEA Gaze+. We also analyse the complementarity of frame-level motion and clip-level motion with visualizations. The code and models will be available at https://github.com/ty-97/H2CN.
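
To make the hourglass-shaped receptive field concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a single-channel video stored as nested lists, uniform (unlearned) weights, a 1x1 spatial extent in the center frame, and a k x k window in the prior and post frames. The names `make_frame`, `window_mean`, and `hourglass_response` are hypothetical helpers introduced only for illustration. Setting `k=1` reduces the sketch to a rigid temporal-1D aggregation, which misses a target that shifts by one pixel per frame, whereas `k=3` still picks it up.

```python
def make_frame(height, width, dot):
    """Return a height x width frame of zeros with a single 1.0 at `dot` (y, x)."""
    frame = [[0.0] * width for _ in range(height)]
    frame[dot[0]][dot[1]] = 1.0
    return frame


def window_mean(frame, cy, cx, size):
    """Mean over a size x size window centered at (cy, cx), zero-padded at borders."""
    r = size // 2
    h, w = len(frame), len(frame[0])
    total = 0.0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yy, xx = cy + dy, cx + dx
            if 0 <= yy < h and 0 <= xx < w:
                total += frame[yy][xx]
    return total / (size * size)


def hourglass_response(video, t, y, x, k=3):
    """Aggregate an hourglass-shaped neighborhood around (t, y, x).

    Assumption for illustration: uniform weights. The center frame
    contributes one pixel, while the prior and post frames each contribute
    the mean over a k x k spatial window, so the receptive field widens
    away from the center frame.
    """
    out = video[t][y][x]
    if t - 1 >= 0:
        out += window_mean(video[t - 1], y, x, k)
    if t + 1 < len(video):
        out += window_mean(video[t + 1], y, x, k)
    return out


# A dot that moves one pixel to the right per frame: x = 2, 3, 4.
video = [make_frame(5, 5, (2, 2)), make_frame(5, 5, (2, 3)), make_frame(5, 5, (2, 4))]

rigid = hourglass_response(video, t=1, y=2, x=3, k=1)      # temporal-1D: same pixel only
hourglass = hourglass_response(video, t=1, y=2, x=3, k=3)  # enlarged neighbors

print("rigid temporal response:", rigid)
print("hourglass response:", hourglass)
```

In this toy case the rigid variant sees only the center-frame pixel, because the dot has already left the aligned position in both neighboring frames, while the enlarged windows in the prior and post frames still cover it. A learned version would replace the uniform means with trainable kernels.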

Hierarchical Hourglass Convolutional Network for Efficient Video Classification