Abstract: Temporal adaptive convolution has outperformed static convolution in video understanding. However, it remains limited in long-range temporal modeling and in adapting to multi-scale feature maps. To address these limitations, we introduce spatio-temporal hybrid adaptive convolution (STHAC), which enhances the spatio-temporal modeling capability of convolution by learning a set of spatio-temporal calibration filters that mitigate the spatial invariance intrinsic to static convolution. Specifically, STHAC learns a linear combination of N adaptive filters using two lightweight attention branches that run in parallel. The resulting linearly mixed filters incorporate spatial multi-scale prior knowledge and long-range temporal dependencies. These calibration filters modulate the static convolutional weights of each frame, endowing static convolution with spatial multi-scale adaptability and long-range temporal modeling capability. Compared with other dynamic convolution methods, the proposed calibration filters require fewer parameters and incur lower computational cost. We further introduce an omni-dimensional aggregation module to strengthen the spatio-temporal modeling capacity of STHAC. Combined with STHAC, this module forms the spatio-temporal adaptive module (STAM), which can replace static convolution. To validate the approach, we implement a spatio-temporal dynamic network based on STAM. Experimental results indicate that our model is competitive with state-of-the-art convolutional neural network architectures on action recognition benchmarks such as Kinetics-400 (K400) and Something-Something V2 (SSV2).
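The calibration idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the branch designs, so the strided-pooling spatial branch, the global temporal average, the residual form `1 + ...`, and all names (`calibrated_weights`, `bank`, `proj_s`, `proj_t`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def calibrated_weights(x, static_w, bank, proj_s, proj_t):
    """Hypothetical per-frame weight calibration.

    x:        (T, C, H, W) video feature map
    static_w: (C_out, C, k, k) shared static convolution weights
    bank:     (N, C) bank of N channel-wise calibration filters (assumed form)
    proj_s:   (C, N) spatial-branch projection (assumed)
    proj_t:   (C, N) temporal-branch projection (assumed)
    """
    # Spatial branch (assumed): average descriptors pooled at two scales.
    fine = x.mean(axis=(2, 3))                    # (T, C)
    coarse = x[:, :, ::2, ::2].mean(axis=(2, 3))  # (T, C), strided subsample
    desc_s = 0.5 * (fine + coarse)

    # Temporal branch (assumed): one descriptor summarising all frames,
    # standing in for long-range temporal context.
    desc_t = fine.mean(axis=0, keepdims=True)     # (1, C)

    # Mixing coefficients over the N filters, one row per frame.
    coeff = softmax(desc_s @ proj_s + desc_t @ proj_t, axis=-1)  # (T, N)

    # Linear combination of the N filters, applied as a residual scaling
    # of the static weights along the input-channel axis.
    alpha = 1.0 + coeff @ bank                    # (T, C)
    w_t = static_w[None] * alpha[:, None, :, None, None]
    return w_t, coeff                             # (T, C_out, C, k, k), (T, N)
```

In this sketch every frame receives its own copy of the convolution weights, yet only the small filter bank and the two projections are learned in addition to the static weights, which is one plausible reading of the parameter saving the abstract claims.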

STAM: a Spatio-Temporal Adaptive Module for Improving Static Convolutions in Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

STCA: an action recognition network with spatio-temporal convolution and attention

Spatio-temporal Collaborative Convolution for Video Action Recognition

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Dynamic Spatio-Temporal Feature Learning via Graph Convolution in 3D Convolutional Networks

Spatio-Temporal Collaborative Module for Efficient Action Recognition

STM: SpatioTemporal and Motion Encoding for Action Recognition

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition

Spatio-Temporal Attention Networks for Action Recognition and Detection

Spatio-Temporal Fusion Networks for Action Recognition

SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition

Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition

STAM: A SpatioTemporal Attention Based Memory for Video Prediction

Multi-scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

STST: Spatial-Temporal Specialized Transformer for Skeleton-based Action Recognition