STAM: a Spatio-Temporal Adaptive Module for Improving Static Convolutions in Action Recognition

Wei Li, Weijun Gong, Yurong Qian, Haichen Tian
DOI: https://doi.org/10.1007/s00371-023-03165-6
2024-01-01
Abstract: Temporal adaptive convolution has demonstrated superior performance over static convolution in video understanding. However, it remains limited in modeling long time series and in adapting to multi-scale feature maps. To address these challenges, we introduce spatio-temporal hybrid adaptive convolution (STHAC), designed to enhance the spatio-temporal modeling capabilities of convolution. It does so by learning a set of spatio-temporal calibration filters that mitigate the spatial invariance intrinsic to static convolution. Specifically, STHAC learns a linear combination of N adaptive filters through two lightweight attention branches running in parallel. The resulting linearly mixed filters incorporate spatial multi-scale prior knowledge and long-range temporal dependencies. These spatio-temporal calibration filters modulate the static convolutional weights of each frame, thereby endowing static convolution with spatial multi-scale adaptability and long-range temporal modeling capabilities. Compared with other dynamic convolution methods, the proposed calibration filters require fewer parameters and incur lower computational complexity. Moreover, we introduce an omni-dimensional aggregation module to augment the spatio-temporal modeling capacity of STHAC. Combined with STHAC, this aggregation module forms the spatio-temporal adaptive module (STAM), which can replace static convolution. We implement a spatio-temporal dynamic network based on STAM to validate our approach. Experimental results indicate that our model is competitive with state-of-the-art convolutional neural network architectures on action recognition benchmarks such as Kinetics-400 (K400) and Something-Something V2 (SSV2).
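The core idea of mixing N adaptive filters and using the result to modulate a static weight per frame can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the two attention branches produce their outputs, so the sketch simply takes per-frame attention logits as given, assumes a softmax over the N filters, and assumes the calibration acts as a per-input-channel scale on a 1x1 kernel. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_filters(attn_logits, filters):
    """Linearly combine N candidate calibration filters.

    attn_logits: N scores (assumed to come from attention branches).
    filters: N calibration vectors, each of length C_in.
    """
    weights = softmax(attn_logits)
    length = len(filters[0])
    return [sum(w * f[c] for w, f in zip(weights, filters))
            for c in range(length)]


def calibrate_per_frame(static_weight, frame_logits, filters):
    """Return one calibrated copy of the static weight for each frame.

    static_weight: C_out rows of C_in values (a 1x1 kernel, assumed).
    frame_logits: T lists of N attention logits, one list per frame.
    """
    calibrated = []
    for logits in frame_logits:
        alpha = mix_filters(logits, filters)  # per-frame calibration vector
        # Scale every input channel of the shared static kernel.
        calibrated.append([[w * a for w, a in zip(row, alpha)]
                           for row in static_weight])
    return calibrated


# Toy example: 2 frames, N = 2 filters, a 2x2 static weight.
static_weight = [[1.0, 2.0], [3.0, 4.0]]
filters = [[1.0, 1.0], [2.0, 0.0]]
frame_logits = [[0.0, 0.0], [10.0, -10.0]]
per_frame = calibrate_per_frame(static_weight, frame_logits, filters)
print(per_frame[0])  # [[1.5, 1.0], [4.5, 2.0]]
```

In this toy setup the first frame weights both filters equally, while the second frame is dominated by the all-ones filter and therefore stays close to the original static weight. The static kernel is shared across frames; only the small calibration vector changes, which is in line with the abstract's claim that calibration filters are cheaper than generating full dynamic kernels.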