Abstract: Contextual information is essential in action recognition. However, local operations struggle to model relations between distant elements, and directly computing the dense relations between every pair of points incurs a heavy computation and memory burden. Inspired by the recurrent 2D criss-cross attention (RCCA-2D) used in image segmentation, we propose a recurrent 3D criss-cross attention (RCCA-3D) that factorizes the global relation map into sparse relation maps, modeling long-range spatiotemporal context at low cost for video-based action recognition. Specifically, we first propose a 3D criss-cross attention (CCA-3D) module. Compared with CCA-2D, which operates only in space, it captures the spatiotemporal relationship between points that lie on the same line along the width, height, or time direction. However, simply replacing the two CCA-2Ds in RCCA-2D with our CCA-3Ds is not enough to model the spatiotemporal context in videos. Therefore, we further duplicate the CCA-3D with a recurrent mechanism that propagates the relation between points from a line to a plane and finally to the whole spatiotemporal space. To adapt RCCA-3D to action recognition, we propose a novel recurrent structure rather than directly extending the original 2D structure to 3D. In the experiments, we thoroughly analyze different RCCA-3D structures, verifying that the proposed structure is better suited to action recognition. We also compare RCCA-3D with non-local attention, showing that RCCA-3D requires 25% fewer parameters and 30% fewer FLOPs while achieving higher accuracy. Finally, when equipped with RCCA-3D, three networks achieve improved, leading performance on five RGB-based and skeleton-based datasets.
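The line-to-plane-to-volume propagation described in the abstract can be illustrated with a small counting sketch. This is not the authors' implementation or their proposed recurrent structure: it only tracks which positions can exchange information when a criss-cross pattern (same line along time, height, or width) is applied repeatedly. The grid size `shape`, the position `center`, and the helper names `criss_cross_neighbors` and `receptive_field` are hypothetical, chosen for illustration.

```python
def criss_cross_neighbors(point, shape):
    """Positions on the same line as `point` along time, height, or width (itself included)."""
    t, h, w = point
    T, H, W = shape
    neighbors = {(tt, h, w) for tt in range(T)}   # line along time
    neighbors |= {(t, hh, w) for hh in range(H)}  # line along height
    neighbors |= {(t, h, ww) for ww in range(W)}  # line along width
    return neighbors


def receptive_field(point, shape, steps):
    """Positions that can influence `point` after `steps` stacked criss-cross passes."""
    field = {point}
    for _ in range(steps):
        # Each pass lets every position already in the field pull in its own lines.
        field = set().union(*(criss_cross_neighbors(p, shape) for p in field))
    return field


shape = (4, 6, 5)   # hypothetical (T, H, W) feature-map size
T, H, W = shape
n = T * H * W
center = (1, 2, 3)  # hypothetical query position

dense_pairs = n * n                   # every position attends to every position
sparse_pairs = n * (T + H + W - 2)    # every position attends only to its three lines
sizes = [len(receptive_field(center, shape, k)) for k in (1, 2, 3)]

print("dense relations: ", dense_pairs)
print("sparse relations:", sparse_pairs)
print("positions covered after 1, 2, 3 passes:", sizes)
```

Under these assumptions, one pass covers only the three lines through the query position, two passes cover the three planes through it, and three passes cover the whole T×H×W volume, while each pass stores far fewer pairwise relations than a dense map.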

Extreme Low-Resolution Action Recognition with Confident Spatial-Temporal Attention Transfer

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Leveraging cross-resolution attention for effective extreme low-resolution video action recognition

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Human Action Recognition Using Deep Learning Methods

Online Robust Action Recognition Based on a Hierarchical Model

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition From Digital Videos Based on Deep Learning

CAST: Cross-Attention in Space and Time for Video Action Recognition

Human Action Recognition Based on Hierarchical Multi-Scale Adaptive Conv-Long Short-Term Memory Network

Semi-Coupled Two-Stream Fusion ConvNets for Action Recognition at Extremely Low Resolutions

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

STCA: an action recognition network with spatio-temporal convolution and attention

Mining Spatial and Spatio-Temporal ROIs for Action Recognition

Low-Latency Human Action Recognition with Weighted Multi-Region Convolutional Neural Network

Spatio-Temporal Adaptive Network with Bidirectional Temporal Difference for Action Recognition

Efficient spatiotemporal context modeling for action recognition

Weighted Multi-Region Convolutional Neural Network for Action Recognition with Low-Latency Online Prediction

Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes