Abstract: Contextual information is essential in action recognition. However, local operations struggle to model relations between distant elements, and directly computing dense relations between every pair of points incurs a heavy computation and memory burden. Inspired by the recurrent 2D criss-cross attention (RCCA-2D) used in image segmentation, we propose a recurrent 3D criss-cross attention (RCCA-3D) that factorizes the global relation map into sparse relation maps, modeling long-range spatiotemporal context at low cost for video-based action recognition. Specifically, we first propose a 3D criss-cross attention (CCA-3D) module. Unlike CCA-2D, which operates only in space, CCA-3D captures the spatiotemporal relationships among points that lie on the same line along the width, height, and time directions. However, simply replacing the two CCA-2D modules in RCCA-2D with CCA-3D modules is not enough to model the spatiotemporal context in videos. We therefore repeat the CCA-3D module in a recurrent mechanism that propagates relations from the points on a line to a plane and finally to the whole spatiotemporal space. To adapt RCCA-3D to action recognition, we propose a novel recurrent structure rather than directly extending the original 2D structure to 3D. In the experiments, we thoroughly analyze different RCCA-3D structures and verify that the proposed structure is better suited to action recognition. We also compare RCCA-3D with non-local attention, showing that RCCA-3D requires 25% fewer parameters and 30% fewer FLOPs while achieving even higher accuracy. Finally, three networks equipped with RCCA-3D achieve improved, leading performance on five RGB-based and skeleton-based datasets.
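The line-to-plane-to-volume propagation described in the abstract can be illustrated with a small connectivity check. The sketch below is not the authors' implementation and contains no learned attention weights. It only tracks which positions can exchange information, under the assumption that each pass lets a point attend to every point on the three axis-aligned lines through it. The helper names `criss_cross_mask` and `reach_after` and the volume size 4 x 5 x 6 are hypothetical, chosen for illustration.

```python
import numpy as np


def criss_cross_mask(T, H, W):
    """Boolean (N, N) matrix, N = T*H*W. Entry [i, j] is True when point j
    lies on one of the three axis-aligned lines through point i
    (i itself included), i.e. the two points share at least two coordinates."""
    t, h, w = np.unravel_index(np.arange(T * H * W), (T, H, W))
    same_t = t[:, None] == t[None, :]
    same_h = h[:, None] == h[None, :]
    same_w = w[:, None] == w[None, :]
    return (same_t & same_h) | (same_t & same_w) | (same_h & same_w)


def reach_after(mask, passes):
    """Boolean (N, N) matrix of which positions can influence each point
    after `passes` rounds of criss-cross aggregation (assumed connectivity
    only, no attention weights)."""
    step = mask.astype(np.int64)
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(passes):
        reach = (reach.astype(np.int64) @ step) > 0
    return reach


# Hypothetical feature-volume size: time x height x width.
T, H, W = 4, 5, 6
N = T * H * W
mask = criss_cross_mask(T, H, W)

# Number of pairwise relations stored by dense attention vs. one sparse pass.
dense_pairs = N * N
sparse_pairs = int(mask.sum())

# How many positions the first point can draw on after 1, 2 and 3 passes.
coverage = [int(reach_after(mask, k)[0].sum()) for k in (1, 2, 3)]

print("dense relations: ", dense_pairs)
print("sparse relations:", sparse_pairs)
print("positions reached after 1, 2, 3 passes:", coverage)
```

For this 4 x 5 x 6 volume, one sparse pass stores N(T + H + W - 2) = 1,560 relations instead of the 14,400 a dense map needs. A point reaches 13 positions after one pass (its three lines), 60 after two (the three planes through it), and all 120 only after three. This matches the abstract's remark that two modules, as in RCCA-2D, do not cover the whole spatiotemporal space.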

Spatio-Temporal Triangular-Chain CRF for Activity Recognition

Learning Visual Context for Group Activity Recognition

Hierarchical Complex Activity Representation and Recognition Using Topic Model and Classifier Level Fusion

A Hierarchical Spatio-Temporal Model for Human Activity Recognition

Human Activity Recognition based on Dynamic Spatio-Temporal Relations

Efficient Spatialtemporal Context Modeling for Action Recognition

Efficient spatiotemporal context modeling for action recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

Online Robust Action Recognition Based on a Hierarchical Model

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Multi-dimensional convolution transformer for group activity recognition

Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

Activity Recognition Using Dense Long-Duration Trajectories

Motion Complement and Temporal Multifocusing for Skeleton-Based Action Recognition

Spatiotemporal Multi-Task Network for Human Activity Understanding

Skeleton-based action recognition with hierarchical spatial reasoning and temporal stack learning network

Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning