Abstract: Fine-grained action recognition is a critical problem in video processing that aims to distinguish similar actions involving subtle interactions between humans and objects. Inspired by the remarkable performance of the Transformer in natural language processing, the Transformer has been applied to the fine-grained action recognition task. However, the Transformer needs abundant training data and extra supervision to achieve results comparable to those of convolutional neural networks (CNNs). To address these issues, we propose a Convolutional Transformer Network (CTN), which integrates the merits of CNNs (e.g., weight sharing, locality, and capturing low-level features in videos) with the benefits of the Transformer (e.g., dynamic attention and learning long-range dependencies). We make two modifications to the original Transformer: (i) we propose a video-to-tokens module that extracts tokens from spatial-temporal features computed by 3D convolutions, instead of embedding tokens directly from raw input video clips; (ii) we completely replace the linear mapping in the multi-head self-attention layer with a depth-wise convolutional mapping, which applies a depth-wise separable convolution to the embedded token maps. With these two modifications, our approach can extract effective spatial-temporal features from videos and process the long token sequences encountered in videos. Experimental results demonstrate that the proposed CTN achieves state-of-the-art accuracy on two fine-grained action recognition datasets (i.e., Epic-Kitchens and Diving 48) with a small increase in computation.
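The second modification can be illustrated with a minimal pure-Python sketch of a depth-wise separable convolutional mapping. It is not the CTN implementation: the function names are hypothetical, the token map is assumed to be 2D with shape [C][H][W] for brevity (video token maps would carry an additional temporal axis), and zero padding with stride 1 is assumed so that the number of tokens is preserved.

```python
def depthwise_conv2d(x, kernels):
    """Filter each channel with its own k x k kernel (zero padding, stride 1).

    x:       token map as nested lists, shape [C][H][W]
    kernels: one kernel per channel, shape [C][k][k], k odd
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    pad = k // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for c in range(C):
        for i in range(H):
            for j in range(W):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        # Positions outside the map contribute zero.
                        if 0 <= ii < H and 0 <= jj < W:
                            acc += x[c][ii][jj] * kernels[c][di][dj]
                out[c][i][j] = acc
    return out


def pointwise_conv(x, weights):
    """1x1 convolution: mix channels independently at every position.

    weights: shape [C_out][C_in]
    """
    C_in, H, W = len(x), len(x[0]), len(x[0][0])
    return [
        [[sum(row[c] * x[c][i][j] for c in range(C_in)) for j in range(W)]
         for i in range(H)]
        for row in weights
    ]


def flatten_to_tokens(x):
    """Turn a [C][H][W] map into H*W tokens, each a vector of length C."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[x[c][i][j] for c in range(C)] for i in range(H) for j in range(W)]


def conv_projection(x, dw_kernels, pw_weights):
    """Depth-wise separable mapping used in place of a linear projection."""
    return flatten_to_tokens(pointwise_conv(depthwise_conv2d(x, dw_kernels), pw_weights))
```

In this sketch the query, key, and value tokens would each come from one call to `conv_projection` with its own kernels. Unlike a per-token linear mapping, every output token also depends on its spatial neighbours through the depth-wise step.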

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition

Separable ConvNet Spatiotemporal Mixer for Action Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

3D-TDC: A 3D temporal dilation convolution framework for video action recognition

Spatio-Temporal Collaborative Module for Efficient Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

Convolutional transformer network for fine-grained action recognition

Multi-scale Spatial-Temporal Integration Convolutional Tube for Human Action Recognition

Sparse Temporal Causal Convolution for Efficient Action Modeling

Efficient spatio-temporal network for action recognition

Mixed Resolution Network with Hierarchical Motion Modeling for Efficient Action Recognition

Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

STCA: an action recognition network with spatio-temporal convolution and attention

MiCT: Mixed 3D/2D Convolutional Tube for Human Action Recognition

Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition