Abstract: Vision-based action recognition has progressed rapidly in the last few years, owing to advances in deep convolutional neural networks (CNNs). In contrast with 2D CNN-based approaches, 3D CNN-based approaches can effectively capture both spatial and temporal features, but they are computationally intensive. To boost 2D CNN performance, most existing methods leverage channel attention (e.g., squeeze-and-excitation), which, despite its strong impact on model performance, operates only on the channel dimension and ignores the spatial dimension. In this work, we design a generic and collaborative excitation module for action recognition, the Collaborative Positional-Motion Excitation Module (CPME). CPME is a dual-pathway excitation module designed to embed the crucial types of information, mainly positional information and motion information, for efficient action recognition. The first pathway, the Positional Enhancement Pathway (PEP), encodes direction-aware and position-sensitive information. The second pathway, the Motion Enhancement Pathway (MEP), encodes motion information by emphasizing the informative features in each frame and exciting motion-sensitive channels. We integrate the proposed CPME into 2D CNNs to form a simple yet effective CPME-Net with limited extra computational cost. Finally, a discriminative and diverse video-level representation for action recognition is generated by end-to-end training. Experiments on two popular action recognition datasets demonstrate that CPME blocks improve performance over the 2D CNN baseline and that our method achieves competitive results against state-of-the-art methods.
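The abstract does not give the internals of the two pathways, so the following is only a minimal sketch of the general idea under stated assumptions, not the CPME implementation. It assumes the positional pathway can be approximated by pooling separately along each spatial direction (in the style of coordinate attention) and the motion pathway by gating channels with differences between adjacent frames. The function names `positional_gate`, `motion_gate`, and `excite` are hypothetical, and the learned convolutions a real module would use are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def positional_gate(fmap):
    """Direction-aware gate for one channel of one frame (H x W list).

    Assumption: pooling along rows and columns separately keeps the
    row and column position, unlike global average pooling.
    """
    h, w = len(fmap), len(fmap[0])
    row_mean = [sum(row) / w for row in fmap]  # one value per row (length H)
    col_mean = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]  # length W
    return [[sigmoid(row_mean[i]) * sigmoid(col_mean[j]) for j in range(w)]
            for i in range(h)]


def motion_gate(clip):
    """Motion-sensitive channel gate for a clip (T x C list).

    Each frame is assumed to be already pooled to one value per channel.
    The gate for frame t comes from the difference between frame t+1 and
    frame t; the last frame is compared with itself (zero difference).
    """
    gates = []
    for t, frame in enumerate(clip):
        nxt = clip[t + 1] if t + 1 < len(clip) else frame
        gates.append([sigmoid(b - a) for a, b in zip(frame, nxt)])
    return gates


def excite(x, gate):
    """Residual excitation: keep the original feature and add the gated one."""
    return x + x * gate
```

In this sketch a channel whose value changes strongly between adjacent frames receives a gate close to 1, while a static channel receives 0.5; the residual form in `excite` keeps the original feature so that static appearance information is not suppressed.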

Multi-Kernel Excitation Network for Video Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

Temporal Interaction and Excitation for Action Recognition

Multi-level Channel Attention Excitation Network for Human Action Recognition in Videos

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Multipath Attention and Adaptive Gating Network for Video Action Recognition

Temporal Distinct Representation Learning for Action Recognition

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

Two-Path Motion Excitation for Action Recognition

Multi‐mode Neural Network for Human Action Recognition

A Spatio-temporal Hybrid Network for Action Recognition

Multi-level Three-Stream Convolutional Networks for Video-Based Action Recognition

Efficient temporal-spatial feature grouping for video action recognition

Multi-Level Recurrent Residual Networks for Action Recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition

Collaborative Positional-Motion Excitation Module for Efficient Action Recognition

Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

NAS-TC: Neural Architecture Search on Temporal Convolutions for Complex Action Recognition

Multi-Branch Spatial-Temporal Network for Action Recognition

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition