Abstract: The feature pyramid has been widely used in many visual tasks, such as fine-grained image classification, instance segmentation, and object detection, and has achieved promising performance. Although many algorithms exploit features from different levels to construct the feature pyramid, they usually treat these features equally and do not investigate in depth the inherent complementary advantages of different-level features. In this article, to learn a pyramid feature with robust representational ability for action recognition, we propose a novel collaborative and multilevel feature selection network (FSNet) that applies feature selection and aggregation to multilevel features according to action context. Unlike previous works that learn the pattern of frame appearance by enhancing spatial encoding, the proposed network consists of a position selection module and a channel selection module that adaptively aggregate multilevel features into a new informative feature along both the position and channel dimensions. The position selection module integrates the vectors at the same spatial location across multilevel features with positionwise attention. Similarly, the channel selection module selectively aggregates the channel maps at the same channel location across multilevel features with channelwise attention. Positionwise features with different receptive fields and channelwise features with different pattern-specific responses are each emphasized according to their correlation with actions, and the results are fused into a new informative feature for action recognition. The proposed FSNet can be inserted flexibly into different backbone networks, and extensive experiments are conducted on three benchmark action datasets: Kinetics, UCF101, and HMDB51. Experimental results show that FSNet is practical and can be collaboratively trained to boost the representational ability of existing networks. FSNet achieves superior performance against most top-tier models on Kinetics and against all compared models on UCF101 and HMDB51.
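The two selection modules described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how attention scores are computed or how the two outputs are fused, so the scoring weights `w_pos` and `w_ch`, the global average pooling, the softmax over levels, and the additive fusion are all assumptions made for illustration. The sketch also assumes the multilevel features have already been resized to a common shape `(L, C, H, W)`.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_select(feats, w_pos):
    """Positionwise selection across levels (illustrative assumption).

    feats: (L, C, H, W) multilevel features at a common resolution.
    w_pos: (C,) hypothetical scoring vector.
    Returns a (C, H, W) feature.
    """
    # One scalar score per level at each spatial location.
    scores = np.einsum("lchw,c->lhw", feats, w_pos)
    attn = softmax(scores, axis=0)              # (L, H, W), sums to 1 over levels
    return (attn[:, None] * feats).sum(axis=0)  # weighted sum over levels

def channel_select(feats, w_ch):
    """Channelwise selection across levels (illustrative assumption).

    w_ch: (C, C) hypothetical projection applied to pooled descriptors.
    Returns a (C, H, W) feature.
    """
    pooled = feats.mean(axis=(2, 3))            # (L, C) global average pooling
    scores = pooled @ w_ch                      # (L, C) one score per level per channel
    attn = softmax(scores, axis=0)              # (L, C), sums to 1 over levels
    return (attn[:, :, None, None] * feats).sum(axis=0)

def fuse(feats, w_pos, w_ch):
    # Additive fusion is an assumption; the abstract only says the two are fused.
    return position_select(feats, w_pos) + channel_select(feats, w_ch)

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4, 5, 6))       # 3 levels, 4 channels, 5x6 map
fused = fuse(feats, rng.standard_normal(4), rng.standard_normal((4, 4)))
print(fused.shape)
```

With all-zero scoring weights, both attention maps become uniform and each module reduces to a plain average over levels, which is the "treat all levels equally" baseline that the abstract argues against.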

Spatial-temporal Pyramid Based Convolutional Neural Network for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition

Spatiotemporal Pyramid Network for Video Action Recognition

Spatiotemporal Pyramid Pooling In 3d Convolutional Neural Networks For Action Recognition

A Spatio-temporal Hybrid Network for Action Recognition

Temporal Pyramid Network for Action Recognition

A Two-Pathway Convolutional Neural Network with Temporal Pyramid Network for Action Recognition

Spatial-Temporal Pyramid Graph Reasoning for Action Recognition

Temporal adaptive feature pyramid network for action detection

Temporal Pyramid Pooling Based Relation Network For Action Recognition

Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification

End-to-end Video-level Representation Learning for Action Recognition

A New Temporal Deconvolutional Pyramid Network For Action Detection

Temporal Spiking Recurrent Neural Network for Action Recognition

Collaborative and Multilevel Feature Selection Network for Action Recognition

T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition

Temporal Recursive Propagation Network for Action Recognition

ACTION-Net: Multipath Excitation for Action Recognition

Multi-level Temporal Pyramid Network for Action Detection