Abstract: Action recognition in videos is a challenging task. Recently developed deep learning-based methods have achieved state-of-the-art performance on several action recognition benchmarks. However, these methods are inefficient: their large model size and long runtime restrict their practical application. In this study, we focus on improving both the accuracy and the efficiency of action recognition within the two-stream ConvNet framework by investigating effective video-level representations. Our motivation stems from the observation that adjacent video frames contain largely redundant information and that humans do not recognize actions from frame-level features. To extract effective video-level features, we propose a Hierarchical Temporal Pooling (HTP) module and develop a two-stream action recognition network, termed HTP-Net (Two-stream), which is designed to obtain effective video-level representations by hierarchically incorporating temporal motion and spatial appearance features. Two-stream methods that take optical flow as one of their inputs are computationally inefficient, since calculating optical flow is time-consuming. To improve efficiency, we therefore also consider a variant that drops optical flow and takes only raw RGB frames as input, termed HTP-Net (RGB). Extensive experiments have been conducted on two benchmarks, UCF101 and HMDB51. The results demonstrate that HTP-Net (Two-stream) achieves state-of-the-art performance, while HTP-Net (RGB) offers competitive accuracy and is approximately 1-2 orders of magnitude faster than other state-of-the-art single-stream action recognition methods. Specifically, HTP-Net (RGB) runs at 42 videos per second (vps) and 672 frames per second (fps) on an NVIDIA Titan X GPU, which enables real-time action recognition and is of great value in practical applications.
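The idea of pooling frame-level features hierarchically into a single video-level representation can be illustrated with a minimal sketch. The abstract does not specify the pooling operator, the window size, or where the HTP module sits in the network, so everything below is an assumption made for illustration: the helper names `pool_once` and `hierarchical_temporal_pool` are hypothetical, element-wise max pooling over pairs of adjacent frames is a stand-in choice, and the toy features are synthetic. The only figure taken from the abstract is the reported throughput, which implies roughly 672 / 42 = 16 frames processed per video.

```python
from typing import List

Vector = List[float]


def pool_once(features: List[Vector], window: int = 2) -> List[Vector]:
    """Max-pool each run of `window` consecutive frame vectors into one vector."""
    pooled = []
    for start in range(0, len(features), window):
        group = features[start:start + window]
        # Element-wise max across the frames in this temporal window
        # (assumed operator; the abstract does not name one).
        pooled.append([max(values) for values in zip(*group)])
    return pooled


def hierarchical_temporal_pool(features: List[Vector], window: int = 2) -> Vector:
    """Apply pool_once level by level until one video-level vector remains."""
    if not features:
        raise ValueError("need at least one frame-level feature vector")
    if window < 2:
        raise ValueError("window must be at least 2")
    level = features
    while len(level) > 1:
        level = pool_once(level, window)
    return level[0]


# Frames per video implied by the reported throughput: 672 fps / 42 vps.
frames_per_video = 672 // 42

# Synthetic frame-level features: one 3-dimensional vector per frame.
frames = [[float(t), float(-t), 1.0] for t in range(frames_per_video)]
video_feature = hierarchical_temporal_pool(frames)
```

With 16 frames and a window of 2, the sketch reduces the sequence over four levels (16, 8, 4, 2, 1 vectors), so the temporal extent covered by each vector doubles at every level. A learned model would interleave such pooling with convolutional layers rather than apply it to fixed features.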
