Spatiotemporal Pyramid Pooling in 3D Convolutional Neural Networks for Action Recognition

Cheng Cheng, Pin Lv, Bing Su
DOI: https://doi.org/10.1109/ICIP.2018.8451625
2018-01-01
Abstract: Deep 3-dimensional convolutional networks (3D ConvNets) trained on large-scale video datasets have achieved promising results on action recognition. This paper improves their performance by incorporating spatiotemporal pyramid pooling. Specifically, we propose a spatiotemporal pyramid pooling layer to handle the temporal variations of video sequences. Based on this layer, we develop a new network architecture, called STPP-net, by integrating it into 3D ConvNets. The proposed network is robust to spatial and temporal variations of human actions and generates a fixed-dimensional representation regardless of video size or scale. We show that the new architecture outperforms the original 3D ConvNets by a large margin on three large-scale video classification and action recognition benchmarks: HMDB51, UCF101, and Kinetics.
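To make the "fixed-dimensional representation regardless of video size" claim concrete, the following is a minimal pure-Python sketch of pyramid pooling over a feature volume. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pooling operator or the pyramid levels, so max pooling and the `levels` values here are assumptions, and the names `bin_edges`, `stpp`, and `make_volume` are hypothetical.

```python
from itertools import product


def bin_edges(n, bins):
    """Split range(n) into `bins` windows (floor start, ceil end) covering every index."""
    return [((i * n) // bins, -((-(i + 1) * n) // bins)) for i in range(bins)]


def stpp(feature, levels=((1, 1), (1, 2), (2, 4))):
    """Max-pool a nested list shaped [C][T][H][W] into a fixed-length vector.

    Each level is (temporal_bins, spatial_bins). These values are assumed
    for illustration and are not taken from the paper.
    """
    out = []
    for channel in feature:
        t_len, h_len, w_len = len(channel), len(channel[0]), len(channel[0][0])
        for t_bins, s_bins in levels:
            windows = product(
                bin_edges(t_len, t_bins),
                bin_edges(h_len, s_bins),
                bin_edges(w_len, s_bins),
            )
            for (t0, t1), (h0, h1), (w0, w1) in windows:
                # One output value per spatiotemporal bin.
                out.append(max(
                    channel[t][h][w]
                    for t in range(t0, t1)
                    for h in range(h0, h1)
                    for w in range(w0, w1)
                ))
    return out


def make_volume(c, t, h, w):
    """Build a toy [C][T][H][W] feature volume with distinct values per position."""
    return [[[[float((ci + 1) * (ti * h * w + hi * w + wi))
               for wi in range(w)]
              for hi in range(h)]
             for ti in range(t)]
            for ci in range(c)]


# Two clips with different temporal and spatial extents give equal-length outputs.
short_clip = stpp(make_volume(2, 4, 7, 7))
long_clip = stpp(make_volume(2, 9, 5, 6))
print(len(short_clip), len(long_clip))
```

With these assumed levels, the output length is C × Σ(temporal_bins × spatial_bins²) and depends only on the channel count and the pyramid configuration, not on T, H, or W, which is what allows a fixed-size classifier to follow the layer.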