Deep Local Video Feature for Action Recognition

Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann, Shawn Newsam
DOI: https://doi.org/10.1109/cvprw.2017.161
2017-07-01
Abstract: We investigate the problem of representing an entire video using CNN features for human action recognition. End-to-end learning of CNN/RNNs is currently not possible for whole videos due to GPU memory limitations, so a common practice is to use sampled frames as inputs along with the video labels as supervision. However, the global video labels might not be suitable for all of the temporally local samples, as the videos often contain content besides the action of interest. We therefore propose to instead treat the deep networks trained on local inputs as local feature extractors. The local features are then aggregated to form global features, which are used to assign video-level labels through a second classification stage. This framework is more robust to the noisy local labels that result from propagating video-level labels. We investigate a number of design choices for this local feature approach, such as the optimal sampling and aggregation methods. Experimental results on the HMDB51 and UCF101 datasets show that simple maximum pooling on the sparsely sampled local features leads to significant performance improvement.
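The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_sparse` and `max_pool_features`, the evenly spaced sampling scheme, the feature dimension, and the random arrays standing in for per-frame CNN outputs are all assumptions made for illustration.

```python
import numpy as np


def sample_sparse(num_frames, num_samples):
    """Return evenly spaced frame indices (one assumed way to sample sparsely)."""
    return np.linspace(0, num_frames - 1, num=num_samples).round().astype(int)


def max_pool_features(local_features):
    """Aggregate (num_samples, dim) local features into one (dim,) global feature
    by taking the element-wise maximum over the sampled locations."""
    return np.asarray(local_features).max(axis=0)


# Toy data: random vectors stand in for local features from a trained network.
rng = np.random.default_rng(0)
frame_features = rng.normal(size=(300, 8))  # 300 frames, 8-dim features (assumed sizes)

idx = sample_sparse(len(frame_features), num_samples=25)
global_feature = max_pool_features(frame_features[idx])
```

In the framework the abstract outlines, a vector like `global_feature` would then be passed to a second-stage classifier trained on video-level labels; the choice of that classifier is not specified in the abstract and is left out here.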