Rank Pooling Dynamic Network: Learning End-to-end Dynamic Characteristic for Action Recognition

Zhigang Zhu, Hongbing Ji, Wenbo Zhang, Yiping Xu
DOI: https://doi.org/10.1016/j.neucom.2018.08.018
IF: 6
2018-01-01
Neurocomputing
Abstract: In video recognition, rank-pooling operators are models for ranking the frames of a video sequence, acting on either the raw inputs or the intermediate feature maps of a convolutional neural network (CNN). However, such models are currently limited to optimizing a linear ranking function with Rank-SVM or Rank-SVR. In this paper, we first propose a CNN architecture called the RGB Rank Pooling Dynamic Network (RGB-RPDN), which maps a video to multiple frame-level dynamic spaces of the same size as the input. Importantly, a classical classifier developed for 2D images (e.g., FC layers or a CNN) can be placed after the generated representation for action classification, so the joint architecture can be trained end to end. Second, we analyze how flow-level evolution can be modeled by the hand-crafted rank-pooling machine, and extend the frame-level dynamic space to a flow-level one with the Flow Rank Pooling Dynamic Network (Flow-RPDN). Third, we formulate equivalence relations between hand-crafted rank pooling and RPDN and qualitatively compare their computational costs. Finally, the frame-level and flow-level pipelines are combined by late fusion to produce the final prediction. Specifically, we train the two-stream RPDN on UCF101 and HMDB51, initializing its parameters from models pre-trained on the large-scale Kinetics dataset. Experimental results demonstrate that RPDN improves on hand-crafted rank-pooling machines by a large margin and achieves higher classification accuracy in action recognition.
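For readers unfamiliar with the hand-crafted baseline the abstract refers to, the following is a minimal sketch of rank pooling with a linear ranking function. It is not the authors' RPDN and not their exact baseline. The function name `rank_pool`, the time-varying-mean smoothing, and the ridge least-squares fit (used here as a simple stand-in for Rank-SVR) are all assumptions made for illustration.

```python
import numpy as np


def rank_pool(frames: np.ndarray, ridge: float = 1e-3) -> np.ndarray:
    """Fit a linear ranking function u so that u . v_t grows with time t.

    frames: array of shape (T, D), one feature vector per frame.
    Returns u of shape (D,), which serves as the video-level descriptor.
    Hypothetical helper: ridge regression replaces Rank-SVR here.
    """
    T = frames.shape[0]
    steps = np.arange(1, T + 1)
    # Assumed smoothing step: v_t is the mean of frames 1..t.
    v = np.cumsum(frames, axis=0) / steps[:, None]
    # Regression targets are the frame indices 1..T.
    targets = steps.astype(float)
    # Solve the ridge-regularised normal equations (V^T V + ridge*I) u = V^T t.
    gram = v.T @ v + ridge * np.eye(v.shape[1])
    return np.linalg.solve(gram, v.T @ targets)


if __name__ == "__main__":
    # Toy sequence: the first feature grows linearly with time.
    toy = np.arange(1, 11, dtype=float)[:, None] * np.array([1.0, 0.0, 0.0, 0.0])
    print(rank_pool(toy))
```

The returned vector `u` summarises how the features evolve over the sequence. An end-to-end approach such as the one the abstract describes would learn this mapping inside the network rather than solving a separate fitting problem per video.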