3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field

Wukui Yang, Shan Gao, Wenran Liu, Xiangyang Ji
DOI: https://doi.org/10.1109/MMSP.2018.8547088
2018-01-01
Abstract: Two-stream architectures for video action recognition have recently achieved great success. They encode appearance with RGB frames and motion with optical flow. However, optical flow depicts a pixel-level motion field that focuses heavily on detail and struggles with large displacements. Humans, in fact, tend to focus on global motion rather than pixel-level motion. Inspired by this, we propose a novel 3-stream network structure with a spatial ConvNet, a pixel-level temporal ConvNet, and a block-level temporal ConvNet. Integrating multi-granularity motion representations significantly outperforms architectures based on a single pixel-level motion field. Further, the block-level motion vector field can be obtained from compressed videos without extra computation. We address the missing and noisy motion patterns of the motion vector field with intra-encoded block rectifying and flow-guided filtering, building a hybrid motion field for our block-level temporal ConvNet. Our approach obtains state-of-the-art accuracy on UCF101 (95.27%) and HMDB51 (69.21%).
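The abstract does not spell out how intra-encoded blocks are rectified or how the three streams are combined, so the following is only a minimal sketch under stated assumptions. It assumes that intra-coded blocks (which carry no motion vector) are filled with the optical flow averaged over the same block, and that the three streams are merged by a weighted average of their class scores. The function names, the block size, and the fusion weights are all hypothetical and are not taken from the paper; the flow-guided filtering step is not shown.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float]  # one (dx, dy) motion vector


def block_average(flow: List[List[Vec]], block: int) -> List[List[Vec]]:
    """Average a dense per-pixel flow field over non-overlapping block x block tiles."""
    rows, cols = len(flow) // block, len(flow[0]) // block
    n = block * block
    coarse = []
    for by in range(rows):
        row = []
        for bx in range(cols):
            sx = sy = 0.0
            for y in range(by * block, (by + 1) * block):
                for x in range(bx * block, (bx + 1) * block):
                    sx += flow[y][x][0]
                    sy += flow[y][x][1]
            row.append((sx / n, sy / n))
        coarse.append(row)
    return coarse


def rectify_intra_blocks(mv: List[List[Vec]],
                         intra_mask: List[List[bool]],
                         flow: List[List[Vec]],
                         block: int) -> List[List[Vec]]:
    """Assumed rectification: replace vectors of intra-coded blocks with block-averaged flow."""
    coarse = block_average(flow, block)
    return [
        [coarse[i][j] if intra_mask[i][j] else mv[i][j] for j in range(len(mv[0]))]
        for i in range(len(mv))
    ]


def fuse_scores(stream_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Assumed late fusion: weighted average of per-class scores across streams."""
    total = sum(weights)
    num_classes = len(stream_scores[0])
    return [
        sum(w * s[c] for w, s in zip(weights, stream_scores)) / total
        for c in range(num_classes)
    ]


# Toy example: a 4x4 dense flow field and a 2x2 block-level motion vector field.
dense_flow = [[(2.0, 0.0)] * 4 for _ in range(4)]
motion_vectors = [[(1.0, 1.0), (0.0, 0.0)], [(3.0, 3.0), (4.0, 4.0)]]
intra = [[False, True], [False, False]]  # the top-right block is intra-coded
hybrid = rectify_intra_blocks(motion_vectors, intra, dense_flow, block=2)

# Scores from the spatial, pixel-level temporal, and block-level temporal streams.
fused = fuse_scores([[0.2, 0.8], [0.6, 0.4], [0.4, 0.6]], weights=[1.0, 1.0, 2.0])
```

In this toy run only the intra-coded block changes; all other motion vectors pass through untouched, which mirrors the idea of building a hybrid field from compressed-domain vectors plus optical flow.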