Improving Action Recognition with Valued Patches Exploiting

Wu Luo, Chongyang Zhang, Weiwei Liu, Jintao Wu, Weiyao Lin
DOI: https://doi.org/10.1109/bigmm.2019.00-27
2019-01-01
Abstract: Recent human action recognition methods mainly build two-stream or multi-stream deep neural networks, which exploit human spatiotemporal features effectively. However, because they ignore interactive scenes, most of these methods cannot achieve impressive performance. In this paper, we propose a novel multi-stream fusion framework based on discriminative scene patches and motion patches. Unlike existing two-stream or multi-stream methods, our work improves accuracy by 1) paying more attention to exploiting discriminative scene patches and motion patches, and 2) proposing a novel 2D+3D multi-stream feature aggregation mechanism, in which 2D features from RGB images and 3D features of valued patches are combined to improve the representation of spatiotemporal features. Our framework is evaluated on three widely used video action benchmarks, where it outperforms other state-of-the-art recognition approaches by a significant margin, reaching accuracies of up to 85.7% on JHMDB, 87.7% on HMDB51, and 98.6% on UCF101.
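The abstract does not specify how the 2D and 3D streams are aggregated, so the following is only a minimal sketch of one common option: L2-normalize each stream's feature vector, concatenate them, and score the result with a linear softmax classifier. The stream names (`rgb_2d`, `scene_patch_3d`, `motion_patch_3d`), the concatenation-based fusion, and the classifier are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical stream names; the paper's actual stream layout may differ.
STREAM_ORDER = ("rgb_2d", "scene_patch_3d", "motion_patch_3d")


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm (zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def fuse_streams(
    streams: Dict[str, Sequence[float]],
    order: Sequence[str] = STREAM_ORDER,
) -> List[float]:
    """Concatenate per-stream features after normalizing each one.

    Normalizing first keeps a high-magnitude stream from dominating
    the fused representation. Concatenation is an assumed fusion rule.
    """
    fused: List[float] = []
    for name in order:
        fused.extend(l2_normalize(streams[name]))
    return fused


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    fused: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Apply a linear layer (one weight row per class), then softmax."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)
```

In practice the per-stream vectors would come from a 2D CNN on RGB frames and 3D CNNs on the selected scene and motion patches; here they are plain lists so the fusion step can be read in isolation.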