Realistic Human Action Recognition: when Deep Learning Meets VLAD

Lei Zhang, Yangyang Feng, Jiqing Han, Xiantong Zhen
DOI: https://doi.org/10.1109/icassp.2016.7471897
2016-01-01
Abstract: Human action recognition from realistic scenarios is extremely challenging due to large intra-class variation and complex background clutter. In this paper, by leveraging the strengths of deep learning and the vector of locally aggregated descriptors (VLAD), we propose a new method for human action recognition from realistic datasets. We adopt stacked convolutional independent subspace analysis (ISA) networks to learn 3D cuboid representations directly from spatio-temporal video data, and we propose an improved VLAD that incorporates spatio-temporal geometrical information to encode the deep-learned local features. On two challenging realistic datasets, the YouTube action and HMDB51 datasets, the proposed method achieves state-of-the-art performance with an efficient linear SVM classifier, which is competitive with, and even better than, existing sophisticated algorithms.
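For readers unfamiliar with VLAD, the following is a minimal sketch of the standard encoding step only: each local descriptor is assigned to its nearest codebook center, and the residuals are summed per center. It does not reproduce the improved, spatio-temporally aware variant proposed in the paper, whose details are not given in the abstract. The function name `vlad_encode`, the toy inputs, and the power and L2 normalization steps are assumptions chosen for illustration.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Standard VLAD sketch (not the paper's improved variant).

    descriptors: (N, D) array of local features (hypothetical input).
    centers:     (K, D) array of codebook centers, e.g. from k-means.
    Returns a (K * D,) normalized vector.
    """
    # Squared Euclidean distance from every descriptor to every center: (N, K)
    diffs = descriptors[:, None, :] - centers[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)

    k, d = centers.shape
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            # Accumulate residuals of the descriptors assigned to center i
            vlad[i] = (members - centers[i]).sum(axis=0)

    vlad = vlad.ravel()
    # Assumed post-processing: signed square-root (power) normalization, then L2
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting fixed-length vector could then be fed to a linear classifier, as the abstract describes for the linear SVM stage.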