Good Practices for Learning to Recognize Actions Using FV and VLAD

Jianxin Wu,Yu Zhang,Weiyao Lin
DOI: https://doi.org/10.1109/tcyb.2015.2493538
IF: 11.8
2015-01-01
IEEE Transactions on Cybernetics
Abstract:High dimensional representations such as Fisher vectors (FV) and vectors of locally aggregated descriptors (VLAD) have shown state-of-the-art accuracy for action recognition in videos. The high dimensionality, on the other hand, also causes computational difficulties when scaling up to large-scale video data. This paper makes three lines of contributions to learning to recognize actions using high dimensional representations. First, we reviewed several existing techniques that improve upon FV or VLAD in image classification, and performed extensive empirical evaluations to assess their applicability for action recognition. Our analyses of these empirical results show that normality and bimodality are essential to achieve high accuracy. Second, we proposed a new pooling strategy for VLAD and three simple, efficient, and effective transformations for both FV and VLAD. Both proposed methods have shown higher accuracy than the original FV/VLAD method in extensive evaluations. Third, we proposed and evaluated new feature selection and compression methods for the FV and VLAD representations. This strategy uses only 4% of the storage of the original representation, but achieves comparable or even higher accuracy. Based on these contributions, we recommend a set of good practices for action recognition in videos for practitioners in this field.
What problem does this paper attempt to address?