Recurrent Temporal Sparse Autoencoder for Attention-Based Action Recognition.

Miao Xin, Hong Zhang, Mingui Sun, Ding Yuan
DOI: https://doi.org/10.1109/ijcnn.2016.7727234
2016-01-01
Abstract: Visual context is fundamental to understanding human actions in videos. However, efficiently exploiting temporal context information presents an enormous challenge in this area. Two main problems are long-standing: (1) video frames are redundant while discriminative information is sparse; (2) a large amount of interfering information is mixed into frame sequences. These factors result in redundant computation and recognition failures. In this paper, we propose a learnable temporal attention mechanism to automatically select important time points from action sequences. We design an unsupervised Recurrent Temporal Sparse Autoencoder (RTSAE) network, which learns to extract sparse key-frames that sharpen discriminative capability while retaining descriptive capability, as well as to shield against interfering information. By applying this technique to a recently proposed action recognition model, the Adaptive Recurrent-convolutional Hybrid network (ARCH), we significantly improve its performance in both speed and accuracy. Experiments demonstrate that, with the help of the RTSAE, ARCH outperforms most state-of-the-art methods on the UCF101 and HMDB51 datasets.