Spatiotemporal Multi-Task Network for Human Activity Understanding.

Yao Liu, Jianqiang Huang, Chang Zhou, Deng Cai, Xian-Sheng Hua
DOI: https://doi.org/10.1145/3126686.3126705
2017-01-01
Abstract: Recently, remarkable progress has been achieved in human action recognition and detection using deep learning techniques. However, for action detection in real-world untrimmed videos, the accuracy of most existing approaches is still far from satisfactory because temporal action localization is difficult. In addition, spatiotemporal features are not well utilized in recent work on video analysis. To tackle these problems, we propose a spatiotemporal, multi-task, 3D deep convolutional neural network to detect actions in untrimmed videos, that is, to both temporally localize and recognize them. First, we introduce a fusion framework that aims to extract video-level spatiotemporal features in the training phase, and we demonstrate the effectiveness of video-level features by evaluating our model on the human action recognition task. Then, under the fusion framework, we propose a spatiotemporal multi-task network with two sibling output layers, one for action classification and one for temporal localization. To obtain precise temporal locations, we present a novel temporal regression method that revises the proposal window containing an action. Meanwhile, to better utilize the rich motion information in videos, we introduce a novel video representation, interlaced images, as an additional network input stream. As a result, our model outperforms state-of-the-art methods for both action recognition and detection on standard benchmarks.
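The abstract does not give the form of the temporal regression, so the following is only a minimal sketch of how revising a proposal window might look. It assumes a center-offset and log-length parameterization borrowed from bounding-box regression in object detection, which may differ from the paper's actual method. The function names (`regression_targets`, `refine_window`, `temporal_iou`) and the example timestamps are hypothetical.

```python
import math


def regression_targets(proposal, truth):
    """Offsets that map a proposal window onto a ground-truth window.

    Assumed parameterization (not taken from the paper): a center shift
    normalized by the proposal length, and a log ratio of lengths.
    """
    p_len = proposal[1] - proposal[0]
    t_len = truth[1] - truth[0]
    p_center = proposal[0] + 0.5 * p_len
    t_center = truth[0] + 0.5 * t_len
    return (t_center - p_center) / p_len, math.log(t_len / p_len)


def refine_window(proposal, d_center, d_log_len):
    """Apply predicted offsets to a proposal window (start, end) in seconds."""
    p_len = proposal[1] - proposal[0]
    p_center = proposal[0] + 0.5 * p_len
    new_center = p_center + d_center * p_len
    new_len = p_len * math.exp(d_log_len)
    return new_center - 0.5 * new_len, new_center + 0.5 * new_len


def temporal_iou(a, b):
    """Intersection over union of two temporal windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a proposal at 10-20 s and an action at 12-24 s.
proposal, truth = (10.0, 20.0), (12.0, 24.0)
d_center, d_log_len = regression_targets(proposal, truth)
refined = refine_window(proposal, d_center, d_log_len)
```

In a multi-task setup of the kind the abstract describes, the localization output layer would predict the two offsets while the sibling classification layer predicts the action label; here the offsets are computed from the ground truth only to show that the parameterization is invertible.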