Action Recognition and Localization with Instance FCNN

Jialin Wu, Yu Yang, He Jiang, Yi Li, Guijin Wang, Xiangyang Ji
DOI: https://doi.org/10.1109/rcar.2018.8621829
2018-01-01
Abstract: For humans and robots to cooperate, it is crucial that robots recognize human actions. However, action recognition remains challenging in robotic vision because videos are often untrimmed and arbitrarily long, with multiple action instances and background clutter. To address this issue, we propose an end-to-end, fully convolutional spatio-temporal localization framework consisting of a proposal branch and a recognition branch. In the proposal branch, pixel-level score maps are regressed to model the temporal boundaries and spatial instance-aware sub-tubes of each action and to generate potential action tubes. In the recognition branch, a separate set of category score maps is regressed to indicate the confidence of each action type for each potential action tube. Our method is efficient because all proposals are obtained by processing each frame only once. We evaluate our networks on two well-known public datasets, MSR-II [1] and UT-Interaction [2]; our results outperform state-of-the-art methods by a large margin at various spatio-temporal thresholds α under multiple evaluation metrics.
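The abstract does not define how a predicted action tube is compared with the ground truth at threshold α. As a minimal sketch, assuming a commonly used spatio-temporal overlap (temporal IoU multiplied by the mean spatial IoU over shared frames), a prediction could be scored as below. The names `box_iou`, `tube_iou`, and `is_correct`, and the tube representation as a frame-to-box dictionary, are illustrative assumptions and not taken from the paper.

```python
def box_iou(a, b):
    """Spatial IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(pred, gt):
    """Spatio-temporal IoU of two tubes (assumed definition).

    Each tube is a dict mapping frame index -> box (x1, y1, x2, y2).
    Overlap = temporal IoU * mean spatial IoU over the shared frames.
    """
    shared = pred.keys() & gt.keys()
    if not shared:
        return 0.0
    temporal = len(shared) / len(pred.keys() | gt.keys())
    spatial = sum(box_iou(pred[f], gt[f]) for f in shared) / len(shared)
    return temporal * spatial


def is_correct(pred, gt, alpha):
    """A prediction counts as a hit when its overlap reaches alpha."""
    return tube_iou(pred, gt) >= alpha


if __name__ == "__main__":
    pred = {0: (0, 0, 2, 2), 1: (0, 0, 2, 2)}
    gt = {1: (0, 0, 2, 2), 2: (0, 0, 2, 2)}
    print(tube_iou(pred, gt), is_correct(pred, gt, alpha=0.5))
```

Sweeping `alpha` over several values and counting hits gives one localization score per threshold, which is the kind of multi-threshold comparison the abstract refers to.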