Spatial Mask ConvLSTM Network and Intra-Class Joint Training Method for Human Action Recognition in Video.

Jingjun Chen, Yonghong Song, Yuanlin Zhang
DOI: https://doi.org/10.1109/icme.2019.00185
2019-01-01
Abstract: Attention models are widely used for action recognition, but most of them do not consider the relationship between spatial and temporal information. We therefore propose a Spatial Mask ConvLSTM Network (SMConvLSTM-Net) that determines an attention score for each pixel position. SMConvLSTM-Net combines spatial and temporal information to produce a more precise spatial mask with a long receptive field in the time domain. Furthermore, to exploit the connection between different samples of the same category, we propose a novel intra-class joint training method that makes the network extract the common action-related characteristics of a class across different backgrounds. Extensive experiments demonstrate the effectiveness of our method, which significantly outperforms the baseline C3D network on UCF101 and HMDB51. Moreover, compared with several state-of-the-art approaches that use RGB input, our approach achieves the best performance on UCF101 and a comparable result on HMDB51.