Convolutional Gated Recurrent Units Fusion For Video Action Recognition

Bo Huang, Hualong Huang, Hongtao Lu
DOI: https://doi.org/10.1007/978-3-319-70090-8_12
2017-01-01
Abstract: Two-stream Convolutional Networks (ConvNets) have achieved great success in video action recognition, and research shows that early fusion of the two streams can further boost performance. Existing fusion methods focus on short snippets and thus fail to learn global representations of videos. We introduce a Convolutional Gated Recurrent Units (ConvGRU) fusion method to model long-term dependencies within actions. This method combines the strengths of Recurrent Neural Networks (RNNs), which have a strong capacity to handle long-term dependencies in sequence modeling, and of an early fusion architecture, which learns the evolution of appearance and motion features. We further propose an end-to-end architecture based on this fusion method and evaluate our approach on UCF101, a widely used action recognition dataset. We investigate different input lengths and fusion layers and find that fusing at the last convolutional layer with an input length of 10 entries yields the best performance (93.0%), which is comparable to the state of the art.
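The fusion idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the two streams are combined before the recurrent unit, so channel-wise concatenation is an assumption here, as are the standard GRU gate equations with convolutions in place of matrix products, the omission of bias terms, the random weights, and all names (`conv2d_same`, `ConvGRUCell`, `fuse_two_stream`).

```python
import numpy as np


def conv2d_same(x, w):
    """Naive zero-padded 'same' 2-D cross-correlation.

    x: (C_in, H, W) feature map; w: (C_out, C_in, k, k) with odd k.
    Returns an array of shape (C_out, H, W).
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))  # zero padding on H and W only
    _, height, width = x.shape
    out = np.zeros((c_out, height, width))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j), contracted over input channels.
            window = xp[:, i:i + height, j:j + width]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], window)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class ConvGRUCell:
    """GRU cell whose gates use convolutions, so the hidden state stays spatial."""

    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)

        def init(c_in):
            # Small random weights; a real model would learn these (assumption).
            return rng.normal(0.0, 0.1, size=(hid_ch, c_in, k, k))

        self.hid_ch = hid_ch
        self.Wz, self.Uz = init(in_ch), init(hid_ch)  # update gate
        self.Wr, self.Ur = init(in_ch), init(hid_ch)  # reset gate
        self.Wh, self.Uh = init(in_ch), init(hid_ch)  # candidate state

    def step(self, x, h):
        z = sigmoid(conv2d_same(x, self.Wz) + conv2d_same(h, self.Uz))
        r = sigmoid(conv2d_same(x, self.Wr) + conv2d_same(h, self.Ur))
        h_tilde = np.tanh(conv2d_same(x, self.Wh) + conv2d_same(r * h, self.Uh))
        # Blend the previous state with the candidate state.
        return (1.0 - z) * h + z * h_tilde


def fuse_two_stream(appearance, motion, cell):
    """Run the cell over T time steps of two-stream feature maps.

    appearance, motion: arrays of shape (T, C, H, W).
    Returns the final hidden state of shape (hid_ch, H, W).
    """
    steps, _, height, width = appearance.shape
    h = np.zeros((cell.hid_ch, height, width))
    for t in range(steps):
        # Assumed fusion input: stack both streams along the channel axis.
        x = np.concatenate([appearance[t], motion[t]], axis=0)
        h = cell.step(x, h)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    app = rng.normal(size=(10, 4, 7, 7))  # 10 steps of toy appearance features
    mot = rng.normal(size=(10, 4, 7, 7))  # 10 steps of toy motion features
    cell = ConvGRUCell(in_ch=8, hid_ch=6)
    print(fuse_two_stream(app, mot, cell).shape)
```

The toy sequence length of 10 mirrors the input length reported in the abstract; the channel and spatial sizes are arbitrary. In a full pipeline the final hidden state would feed a classifier, which this sketch leaves out.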