Multi-level Three-Stream Convolutional Networks for Video-Based Action Recognition

Yijing Lv,Huicheng Zheng,Wei Zhang
DOI: https://doi.org/10.1007/978-3-030-03335-4_21
2018-01-01
Abstract:Deep convolutional neural networks (ConvNets) have shown remarkable capability for visual feature learning and representation. In the field of video-based action recognition, much progress has been made with the development of ConvNets. However, main-stream ConvNets used for video-based action recognition, such as two-stream ConvNets and 3D ConvNets, still lack the ability to represent fine-grained features. In this paper, we propose a novel architecture named multi-level three-stream convolutional network (MLTSN), which contains three streams, i.e., the spatial stream, the temporal stream, and the multi-level correlation stream (MLCS). The MLCS contains several correlation modules, which fuse appearance and motion features at the same levels and obtain spatial-temporal correlation maps. The correlation maps will further be fed in several convolution layers to get refined features. The whole network is trained in a multi-step modality. Extensive experimental results show that the performance of the proposed network is competitive to state-of-the-art methods on HMDB51 and UCF101.
What problem does this paper attempt to address?