Action Recognition Based on Two-Stream Convolutional Networks with Long-Short-Term Spatiotemporal Features

Yanqin Wan, Zujun Yu, Yao Wang, Xingxin Li
DOI: https://doi.org/10.1109/access.2020.2993227
IF: 3.9
2020-01-01
IEEE Access
Abstract: Human action recognition is an important research topic in computer vision because of its practical applications. Recently, a variety of approaches based on deep learning features have been proposed, motivated by the effectiveness of deep neural networks. However, most of these approaches cannot fully extract spatiotemporal features from videos because they do not account for the diversity of scales in the temporal domain. In this paper, we propose a two-stream convolutional network with long-short-term spatiotemporal features (LSF CNN) for the human action recognition task. The network is composed mainly of two subnetworks. The first is a long-term spatiotemporal feature extraction network (LT-Net) that takes stacked RGB images as input. The second is a short-term spatiotemporal feature extraction network (ST-Net) that takes as input the optical flow estimated from two adjacent frames. The features at the two temporal scales are fused in the fully-connected layer and fed into a linear support vector machine (SVM). We also propose a new expression for the optical flow field, which is shown to perform better than the traditional expression in action recognition. With the two-stream architecture, the network can learn deep features in both the spatial and temporal domains. Experimental results on the HMDB51 and UCF101 datasets indicate that the proposed approach improves action recognition accuracy by using long-short-term spatiotemporal information.
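The fusion-then-classify step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the two feature vectors are combined, so fusion is assumed here to be concatenation of L2-normalized vectors, and the linear SVM is assumed to be applied one-vs-rest. The function names, feature values, weights, and biases below are all hypothetical toy values chosen for illustration.

```python
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm (returned unchanged if all zeros)."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(long_term: Sequence[float], short_term: Sequence[float]) -> List[float]:
    """Assumed fusion rule: concatenate the normalized features of both streams."""
    return l2_normalize(long_term) + l2_normalize(short_term)


def linear_svm_predict(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> int:
    """One-vs-rest linear SVM decision: pick the class with the largest w.x + b."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, features)) + b
        for w, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy features standing in for the LT-Net and ST-Net outputs.
lt_feat = [3.0, 4.0]   # long-term (stacked RGB) stream
st_feat = [0.0, 2.0]   # short-term (optical flow) stream
fused = fuse_features(lt_feat, st_feat)

# Hypothetical SVM parameters for two action classes (already "trained").
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
biases = [0.0, 0.0]
predicted = linear_svm_predict(fused, weights, biases)
print(fused, predicted)
```

In a real pipeline the feature vectors would come from the fully-connected layers of the two subnetworks, and the SVM weights would be learned from training videos rather than set by hand.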