Ensembling 3D CNN Framework for Video Recognition

Ruolin Huang, Hongbin Dong, Guisheng Yin, Qiang Fu
DOI: https://doi.org/10.1109/IJCNN.2019.8852054
2019-01-01
Abstract: Video-based behavior recognition is a challenging research topic. The three-dimensional convolutional neural network (3D CNN) has been adopted effectively to capture features directly from videos. A 3D CNN extends the two-dimensional convolutional neural network by adding a time dimension, which lets it express motion information more effectively than a two-dimensional network. To make better use of the valuable features in the original video, only stacked RGB frames are used as the network input. This paper proposes an ensembling 3D CNN framework for video recognition. First, the network is initialized from a model pre-trained on Sports-1M, and a 3D convolutional neural network based on multi-level feature fusion is constructed; the final high-dimensional feature combination is obtained by fusing multiple convolutional features. Then, a 3D convolutional neural network based on ensemble learning is proposed to increase motion information, enrich motion features, and improve robustness relative to a single feature representation. Three incomplete training sets are obtained with the Bagging algorithm, and each is used to train a separate 3D convolutional neural network so that the three networks differ. The output features of the three networks are then combined through the Stacking algorithm and fed into an SVM classifier to obtain the final results. The integration effects of different ensemble methods are compared. Experimental results show that the proposed method improves recognition accuracy on the UCF-101 data set.
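The Bagging and Stacking steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `bootstrap_subsets` and `stack_features`, the sample count, and the per-network output values are all hypothetical, and the abstract does not say how the subsets were drawn or how the features were combined. The sketch assumes standard bootstrap sampling (with replacement) for Bagging and plain concatenation of the three networks' outputs as the input to the SVM meta-classifier, which is omitted here.

```python
import random


def bootstrap_subsets(num_samples, num_subsets=3, seed=0):
    """Draw bootstrap samples of training indices (assumed Bagging scheme)."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(num_subsets):
        # Sampling with replacement typically leaves some indices out,
        # so each subset is usually an incomplete view of the training set.
        subsets.append([rng.randrange(num_samples) for _ in range(num_samples)])
    return subsets


def stack_features(per_network_outputs):
    """Concatenate the output vectors of several networks for one video.

    The concatenated vector is what a meta-classifier (an SVM in the
    abstract) would receive under this assumed Stacking scheme.
    """
    stacked = []
    for output in per_network_outputs:
        stacked.extend(output)
    return stacked


# Three index sets, one per 3D CNN to be trained (hypothetical sizes).
subsets = bootstrap_subsets(num_samples=10, num_subsets=3, seed=0)

# Hypothetical two-class outputs of the three trained networks for one video.
meta_input = stack_features([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]])
```

Each index list in `subsets` would select the videos used to train one of the three networks, and `meta_input` is the six-element vector a stacked classifier would consume for a single video.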