Cross-Fiber Spatial-Temporal Co-enhanced Networks for Video Action Recognition

Haoze Wu, Zheng-Jun Zha, Xin Wen, Zhenzhong Chen, Dong Liu, Xuejin Chen
DOI: https://doi.org/10.1145/3343031.3350891
2019-01-01
Abstract: 3D convolutional neural networks have recently been applied to explore spatial-temporal content for video action recognition. However, they either incur high computational cost from spatial-temporal feature extraction or ignore the correlation between appearance and motion. In this work, we propose a novel Cross-Fiber Spatial-Temporal Co-enhanced (CFST) architecture that aims to greatly reduce the number of parameters while achieving accurate action recognition. We slice the complex 3D convolutional network into a group of lightweight fibers that run through the whole network. Across the separate fibers, we introduce the Cross-Fiber Recalibration unit, which shares the features extracted by each fiber and measures the interaction between fibers to emphasize the informative ones. Within each fiber, the Spatial-Temporal Co-enhanced unit co-enhances the learning of spatial and temporal features, leading to a more discriminative spatial-temporal representation. We also present an end-to-end deep network, CFST-Net, based on the proposed CFST architecture for video action recognition. Extensive experimental results show that CFST-Net significantly boosts the performance of existing convolutional networks and achieves state-of-the-art accuracy on three challenging benchmarks, i.e., UCF-101, HMDB-51 and Kinetics-400, with far fewer parameters and FLOPs.
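To give a rough sense of why slicing a 3D convolutional layer into fibers can cut the parameter count, the sketch below compares the weights of one dense 3D convolution with the same layer split into independent channel slices. It is only an illustration under stated assumptions: the abstract does not specify the fiber design, so treating a fiber as an independent channel group, the helper names `conv3d_params` and `sliced_params`, and the chosen channel width, kernel size and fiber count are all hypothetical.

```python
def conv3d_params(c_in, c_out, kernel=(3, 3, 3)):
    """Weight count of one bias-free 3D convolution layer."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw


def sliced_params(c_in, c_out, fibers, kernel=(3, 3, 3)):
    """Weight count when the layer is split into `fibers` independent slices.

    Assumption (not taken from the paper): each slice maps c_in / fibers
    input channels to c_out / fibers output channels, with no weights
    shared or exchanged between slices.
    """
    if c_in % fibers or c_out % fibers:
        raise ValueError("channel counts must be divisible by the number of fibers")
    return fibers * conv3d_params(c_in // fibers, c_out // fibers, kernel)


# Hypothetical layer sizes, chosen only for illustration.
dense = conv3d_params(256, 256)
sliced = sliced_params(256, 256, fibers=16)
print(dense, sliced, dense // sliced)
```

Under these assumptions the saving equals the number of slices, since each slice sees only a fraction of the input and output channels. Fully independent slices would not exchange information, which is the gap the Cross-Fiber Recalibration unit described in the abstract is meant to address; that unit's own parameters are not counted here.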