Multi-cue Combination Network for Action-Based Video Classification.

Yan Tian, Yifan Cao, Jiachen Wu, Wei Huz, Chao Song, Tao Yang
DOI: https://doi.org/10.1049/iet-cvi.2018.5492
IF: 1.484
2019-01-01
IET Computer Vision
Abstract: Action-based video classification (or video-based action recognition) is an active research area in computer vision. However, all currently utilised action-based video classification approaches take spatial and temporal components into consideration while acoustic features (e.g. sound and speech) are neglected. In this study, the authors propose a novel approach to combine multiple cues (i.e. both visual and acoustic information) for action-based video classification. Additionally, they introduce dense connections into their three-stream network to address the vanishing gradient problem. Experimental results on the Kinetics Human Action Video data set and the Kinetics-Sounds data set show that their approach can effectively improve the accuracy of action-based video classification.