Joint Learning of NNeXtVLAD, CNN and Context Gating for Micro-Video Venue Classification.

Wei Liu, Xianglin Huang, Gang Cao, Jianglong Zhang, Gege Song, Lifang Yang
DOI: https://doi.org/10.1109/access.2019.2922430
IF: 3.9
2019-01-01
IEEE Access
Abstract: Micro-videos have grown explosively on online social platforms, which raises the question of how to encode them into effective representations. NeXtVLAD is an effective network that aggregates frame-level features into a compact supervector. However, the discriminative capability of this supervector is still limited, because the original NeXtVLAD network lacks a non-linear transformation at its head and L2 normalization at its tail. To address these problems, we propose an improved neural network architecture, normalized NeXtVLAD (NNeXtVLAD), which extends NeXtVLAD with a ReLU function and L2 normalization. Building on this network, we construct an end-to-end framework that jointly learns NNeXtVLAD, a CNN layer, and context gating for micro-video venue classification. Specifically, we first apply NNeXtVLAD layers in a three-stream architecture to aggregate visual, acoustic, and textual features. We then pack the aggregated features and embed them into a CNN layer to enhance the sparse concept-level representation. Finally, context gating is used to capture the interdependency among different network activations. Extensive experimental results on a real-world micro-video dataset show that the proposed model significantly outperforms state-of-the-art baselines in terms of both Micro-F1 and Macro-F1 scores.
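The two modifications named in the abstract (a ReLU at the head and L2 normalization at the tail of the aggregation step) and the context gating stage can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: `mean_pool` is a hypothetical stand-in for the NeXtVLAD aggregator, the exact position of the ReLU inside the network is an assumption, and the gating form `sigmoid(W x + b) * x` is the commonly used element-wise formulation, assumed here rather than taken from the paper. All function and parameter names are illustrative.

```python
import math


def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vec]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def mean_pool(frames):
    """Hypothetical stand-in aggregator (NOT NeXtVLAD): average over frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def normalized_aggregate(frames, aggregate=mean_pool):
    """ReLU at the head, aggregation, then L2 normalization at the tail."""
    activated = [relu(frame) for frame in frames]  # assumed placement of the ReLU
    pooled = aggregate(activated)
    return l2_normalize(pooled)


def context_gating(x, weights, bias):
    """Assumed gating form: y = sigmoid(W x + b) * x, applied element-wise."""
    gates = []
    for row, b in zip(weights, bias):
        z = sum(w * xi for w, xi in zip(row, x)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    return [g * xi for g, xi in zip(gates, x)]
```

For example, `normalized_aggregate([[3.0, -1.0], [5.0, 4.0]])` returns a unit-length vector, and `context_gating` with all-zero weights and biases scales every activation by 0.5, since each gate then equals `sigmoid(0)`.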