Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification

Wei Liu, Xianglin Huang, Gang Cao, Jianglong Zhang, Gege Song, Lifang Yang
DOI: https://doi.org/10.1007/s11042-019-08147-2
IF: 2.577
2019-01-01
Multimedia Tools and Applications
Abstract: With the large number of micro-videos available in social network applications, the venue category of a micro-video provides extremely valuable venue information that assists location-oriented applications, personalized services, etc. In this paper, we formulate micro-video venue classification as a multi-modal sequential modeling problem. Unlike existing approaches that use long short-term memory (LSTM) models to capture temporal patterns in micro-videos, we propose a multi-modal sequence model with gated fully convolutional blocks. Specifically, we first adopt three parallel gated fully convolutional blocks to extract spatiotemporal features from the visual, acoustic and textual modalities of micro-videos. Then, an additional gated fully convolutional block is used to fuse the spatiotemporal features of these three modalities. Finally, a corresponding prototype is learned simultaneously to improve robustness compared with the softmax classification function. Extensive experimental results on a real-world benchmark dataset demonstrate the effectiveness of our model in terms of both Micro-F and Macro-F scores.
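The pipeline described in the abstract (three parallel gated convolutional branches followed by a gated fusion block) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the GLU-style gating (a convolution multiplied elementwise by a sigmoid-activated second convolution), the kernel size, the channel widths, the alignment of all three modalities to the same number of time steps, mean pooling over time, and the plain softmax output. The prototype learning step is omitted, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, w, b):
    """1D convolution over time with zero 'same' padding (odd kernel size).

    x: (T, C_in), w: (K, C_in, C_out), b: (C_out,) -> (T, C_out)
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        window = xp[t:t + k]  # (K, C_in) slice centered on step t
        out[t] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return out

def gated_conv_block(x, params):
    """Assumed GLU-style gate: features * sigmoid(gate), both from convolutions."""
    w_f, b_f, w_g, b_g = params
    return conv1d_same(x, w_f, b_f) * sigmoid(conv1d_same(x, w_g, b_g))

def init_params(rng, k, c_in, c_out):
    """Random weights for one block (illustrative initialization only)."""
    scale = 1.0 / np.sqrt(k * c_in)
    return (rng.normal(0.0, scale, (k, c_in, c_out)), np.zeros(c_out),
            rng.normal(0.0, scale, (k, c_in, c_out)), np.zeros(c_out))

def classify(visual, acoustic, textual, branch_params, fusion_params, w_out):
    """Three parallel gated blocks, channel-wise concatenation, gated fusion."""
    feats = [gated_conv_block(x, p)
             for x, p in zip((visual, acoustic, textual), branch_params)]
    fused = gated_conv_block(np.concatenate(feats, axis=1), fusion_params)
    pooled = fused.mean(axis=0)          # average over time (assumption)
    logits = pooled @ w_out
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

# Toy usage: 6 time steps, made-up feature sizes, 5 venue classes.
rng = np.random.default_rng(0)
dims, hidden, n_classes = (16, 8, 4), 12, 5
inputs = [rng.normal(size=(6, d)) for d in dims]
branches = [init_params(rng, 3, d, hidden) for d in dims]
fusion = init_params(rng, 3, 3 * hidden, hidden)
probs = classify(*inputs, branches, fusion, rng.normal(size=(hidden, n_classes)))
```

Because every block is convolutional, all time steps are processed in parallel, which is the usual motivation for replacing a recurrent model such as an LSTM with this kind of architecture.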