Abstract: Video venue category prediction has been drawing increasing attention in the multimedia community for applications such as personalized location recommendation and video verification. Most existing works draw on information from either multiple modalities or other platforms to strengthen video representations. However, noisy acoustic information, sparse textual descriptions, and incompatible cross-platform data can limit the performance gain and reduce the universality of the model. Therefore, we focus on discriminative visual feature extraction from videos by introducing a hybrid-attention structure. In particular, we propose a novel Global-Local Attention Module (GLAM), which can be inserted into neural networks to generate enhanced visual features from video content. In GLAM, the Global Attention (GA) captures contextual scene-oriented information by assigning different weights to channels, while the Local Attention (LA) learns salient object-oriented features by allocating different weights to spatial regions. Moreover, GLAM can be extended to variants with multiple GAs and LAs for further visual enhancement. The two types of features captured by the GAs and LAs are integrated via convolution layers and then fed into a convolutional Long Short-Term Memory (convLSTM) network to generate spatial-temporal representations, constituting the content stream. In addition, video motions are explored to learn long-term movement variations, which also contribute to video venue prediction. The content and motion streams constitute our proposed Hybrid-Attention Enhanced Two-Stream Fusion Network (HA-TSFN), which finally merges the features from the two streams into comprehensive representations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the large-scale Vine dataset. The visualization also shows that the proposed GLAM can capture complementary scene-oriented and object-oriented visual features from videos. Our code is available at: https://github.com/zhangyanchao1014/HA-TSFN.
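
To make the channel-versus-spatial distinction concrete, the following is a minimal, parameter-free sketch of the two attention types on a single feature map. It is not the authors' implementation: the function names (`global_attention`, `local_attention`, `glam_sketch`), the sigmoid gates computed from plain averages, and the fixed 0.5/0.5 mix standing in for the learned convolutional fusion are all assumptions made for illustration. The released code linked above is the authoritative reference.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def global_attention(fmap):
    """Channel gate (assumed form): one weight per channel,
    derived from that channel's global average."""
    weights = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values)))
    out = [[[v * w for v in row] for row in channel]
           for channel, w in zip(fmap, weights)]
    return out, weights


def local_attention(fmap):
    """Spatial gate (assumed form): one weight per location,
    derived from the mean across channels at that location."""
    n_ch = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    mask = [[sigmoid(sum(fmap[c][i][j] for c in range(n_ch)) / n_ch)
             for j in range(width)]
            for i in range(height)]
    out = [[[fmap[c][i][j] * mask[i][j] for j in range(width)]
            for i in range(height)]
           for c in range(n_ch)]
    return out, mask


def glam_sketch(fmap):
    """Combine both branches. The fixed 0.5/0.5 mix is a placeholder
    for the learned convolution layers described in the abstract."""
    ga_out, _ = global_attention(fmap)
    la_out, _ = local_attention(fmap)
    return [[[0.5 * g + 0.5 * l for g, l in zip(g_row, l_row)]
             for g_row, l_row in zip(g_ch, l_ch)]
            for g_ch, l_ch in zip(ga_out, la_out)]


# Toy feature map: 2 channels, each 2x2 (channels, height, width).
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 0.0]]]
fused = glam_sketch(fmap)
```

In this sketch the channel gate rescales every location of a channel by the same factor, while the spatial gate rescales every channel at a location by the same factor, which mirrors the scene-oriented versus object-oriented split described above. In the actual network the gates would be learned and the output would be passed frame by frame to the convLSTM.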

Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification

Attention-enhanced Joint Learning Network for Micro-Video Venue Classification

Hybrid-attention and frame difference enhanced network for micro-video venue recognition

Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations

Multimodal Progressive Modulation Network for Micro-video Multi-label Classification

Learning Dual Low-Rank Representation for Multi-Label Micro-Video Classification

Attention-enhanced and trusted multimodal learning for micro-video venue recognition

Towards Micro-video Understanding by Joint Sequential-Sparse Modeling

Enhancing Micro-video Understanding by Harnessing External Sounds

Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Hierarchy-Dependent Cross-Platform Multi-View Feature Learning for Venue Category Prediction

Online Data Organizer: Micro-Video Categorization by Structure-Guided Multimodal Dictionary Learning

Shorter-is-Better

Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction

Multimodal Semantic Enhanced Representation Network for Micro-Video Event Detection

Multimodal Deep Representation Learning for Video Classification

Context-aware focal alignment network for micro-video multi-label classification

Multi-modal gated recurrent units for image description

Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification