A Multimodal Aggregation Network with Serial Self-Attention Mechanism for Micro-Video Multi-Label Classification

Wei Lu, Jiaxin Lin, Peiguang Jing, Yuting Su
DOI: https://doi.org/10.1109/lsp.2023.3240889
2023-01-01
IEEE Signal Processing Letters
Abstract: Micro-videos have attracted increasing attention due to their unique properties and great commercial value. Because micro-videos naturally incorporate multimodal information, a powerful method for learning distinct joint multimodal representations is essential for real applications. Inspired by the potential of attention-based neural network architectures across various tasks, we propose a multimodal aggregation network (MANET) with a serial self-attention mechanism for micro-video multi-label classification. Specifically, we first propose a parallel content-dependent graph neural network (CDGNN) module, which explores category-related embeddings of micro-videos by disentangling category relations into modality-specific and modality-shared category dependency patterns. We then introduce a serial self-attention (SSA) module that transmits the multimodal information in sequential order and incorporates an aggregation bottleneck to better collect and condense the significant information. Experiments conducted on a large-scale multi-label micro-video dataset demonstrate that our proposed method achieves competitive results compared with several state-of-the-art methods.
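The idea of passing multimodal information through an aggregation bottleneck in sequential order can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so the function names (`self_attention`, `serial_aggregate`), the modality order, the single bottleneck token, the toy feature vectors, and the use of identity query/key/value projections are all assumptions made for illustration. Pure Python is used to keep the example dependency-free.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections for queries, keys, and values,
    so each output token is a convex combination of the input tokens.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def serial_aggregate(modalities, bottleneck):
    """Carry a small set of bottleneck tokens through each modality in turn.

    At every step the bottleneck tokens attend jointly with one modality's
    tokens; only the updated bottleneck tokens are passed to the next step.
    """
    n_b = len(bottleneck)
    for tokens in modalities:
        mixed = self_attention(bottleneck + tokens)
        bottleneck = mixed[:n_b]
    return bottleneck

# Hypothetical toy features (4-dimensional) for three modalities.
visual = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]]
acoustic = [[0.0, 1.0, 0.0, 0.0]]
textual = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.8, 0.2]]
bottleneck = [[0.25, 0.25, 0.25, 0.25]]

fused = serial_aggregate([visual, acoustic, textual], bottleneck)
print(fused)
```

In this sketch the bottleneck is the only path by which information from an earlier modality reaches a later one, which forces the condensed tokens to retain what matters. A real model would add learned projections, multiple heads, and a multi-label classification head on top of the fused tokens.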