Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations
Weijia Liu, Jiuxin Cao, Ran Wei, Xuelin Zhu, Bo Liu
DOI: https://doi.org/10.1109/tcsvt.2023.3349202
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Micro-video venue recognition aims to predict the venue category where a micro-video was filmed. Unlike traditional long videos, which contain rich temporal context, micro-videos are difficult to classify by venue because of their limited duration (generally within 6 s). Existing works usually extract features of each modality from a global perspective for prediction, neglecting the semantics carried by local objects. To address this, we propose Multi-Modal and Multi-Granularity Object Relations (M2ORE), which learns multi-granularity interactive semantics between venues and multimodal semantic objects to help understand venues. Specifically, M2ORE comprises two modules. The first extracts semantic objects of different modalities, i.e., visual objects in keyframes and keywords in texts, and models the affiliation relationship between semantic objects and venues and the co-occurrence relationship among semantic objects, forming a heterogeneous venue-object relation graph. Then, to capture the interactive semantics between venues and objects from the relation graph, a novel Parallel-Graph Inference Model (Parallel-GIM) is proposed, which updates the representation of nodes through graph propagation and fuses multi-level features (local-global-multimodal) through the devised hierarchical attention mechanism. Finally, the probability distribution of venues is obtained by feeding the comprehensive features of the venue nodes to a multi-layer perceptron. Extensive experiments on a real-world micro-video dataset demonstrate the superiority of the proposed M2ORE.
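The co-occurrence relation and graph propagation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' Parallel-GIM: the function names `cooccurrence_adjacency` and `propagate`, the toy object ids, and the mean-aggregation update rule are all assumptions chosen for illustration, since the abstract does not specify these details.

```python
import numpy as np


def cooccurrence_adjacency(samples, num_objects):
    """Count how often two object ids appear in the same sample.

    `samples` is a list of object-id lists, one per micro-video
    (hypothetical toy input, not the paper's data format).
    """
    adj = np.zeros((num_objects, num_objects), dtype=float)
    for objects in samples:
        unique = sorted(set(objects))
        for i in unique:
            for j in unique:
                if i != j:
                    adj[i, j] += 1.0
    return adj


def propagate(adj, features):
    """One mean-aggregation step with self-loops: H' = D^-1 (A + I) H.

    A generic propagation rule assumed here for illustration only.
    """
    a_hat = adj + np.eye(adj.shape[0])
    degree = a_hat.sum(axis=1, keepdims=True)
    return (a_hat / degree) @ features


# Three toy micro-videos, each listing the object ids detected in it.
samples = [[0, 1], [1, 2], [0, 1, 2]]
adj = cooccurrence_adjacency(samples, num_objects=3)

# One-hot node features, so each updated row shows the mixing weights.
features = np.eye(3)
updated = propagate(adj, features)
```

With one-hot input features, each row of `updated` shows how much a node draws from itself and from its co-occurring neighbours after a single step.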
engineering, electrical & electronic