Research on Video Retrieval Technology based on Multimodal Fusion and Attention Mechanism

Fanfeng Zeng, Tianyang Tai
DOI: https://doi.org/10.1145/3650400.3650477
2023-10-20
Abstract: Feature extraction and matching are crucial in video retrieval tasks. However, existing algorithms often overlook motion features in action-related videos and focus only on global static features. It is also difficult to distinguish key action features from background features, which makes it hard to capture global dependencies during convolution. The resulting features are less expressive, and retrieval accuracy suffers. In this paper, we propose a video retrieval model that combines multimodal fusion with an attention mechanism. Our model uses SlowFast as its backbone network, extracting skeleton motion features and static image features from video sequences using the Slow and Fast networks respectively. To address feature fusion, we introduce a 3D residual attention structure between the two branches. By incorporating bilateral connections and hash encoding, we construct a hash layer that maps features into binary codes, improving retrieval efficiency. Experimental results on the UCF101 and HMDB51 datasets validate the effectiveness of our approach, demonstrating its advantages over state-of-the-art video retrieval methods.
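To illustrate why mapping features into binary codes can make retrieval efficient, the following is a minimal sketch of hash-based retrieval in general, not the authors' implementation. It assumes the simplest possible binarization (thresholding each feature at zero) and ranks database items by Hamming distance; the abstract does not say how the hash layer is trained or thresholded. The function names and the toy feature vectors are hypothetical.

```python
def binarize(features):
    """Threshold real-valued features at zero (an assumed rule, not from the paper)."""
    return [1 if x > 0 else 0 for x in features]


def pack(bits):
    """Pack a list of 0/1 bits into one integer so a single XOR compares every bit."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code


def hamming(a, b):
    """Count the bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query_code, database_codes, top_k=3):
    """Return indices of the top_k database codes closest to the query in Hamming distance."""
    order = sorted(range(len(database_codes)),
                   key=lambda i: hamming(query_code, database_codes[i]))
    return order[:top_k]


# Toy 8-dimensional "video features" (made up for illustration only).
database = [
    [0.9, -0.2, 0.4, -0.7, 0.1, -0.5, 0.3, -0.8],
    [-0.6, 0.8, -0.1, 0.5, -0.9, 0.2, -0.4, 0.7],
    [0.7, 0.6, -0.3, -0.2, 0.8, 0.5, -0.6, -0.1],
]
database_codes = [pack(binarize(v)) for v in database]

query = [0.8, 0.5, -0.4, -0.3, 0.6, -0.2, -0.5, -0.3]
ranking = retrieve(pack(binarize(query)), database_codes)
print(ranking)  # [2, 0, 1]: item 2 differs from the query in only one bit
```

Comparing packed codes costs one XOR and a bit count per database item, which is the efficiency argument behind hashing-based retrieval; a learned hash layer would replace the fixed zero threshold used here.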
Computer Science