Visual-Semantics Embedding for Deep Hashing-Based Multi-Label Video Retrieval

Yuanhao Yue, Qin Zou, Ling Cao, Hongkai Yu, Chi Chen, Na Li
DOI: https://doi.org/10.2139/ssrn.4966047
2024-01-01
Abstract: With the explosive growth of short videos in mobile multimedia systems, there is an urgent need for efficient video retrieval. Over the past decade, deep hashing has occupied a dominant position in content-based visual retrieval and has achieved remarkable success in image retrieval. For video retrieval, however, existing hashing-based methods still leave considerable room for improvement, especially for multi-label videos. There are two main reasons: 1) Similarity is not well defined for multi-label videos. Traditionally, pairwise similarity is defined purely on the explicit text information of the labels, which ignores the implicit semantic relations between them and leads to inaccurate distance measurements; 2) Time-series information of video frames is not properly used in feature extraction. Most methods assign equal weight to each frame, neglecting the fact that the content of a video is often determined by a few keyframes. To address these problems, this paper proposes a novel video hashing method. First, a visual-semantics embedding soft similarity is developed to calculate the distance between a pair of videos, where the implicit semantic association of labels is learned by a graph convolutional network (GCN). Second, a hybrid attention module, consisting of a self-attention block and a relation-attention block, is integrated into the hashing network. The attention module assigns different weights to the frames according to their importance in determining the content of the video. Experimental results show that the proposed method achieves significant improvements over competing methods.
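The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The label embeddings below are hand-picked vectors standing in for what a GCN would learn, and the single dot-product attention with a hypothetical `query` vector stands in for the hybrid self-attention and relation-attention module. All function names, shapes, and values are assumptions made for illustration.

```python
import numpy as np

def hard_similarity(labels_a, labels_b):
    """Label-overlap similarity: 1.0 if the videos share any label, else 0.0."""
    return float(np.any(np.logical_and(labels_a, labels_b)))

def soft_similarity(labels_a, labels_b, label_emb):
    """Cosine similarity between the mean embeddings of each video's labels."""
    va = label_emb[labels_a.astype(bool)].mean(axis=0)
    vb = label_emb[labels_b.astype(bool)].mean(axis=0)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

def attention_pool(frames, query):
    """Softmax-weighted average of frame features (frames: T x D, query: D)."""
    scores = frames @ query / np.sqrt(frames.shape[1])
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ frames, weights

def to_hash(feature, proj):
    """Binary code taken from the sign of a linear projection."""
    return (feature @ proj > 0).astype(np.uint8)

# Assumed toy label space: index 0 = "dog", 1 = "puppy", 2 = "car".
# Embeddings are hand-picked so that "dog" and "puppy" lie close together.
label_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
video_dog = np.array([1, 0, 0])
video_puppy = np.array([0, 1, 0])
video_car = np.array([0, 0, 1])

hard_dp = hard_similarity(video_dog, video_puppy)             # no shared label
soft_dp = soft_similarity(video_dog, video_puppy, label_emb)  # related labels
soft_dc = soft_similarity(video_dog, video_car, label_emb)    # unrelated labels

# Assumed toy video: 5 frames with 4-dimensional features, pooled to one vector.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))
query = rng.normal(size=4)
pooled, weights = attention_pool(frames, query)
code = to_hash(pooled, rng.normal(size=(4, 8)))  # 8-bit hash code
```

In this toy setup the hard similarity treats the "dog" and "puppy" videos as unrelated because they share no label, while the soft similarity rates them as close. The attention weights let some frames contribute more than others to the pooled feature, rather than weighting every frame equally.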