Abstract: With the spread of camera devices such as mobile phones and the rise of short-video platforms, large numbers of videos are uploaded to the Internet continuously, which makes a fast and accurate video retrieval system necessary. Content-based video retrieval (CBVR) has therefore attracted the interest of many researchers. A typical CBVR system contains two essential parts: video feature extraction and similarity comparison. Video feature extraction is challenging: most previous video retrieval methods extract features from single video frames, which loses the temporal information in the videos. Hashing methods are widely used in multimedia information retrieval because of their retrieval efficiency, but most are currently applied only to image retrieval. To address these problems in video retrieval, we build an end-to-end framework called deep supervised video hashing (DSVH), which employs a 3D convolutional neural network (CNN) to obtain spatial-temporal features of videos and then trains a set of hash functions by supervised hashing to map the video features into binary space, yielding compact binary codes for the videos. The network is trained with a triplet loss. We conduct extensive experiments on three public video datasets, UCF-101, JHMDB and HMDB-51, and the results show that the proposed method has advantages over many state-of-the-art video retrieval methods. Compared with the DVH method, the mAP on the UCF-101 dataset improves by 9.3%, and the smallest improvement on the JHMDB dataset is 0.3%. We also demonstrate the stability of the algorithm on the HMDB-51 dataset.
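
The hashing and triplet-loss steps described in the abstract can be illustrated with a minimal sketch. It is not the DSVH implementation: the 3D CNN is omitted, and the zero threshold used for binarization, the squared Euclidean distance, the margin value, and all function names are assumptions made for illustration only. The inputs stand in for real-valued outputs of a hash layer.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> List[int]:
    """Map real-valued hash-layer outputs to bits by thresholding at zero (assumed rule)."""
    return [1 if x > 0 else 0 for x in features]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the bit positions where two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two real-valued vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Hinge-style triplet loss: pull the positive closer than the negative by a margin."""
    return max(0.0, squared_l2(anchor, positive) - squared_l2(anchor, negative) + margin)


def rank_by_hamming(query: Sequence[int], database: Sequence[Sequence[int]]) -> List[int]:
    """Return database indices sorted by ascending Hamming distance to the query code."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))


# Hypothetical hash-layer outputs for three videos (same class: anchor, positive).
anchor = [0.9, -0.8, 0.7, -0.6]
positive = [0.8, -0.9, 0.6, -0.7]
negative = [-0.9, 0.8, -0.7, 0.6]

codes = [binarize(negative), binarize(positive)]
ranking = rank_by_hamming(binarize(anchor), codes)  # positive (index 1) ranks first
```

The loss is zero when the same-class video is already closer to the anchor than the other-class video by at least the margin, and grows otherwise. At retrieval time only the binary codes are compared, which is what makes hashing-based search fast.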

Content-Based Video Relevance Prediction with Second-Order Relevance and Attention Modeling

Content-based Video Relevance Prediction Challenge: Data, Protocol, and Baseline

Unsupervised Teacher-Student Model for Large-Scale Video Retrieval.

BERT4SessRec

Semi-Siamese Network for Content-Based Video Relevance Prediction.

Relevance-guided Audio Visual Fusion for Video Saliency Prediction

Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval

Towards content-based relevance ranking for video search.

Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval

VRAG: Region Attention Graphs for Content-Based Video Retrieval

Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders

Exploiting Rich Contents for Personalized Video Recommendation.

A pseudo relevance feedback based cross domain video concept detection

Personalized Video Recommendation Using Rich Contents from Videos

A Supervised Video Hashing Method Based on a Deep 3D Convolutional Neural Network for Large-Scale Video Retrieval

VideoReach

Coarse-to-fine dual-level attention for video-text cross modal retrieval

UATVR: Uncertainty-Adaptive Text-Video Retrieval

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Sec2Sec Co-attention for Video-Based Apparent Affective Prediction