Fine-Grained Instance-Level Sketch-Based Video Retrieval

Peng Xu, Kun Liu, Tao Xiang, Timothy M. Hospedales, Zhanyu Ma, Jun Guo, Yi-Zhe Song
DOI: https://doi.org/10.1109/tcsvt.2020.3014491
IF: 5.859
2021-05-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Existing sketch-analysis work studies sketches depicting static objects or scenes. In this work, we propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR), where a sketch sequence is used as a query to retrieve a specific target video instance. Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level. We contribute the first FG-SBVR dataset with rich annotations. We then introduce a novel multi-stream multi-modality deep network to perform FG-SBVR under both strong and weakly supervised settings. The key component of the network is a relation module, designed to prevent model overfitting given scarce training data. We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
engineering, electrical & electronic
What problem does this paper attempt to address?
This paper addresses fine-grained instance-level sketch-based video retrieval (FG-SBVR): using a sketch sequence as a query to retrieve a specific target video instance. This is more challenging than sketch-based still image retrieval or coarse-grained category-level video retrieval because visual appearance and motion must be matched simultaneously at a fine-grained level. The main contributions of the paper are:
1. Proposing the novel problem of fine-grained instance-level sketch-based video retrieval.
2. Contributing the first FG-SBVR dataset with rich annotations, comprising 1,448 sketches corresponding to 528 figure skating video clips.
3. Introducing a novel multi-stream multi-modality deep network for FG-SBVR, studied under both strongly and weakly supervised settings.
4. Designing a relation module that prevents model overfitting when training data is scarce.
The paper reports that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
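To make the retrieval setting concrete, here is a minimal sketch of how instance-level retrieval is commonly scored once a model has mapped sketch queries and gallery videos into a shared embedding space. It is not the paper's network or its evaluation code: the Euclidean distance, the top-k accuracy metric, the function names, and the toy 2-D embeddings are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_videos(query_emb, video_embs):
    """Return gallery video indices sorted from nearest to farthest."""
    return sorted(
        range(len(video_embs)),
        key=lambda i: euclidean(query_emb, video_embs[i]),
    )


def acc_at_k(query_embs, target_ids, video_embs, k):
    """Fraction of queries whose true target video is in the top-k results."""
    hits = sum(
        1
        for q, t in zip(query_embs, target_ids)
        if t in rank_videos(q, video_embs)[:k]
    )
    return hits / len(query_embs)


# Toy embeddings (made up for illustration, not taken from the paper).
videos = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sketches = [[0.1, 0.0], [0.9, 0.2], [0.2, 0.4]]
targets = [0, 1, 2]  # sketch i was drawn for video targets[i]

print(rank_videos(sketches[2], videos))
print(acc_at_k(sketches, targets, videos, k=1))
print(acc_at_k(sketches, targets, videos, k=2))
```

In this toy example the third sketch ranks its own video second rather than first, which is exactly the kind of near miss that separates instance-level retrieval from category-level retrieval: a result from the right category but the wrong clip still counts as a failure.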