RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations

Savya Khosla, Sethuraman T V, Alexander Schwing, Derek Hoiem
2024-12-03
Abstract: We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper addresses the challenging task of **Visual Query Localization (VQL)**: given a visual query, precisely localizing the last appearance of the target object in a long video. It proposes **RELOCATE**, a simple, training-free baseline for visual query localization in long videos.

#### Main challenges of the VQL task:

1. **Open-ended object categories**: Unlike traditional object detection, VQL must handle an open-ended set of objects rather than a fixed list of categories.
2. **Reference images from outside the video**: The query image (the visual query) usually comes from outside the video, so there are no exact or adjacent frames to match against reliably.
3. **Large appearance variations**: The appearance of the target object may change significantly with viewing angle, scale, background, illumination, motion blur, and occlusion.
4. **Brief appearance**: The target object usually appears only briefly in a long video (for example, for less than 0.5 seconds) and may be only partially visible.
5. **Complex scenes**: The target object may appear in cluttered scenes or blend into the background, which makes localization harder.

#### Main contributions of RELOCATE:

1. **Training-free**: RELOCATE requires no task-specific training, which removes the need for large-scale labeled data and allows the same video encoding to be reused across multiple queries.
2. **Region-based representation**: It uses a pretrained vision model to extract region representations, forming a detailed yet compact video encoding. Object retrieval can then be performed efficiently and robustly with a simple matching function such as cosine similarity.
3. **Enhanced multi-stage framework**: It adds a refinement step for the candidate objects and generates additional visual queries, which improves the accuracy and robustness of localization.
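The retrieval step named in the contributions (matching region representations against a query with cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `top_matches`, the `top_k` parameter, and the toy 2-D vectors are all assumptions made for illustration. In practice the embeddings would come from a pretrained vision model.

```python
import numpy as np

def cosine_similarity(query, regions):
    """Cosine similarity between one query vector and each row of `regions`."""
    query = query / np.linalg.norm(query)
    regions = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    return regions @ query

def top_matches(query, regions, frame_ids, top_k=3):
    """Return (frame_id, region_index, score) for the top_k most similar regions.

    Hypothetical helper: `regions` holds one embedding per candidate region,
    and `frame_ids[i]` is the video frame that region i was extracted from.
    """
    scores = cosine_similarity(query, regions)
    order = np.argsort(-scores)[:top_k]  # indices sorted by descending score
    return [(int(frame_ids[i]), int(i), float(scores[i])) for i in order]

# Toy data standing in for precomputed region embeddings (assumed values).
regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
frame_ids = np.array([0, 0, 5, 7])
query = np.array([1.0, 0.0])

matches = top_matches(query, regions, frame_ids, top_k=2)
for frame_id, region_index, score in matches:
    print(frame_id, region_index, round(score, 3))
```

Because the region embeddings do not depend on the query, they can be computed once per video and reused for every new query, which is the reuse property the summary attributes to the training-free design.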
#### Experimental results:

RELOCATE was evaluated on the Ego4D Visual Query 2D (VQ2D) localization benchmark. Compared with the previous best method, it achieves a 49% relative improvement in spatio-temporal average precision (stAP25) and a 33% improvement in temporal average precision (tAP25).

In conclusion, RELOCATE addresses many of the challenges that existing VQL methods face on long videos and achieves a significant performance improvement without task-specific training.
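For readers comparing these figures, a relative improvement is measured against the previous score rather than as an absolute difference in points. A generic form is given below. The symbols $s_{\text{new}}$ and $s_{\text{prev}}$ are placeholders introduced here, not notation from the paper.

$$
\text{relative improvement} = \frac{s_{\text{new}} - s_{\text{prev}}}{s_{\text{prev}}} \times 100\%
$$

Under this definition, a 49% relative improvement corresponds to $s_{\text{new}} \approx 1.49\, s_{\text{prev}}$.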