RELOCATE: A Simple Training-Free Baseline for Visual Query Localization Using Region-Based Representations

Savya Khosla, Sethuraman T V, Alexander Schwing, Derek Hoiem

2024-12-03

Abstract: We present RELOCATE, a simple training-free baseline designed to perform the challenging task of visual query localization in long videos. To eliminate the need for task-specific training and efficiently handle long videos, RELOCATE leverages a region-based representation derived from pretrained vision models. At a high level, it follows the classic object localization approach: (1) identify all objects in each video frame, (2) compare the objects with the given query and select the most similar ones, and (3) perform bidirectional tracking to get a spatio-temporal response. However, we propose some key enhancements to handle small objects, cluttered scenes, partial visibility, and varying appearances. Notably, we refine the selected objects for accurate localization and generate additional visual queries to capture visual variations. We evaluate RELOCATE on the challenging Ego4D Visual Query 2D Localization dataset, establishing a new baseline that outperforms prior task-specific methods by 49% (relative improvement) in spatio-temporal average precision.
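Step (3) of the abstract's pipeline, bidirectional tracking, can be illustrated with a deliberately simplified sketch. It assumes each frame has already been reduced to a single best-match similarity score, and it replaces a real object tracker with a plain score threshold; the function name `expand_bidirectionally`, the threshold value, and the toy scores are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def expand_bidirectionally(
    frame_scores: List[float], seed: int, threshold: float
) -> Tuple[int, int]:
    """Grow a temporal window around `seed` in both directions.

    Stand-in for tracking: a neighbouring frame is kept while its
    best-match score stays at or above `threshold` (an assumption,
    not the paper's tracker).
    """
    if not 0 <= seed < len(frame_scores):
        raise IndexError("seed frame out of range")

    # Walk backward from the seed frame.
    start = seed
    while start - 1 >= 0 and frame_scores[start - 1] >= threshold:
        start -= 1

    # Walk forward from the seed frame.
    end = seed
    while end + 1 < len(frame_scores) and frame_scores[end + 1] >= threshold:
        end += 1

    return start, end


# Toy per-frame scores; frame 3 is the selected (seed) frame.
scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1]
print(expand_bidirectionally(scores, seed=3, threshold=0.6))  # (2, 4)
```

The returned pair is the first and last frame index of the response window; a full system would also carry a bounding box per frame to make the response spatio-temporal.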

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the challenging task of **Visual Query Localization (VQL)**: precisely locating the last appearance of a target object in a long video. It proposes **RELOCATE**, a simple, training-free baseline for visual query localization in long videos.

#### Main challenges of the VQL task

1. **Open-ended object categories**: Unlike traditional object detection, VQL must handle open-ended object types rather than a fixed set of categories.
2. **Reference images from outside the video**: The query image (the visual query) usually comes from outside the video, so there are no identical or adjacent frames to match against reliably.
3. **Large appearance variations**: The target object's appearance may change significantly with viewing angle, scale, background, illumination, motion blur, and occlusion.
4. **Brief appearance**: The target object usually appears only briefly in a long video (for example, for less than 0.5 seconds) and may be only partially visible.
5. **Complex scenes**: The target object may appear in cluttered scenes or blend in with the background, which makes localization harder.

#### Main contributions of RELOCATE

1. **Training-free**: RELOCATE requires no task-specific training, which removes the need for large-scale labeled data and lets the same video encoding be reused across multiple queries.
2. **Region-based representation**: It uses a pretrained vision model to extract region representations, forming a detailed yet compact video encoding. Object retrieval can then be done efficiently and robustly with a simple matching function such as cosine similarity.
3. **Enhanced multi-stage framework**: It adds a refinement step for candidate objects and generates additional visual queries, which improves the accuracy and robustness of localization.
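The matching step described above can be sketched as follows. This is a minimal illustration, assuming region embeddings have already been extracted and stored per frame; the tuple layout, the 0.9 threshold, and the function names are hypothetical, and the refinement and query-expansion stages of RELOCATE are not modeled.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
# (frame index, region id, embedding): a hypothetical storage layout.
Region = Tuple[int, int, Vector]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def last_confident_match(
    query: Vector, regions: List[Region], threshold: float
) -> Optional[Tuple[int, int]]:
    """Return (frame, region id) of the latest region scoring >= threshold."""
    hits = [
        (frame, region_id)
        for frame, region_id, emb in regions
        if cosine_similarity(query, emb) >= threshold
    ]
    # VQL asks for the last appearance, so prefer the highest frame index.
    return max(hits) if hits else None


# Toy video encoding: computed once, reusable for any number of queries.
video_regions: List[Region] = [
    (0, 0, [1.0, 0.0, 0.0]),
    (0, 1, [0.0, 1.0, 0.0]),
    (5, 0, [0.9, 0.1, 0.0]),
    (9, 2, [0.7, 0.7, 0.0]),
]
print(last_confident_match([1.0, 0.0, 0.0], video_regions, threshold=0.9))
```

Because the video encoding is independent of the query, a second query only needs another pass over `video_regions`, which is the reuse benefit noted under the training-free contribution.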
#### Experimental results

RELOCATE was evaluated on the Ego4D Visual Query 2D (VQ2D) localization benchmark. Compared with the previous best method, it achieves a 49% relative improvement in spatio-temporal average precision (stAP25) and a 33% improvement in temporal average precision (tAP25).

In conclusion, RELOCATE addresses several challenges that existing VQL methods face on long videos and achieves a significant performance improvement.
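For readers unsure how a relative improvement differs from an absolute one, the short sketch below shows the arithmetic. The scores used are purely illustrative placeholders chosen to produce a 49% figure; they are not the values reported in the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional gain of `new` over `baseline`, e.g. 0.49 for +49%."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Hypothetical scores, NOT the paper's numbers.
baseline_score = 0.20
new_score = 0.298

gain = relative_improvement(baseline_score, new_score)
print(f"relative improvement: {gain:.0%}")
print(f"absolute improvement: {new_score - baseline_score:.3f}")
```

With these placeholder values the absolute gain is under 0.1, while the relative gain is 49%, so the two figures should not be confused when comparing methods.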
