Abstract: Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the "whole" image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for "image segments" instead of the whole images. We propose to use open-set image segmentation to decompose an image into "meaningful" entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole-image-based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to "revisit anything" by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: https://github.com/AnyLoc/Revisit-Anything

What problem does this paper attempt to address?

This paper addresses a core problem in Visual Place Recognition (VPR): accurately recognizing a revisited place despite significant viewpoint changes. Existing methods typically encode the entire image and search for matches. This struggles when two images of the same place only partially overlap, because the similarity of the overlapping parts can be dominated by the dissimilarity of the non-overlapping parts, leading to matching failures.

To solve this problem, the authors retrieve image segments instead of whole images. Open-set image segmentation decomposes an image into "meaningful" entities (i.e., "things" and "stuff"). From these, the authors build a new image representation: a collection of multiple overlapping subgraphs, each connecting a segment with its neighboring segments and dubbed a SuperSegment. To encode these SuperSegments efficiently into compact vector representations, the authors also propose a novel factorized representation of feature aggregation.

The main contributions of this paper are:

1. **SuperSegments**: An image representation consisting of multiple overlapping subgraphs of neighboring segments, which enables accurate recognition between partially overlapping images.
2. **Factorized representation of feature aggregation**: Effectively combines segment-level information with segment-neighborhood information.
3. **Similarity-weighted ranking**: Converts segment-level retrieval results into image-level retrieval results.

Experiments on multiple benchmark datasets show that the method achieves higher recognition recall under a wide range of viewpoint changes. It also shows potential on an object instance retrieval task, thereby connecting the research fields of visual place recognition and object-goal navigation.
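The third contribution, turning segment-level retrievals into an image-level result, can be illustrated with a small sketch. It assumes that each query segment retrieves a handful of reference segments, that each retrieved segment is tagged with the reference image it came from, and that every retrieval votes for that image with a weight equal to its similarity score. The function name `rank_images`, the image identifiers, and the scores are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import defaultdict


def rank_images(segment_matches):
    """Aggregate segment-level retrievals into image-level scores.

    segment_matches: one list per query segment; each list holds
    (reference_image_id, similarity) pairs for that segment's retrievals.
    Returns the best-scoring image id and the full score table.
    """
    scores = defaultdict(float)
    for matches in segment_matches:
        for image_id, similarity in matches:
            # Each retrieved segment votes for its parent image,
            # weighted by how similar it is to the query segment.
            scores[image_id] += similarity
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Toy data (made up): three query segments, two retrievals each.
query_matches = [
    [("ref_A", 0.9), ("ref_B", 0.4)],
    [("ref_A", 0.7), ("ref_C", 0.6)],
    [("ref_B", 0.8), ("ref_A", 0.5)],
]
best_image, image_scores = rank_images(query_matches)
print(best_image, image_scores)
```

Summing similarities rather than counting votes lets a few strong segment matches outweigh many weak ones, which is one plausible reading of "similarity-weighted" ranking.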

### Formula Summary

- **SuperSegment mask generation**:
  \[ \tilde{M}_{S\times N}=\mathbb{1}\left(A^{o}_{S\times S}\cdot M_{S\times N}\right) \]
  where \(\tilde{M}\) denotes the resulting SuperSegment masks, \(\mathbb{1}(\cdot)\) is the element-wise indicator function, and \(o\geq0\) is the order to which the neighborhood is expanded by multiplying the adjacency matrix \(A\) with itself.
- **SuperSegment descriptor generation**:
  \[ F_{S\times D}=\mathbb{1}\left(A^{o}_{S\times S}\cdot M_{S\times N}\right)\cdot T_{N\times D} \]
  where \(T\) is the feature matrix to be aggregated.
- **Hard-VLAD residual feature matrix**:
  \[ T_{k}^{N_{k}\times D}=\left\{\alpha_{k}(f_{p})\,(f_{p}-c_{k}) \;\middle|\; \alpha_{k}(f_{p})=1\right\}; \quad N_{k}=\sum_{p}\alpha_{k}(f_{p}) \]
  where \(\alpha_{k}(f_{p})\in\{0,1\}\) is 1 when \(f_{p}\) is assigned to the cluster center \(c_{k}\).
- **Weighted frequency metric in image retrieval**:
  \[ r_{j}^{*}=\arg\max_{r_{j}}\hat{\theta}(r_{j}); \quad \hat{\theta}(r_{j})=\sum_{s=1}^{S}\sum_{k=1}^{K'}\theta_{sk}\cdot\mathbb{1}\{r_{sk}=r_{j}\} \]

Through these innovations, this paper provides a more robust and efficient solution for visual place recognition, especially in the presence of significant viewpoint changes.
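The mask and descriptor formulas above can be sketched in plain Python on a toy example. Several things here are assumptions rather than details confirmed by the summary: \(N\) is treated as the number of pixels, the adjacency matrix \(A\) includes self-loops, each \(T_k\) is zero-padded to \(N\times D\) so that it can be multiplied by the \(S\times N\) masks, the per-cluster results are concatenated, and no normalization is applied. All function names are hypothetical and this is not the authors' implementation.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matrix_power(A, order):
    """Raise A to a non-negative integer power (order 0 gives the identity)."""
    size = len(A)
    result = [[1 if i == j else 0 for j in range(size)] for i in range(size)]
    for _ in range(order):
        result = matmul(result, A)
    return result


def supersegment_masks(A, M, order):
    """1(A^o . M): merge each segment's mask with its order-o neighbours."""
    product = matmul(matrix_power(A, order), M)
    return [[1 if value != 0 else 0 for value in row] for row in product]


def hard_vlad_residuals(features, centers, k):
    """Row p is f_p - c_k if c_k is the nearest center to f_p, else zeros."""
    def nearest(f):
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centers]
        return dists.index(min(dists))
    return [
        [a - b for a, b in zip(f, centers[k])] if nearest(f) == k else [0.0] * len(f)
        for f in features
    ]


def supersegment_descriptors(A, M, features, centers, order):
    """Aggregate residuals per SuperSegment and concatenate over clusters."""
    masks = supersegment_masks(A, M, order)
    blocks = [
        matmul(masks, hard_vlad_residuals(features, centers, k))
        for k in range(len(centers))
    ]
    descriptors = []
    for s in range(len(masks)):
        row = []
        for block in blocks:
            row.extend(block[s])
        descriptors.append(row)
    return descriptors


# Toy scene (made up): 3 segments in a chain 0-1-2, 4 pixels, D = 2, K = 2.
A = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]            # adjacency with self-loops
M = [[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # segment-to-pixel masks
features = [[0.0, 1.0], [1.0, 2.0], [4.0, 5.0], [5.0, 3.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]

masks = supersegment_masks(A, M, 1)
descriptors = supersegment_descriptors(A, M, features, centers, 1)
print(masks)
print(descriptors)
```

With `order=0` the masks reduce to the original segments, and larger orders grow each SuperSegment to cover neighbours of neighbours. The factorization means the residuals are computed once per image and reused for every SuperSegment through a single matrix product.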

Revisit Anything: Visual Place Recognition via Image Segment Retrieval
