Revisit Anything: Visual Place Recognition via Image Segment Retrieval

Kartik Garg, Sai Shubodh Puligilla, Shishir Kolathaya, Madhava Krishna, Sourav Garg
2024-09-27
Abstract: Accurately recognizing a revisited place is crucial for embodied agents to localize and navigate. This requires visual representations to be distinct, despite strong variations in camera viewpoint and scene appearance. Existing visual place recognition pipelines encode the "whole" image and search for matches. This poses a fundamental challenge in matching two images of the same place captured from different camera viewpoints: "the similarity of what overlaps can be dominated by the dissimilarity of what does not overlap". We address this by encoding and searching for "image segments" instead of the whole images. We propose to use open-set image segmentation to decompose an image into "meaningful" entities (i.e., things and stuff). This enables us to create a novel image representation as a collection of multiple overlapping subgraphs connecting a segment with its neighboring segments, dubbed SuperSegment. Furthermore, to efficiently encode these SuperSegments into compact vector representations, we propose a novel factorized representation of feature aggregation. We show that retrieving these partial representations leads to significantly higher recognition recall than the typical whole image based retrieval. Our segments-based approach, dubbed SegVLAD, sets a new state-of-the-art in place recognition on a diverse selection of benchmark datasets, while being applicable to both generic and task-specialized image encoders. Finally, we demonstrate the potential of our method to "revisit anything" by evaluating our method on an object instance retrieval task, which bridges the two disparate areas of research: visual place recognition and object-goal navigation, through their common aim of recognizing goal objects specific to a place. Source code: https://github.com/AnyLoc/Revisit-Anything
Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval, Machine Learning, Robotics
What problem does this paper attempt to address?
This paper addresses a central problem in Visual Place Recognition (VPR): accurately recognizing a revisited place under significant viewpoint change. Existing methods usually encode the entire image and search for matches, which is difficult when two images only partially overlap, because the similarity of the overlapping parts can be dominated by the dissimilarity of the non-overlapping parts, leading to matching failures. To address this, the authors retrieve image segments instead of whole images. Open-set image segmentation decomposes each image into "meaningful" entities (i.e., "things" and "stuff"), from which they build a new image representation called the SuperSegment: a collection of multiple overlapping subgraphs, each connecting a segment with its neighboring segments. To encode these SuperSegments efficiently into compact vector representations, the authors also propose a novel factorized representation of feature aggregation.

The main contributions of this paper are:

1. **SuperSegments**: An image representation built from multiple overlapping subgraphs of neighboring segments, which enables accurate recognition between partially overlapping images.
2. **Factorized representation of feature aggregation**: Efficiently combines segment-level information with segment-neighborhood information.
3. **Similarity-weighted ranking**: Converts segment-level retrieval results into image-level retrieval results.

In experiments on multiple benchmark datasets, the method achieves higher recognition recall than whole-image retrieval under a wide range of viewpoint changes. It also shows potential on an object instance retrieval task, connecting the research fields of visual place recognition and object-goal navigation.
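The third contribution, turning segment-level matches into an image-level result, can be illustrated with a similarity-weighted vote. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `rank_images_by_weighted_votes`, its arguments, and the toy data are all hypothetical, and each query segment is assumed to have already retrieved its top matches along with a similarity score and the index of the reference image each match came from.

```python
import numpy as np

def rank_images_by_weighted_votes(retrieved_image_ids, similarities, num_reference_images):
    """Sum similarity-weighted votes per reference image (hypothetical helper).

    retrieved_image_ids: (S, K) ints, reference image of the k-th match of query segment s.
    similarities:        (S, K) floats, similarity score of that match.
    Returns the index of the best-scoring reference image and all image scores.
    """
    ids = np.asarray(retrieved_image_ids).ravel()
    weights = np.asarray(similarities, dtype=float).ravel()
    # Each retrieved segment votes for its parent image, weighted by its similarity.
    scores = np.bincount(ids, weights=weights, minlength=num_reference_images)
    return int(np.argmax(scores)), scores

# Toy data (made up): 3 query segments, 2 retrieved matches each, 3 reference images.
retrieved = np.array([[0, 2],
                      [2, 1],
                      [2, 0]])
sims = np.array([[0.90, 0.50],
                 [0.80, 0.30],
                 [0.70, 0.25]])

best_image, image_scores = rank_images_by_weighted_votes(retrieved, sims, 3)
print(best_image)  # 2
```

In this toy case image 0 holds the single strongest match, but image 2 wins because three different query segments vote for it. That is the intended effect of aggregating over segments.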
### Formula Summary

- **SuperSegment mask generation**:
  \[ M_{S\times N}=\mathbb{1}\left(A^{o}_{S\times S}\cdot M_{S\times N}\right) \]
  where \(o\geq 0\) is the order of neighborhood expansion, obtained by multiplying the adjacency matrix \(A\) with itself, and \(\mathbb{1}(\cdot)\) binarizes the result.
- **SuperSegment descriptor generation**:
  \[ F_{S\times D}=\mathbb{1}\left(A^{o}_{S\times S}\cdot M_{S\times N}\right)\cdot T_{N\times D} \]
  where \(T\) is the feature matrix to be aggregated.
- **Hard-VLAD residual feature matrix**:
  \[ T_{k}^{N_{k}\times D}=\{\alpha_{k}(f_{p})\,(f_{p}-c_{k}) \mid \alpha_{k}(f_{p})=1\}; \quad N_{k}=\sum_{p}\alpha_{k}(f_{p}) \]
  where \(\alpha_{k}(f_{p})\in\{0,1\}\) equals 1 when feature \(f_{p}\) is assigned to cluster center \(c_{k}\).
- **Weighted frequency metric in image retrieval**:
  \[ r_{j}^{*}=\arg\max_{r_{j}}\hat{\theta}(r_{j}); \quad \hat{\theta}(r_{j})=\sum_{s=1}^{S}\sum_{k=1}^{K'}\theta_{sk}\cdot\mathbb{1}\{r_{sk}=r_{j}\} \]
  where \(\theta_{sk}\) is the similarity weight of the \(k\)-th of \(K'\) retrieved matches for query segment \(s\), \(r_{sk}\) is the reference image that match belongs to, and \(r_{j}^{*}\) is the top-ranked reference image.

Through these innovations, the paper provides a more robust and efficient solution for visual place recognition, especially in the presence of significant viewpoint changes.
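The mask and descriptor formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code. The function names and toy matrices are hypothetical, the adjacency matrix is assumed to contain self-loops (ones on the diagonal) so that every segment stays inside its own SuperSegment, and the toy `T` holds arbitrary features where the per-cluster residuals \(T_k\) would go.

```python
import numpy as np

def supersegment_masks(adjacency, masks, order):
    """Binarized A^o · M: each row is a segment merged with its neighbors up to `order` hops."""
    A = np.asarray(adjacency, dtype=np.int64)
    M = np.asarray(masks, dtype=np.int64)
    # order = 0 yields the identity, i.e. the original segments are left unchanged.
    A_o = np.linalg.matrix_power(A, order)
    return (A_o @ M > 0).astype(np.int64)

def supersegment_descriptors(adjacency, masks, features, order):
    """F = 1(A^o · M) · T: sum the features covered by each SuperSegment mask."""
    return supersegment_masks(adjacency, masks, order) @ np.asarray(features, dtype=float)

# Toy data (made up): 3 segments in a chain 0-1-2, covering 4 feature locations.
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]])          # assumption: self-loops included
M = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])       # S x N binary segment masks
T = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0],
              [3.0, 1.0]])         # N x D features to aggregate

F0 = supersegment_descriptors(A, M, T, order=0)  # plain per-segment sums
F1 = supersegment_descriptors(A, M, T, order=1)  # segment plus direct neighbors
print(F1)
```

Aggregation reduces to a matrix product here, so changing the neighborhood order only changes the binary mask matrix while `T` is reused as is.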