MS-NetVLAD: Multi-Scale NetVLAD for Visual Place Recognition

Anuradha Uggi, Sumohana S. Channappayya
DOI: https://doi.org/10.1109/lsp.2024.3425279
2024-07-23
IEEE Signal Processing Letters
Abstract: Many successful Visual Place Recognition (VPR) techniques operate in a contrastive learning framework using features extracted from a Convolutional Neural Network (CNN) backbone. Among these, NetVLAD is a popular framework that transforms the classical Vector of Locally Aggregated Descriptors (VLAD) method into a modern data-driven model. Introducing learnability in VLAD has led to several variants of NetVLAD, such as Patch-NetVLAD. However, many of these use only the bottleneck features of the backbone model, ignoring the rest of the feature hierarchy. A few state-of-the-art models adopt complex architectures to improve the quality of features. In this letter, we propose a simple extension to NetVLAD that leverages the feature representations from intermediate layers of the CNN backbone in addition to the bottleneck features. We conduct extensive experiments to demonstrate the significance of these intermediate features for VPR. The proposed method, which we call Multi-Scale-NetVLAD (MS-NetVLAD), surpasses the successful NetVLAD and Patch-NetVLAD models by a significant margin. We demonstrate consistent performance improvements on large-scale VPR benchmarks, including Pittsburgh 30k, Tokyo 24/7, Nordland, and MSLS. This improvement is attributed to the complementary multi-scale features employed by MS-NetVLAD. Importantly, this work reinforces the inherent strength of the NetVLAD framework for VPR. Further, MS-NetVLAD is shown to be competitive with state-of-the-art VPR models such as MixVPR and R2Former.
engineering, electrical & electronic
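
The idea summarized in the abstract, aggregating local descriptors from intermediate layers as well as the bottleneck, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the scales are fused, so concatenating one VLAD vector per scale is an assumption made here. The soft assignment uses fixed centers and a fixed `alpha` rather than learned parameters, and all function names, shapes, and the random inputs are hypothetical.

```python
import math
import random


def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def soft_assign(x, centers, alpha):
    """Softmax over -alpha * squared distance from x to each center."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
              for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vlad(descriptors, centers, alpha=10.0):
    """Soft-assignment VLAD: weighted residuals summed per cluster."""
    k, d = len(centers), len(centers[0])
    agg = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        w = soft_assign(x, centers, alpha)
        for j in range(k):
            for i in range(d):
                agg[j][i] += w[j] * (x[i] - centers[j][i])
    # Intra-normalize each cluster's residual, flatten, then L2-normalize.
    flat = [v for row in agg for v in l2_normalize(row)]
    return l2_normalize(flat)


def multi_scale_vlad(feature_maps, centers_per_scale, alpha=10.0):
    """ASSUMPTION: fuse scales by concatenating one VLAD vector per scale."""
    parts = []
    for descs, centers in zip(feature_maps, centers_per_scale):
        parts.extend(vlad(descs, centers, alpha))
    return l2_normalize(parts)


def random_set(n, d):
    """Hypothetical stand-in for n local descriptors of dimension d."""
    return [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


random.seed(0)
mid = random_set(64, 8)    # stand-in for intermediate-layer descriptors
deep = random_set(16, 16)  # stand-in for bottleneck descriptors
centers = [random_set(4, 8), random_set(4, 16)]  # 4 clusters per scale
desc = multi_scale_vlad([mid, deep], centers)
print(len(desc))  # 4*8 + 4*16 = 96
```

In a trained model the descriptors would come from CNN feature maps and the assignment parameters would be learned end to end. The sketch only shows how descriptors of different spatial resolution and channel width can each be pooled into a fixed-length vector and combined into a single global descriptor.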