Abstract: Extracting robust and discriminative local features from images plays a vital role in long-term visual localization, whose challenges are mainly caused by severe appearance differences between matching images due to day-night illumination changes, seasonal changes, and human activities. Existing solutions jointly learn keypoints and their descriptors in an end-to-end manner, relying on a large number of point-correspondence annotations harvested from structure-from-motion and depth estimation algorithms. While these methods show improved performance over non-deep methods and two-stage deep methods (detection followed by description), they still struggle with the problems encountered in long-term visual localization. Since intrinsic semantics are invariant to local appearance changes, this paper proposes to learn semantic-aware local features in order to improve the robustness of local feature matching for long-term localization. Building on a state-of-the-art CNN architecture for local feature learning, ASLFeat, this paper leverages semantic information from an off-the-shelf semantic segmentation network to learn semantic-aware feature maps. The learned correspondence-aware feature descriptors and semantic features are then merged to form the final feature descriptors, which show improved feature matching ability in experiments. In addition, the learned semantics embedded in the features can be used to filter out noisy keypoints, leading to further accuracy improvement and faster matching. Experiments on two popular long-term visual localization benchmarks (Aachen Day and Night v1.1, RobotCar Seasons) and one challenging indoor benchmark (InLoc) demonstrate encouraging improvements in localization accuracy over its counterpart and other competitive methods.
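
The two uses of semantics described in the abstract, merging semantic features into the descriptors and filtering noisy keypoints, can be illustrated with a minimal sketch. The abstract does not say how the merge or the filtering is performed, so everything below is an assumption made for illustration: the merge is modeled as concatenation followed by L2 normalization, the filter as dropping keypoints whose semantic label falls in a hypothetical set of unstable classes, and the names `merge_descriptors`, `filter_keypoints`, and `UNSTABLE_CLASSES` are invented here rather than taken from the paper.

```python
import math

# Hypothetical set of classes assumed to be unstable over time
# (not taken from the paper).
UNSTABLE_CLASSES = {"sky", "person", "car"}


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy if its norm is zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def merge_descriptors(corr_desc, sem_feat):
    """Assumed merge: normalize each part, concatenate, renormalize."""
    return l2_normalize(l2_normalize(corr_desc) + l2_normalize(sem_feat))


def filter_keypoints(keypoints, labels, unstable=UNSTABLE_CLASSES):
    """Keep only keypoints whose semantic label is not in `unstable`."""
    return [kp for kp, lab in zip(keypoints, labels) if lab not in unstable]


# Toy example with made-up keypoints, labels, and descriptors.
keypoints = [(10, 20), (35, 5), (50, 60)]
labels = ["building", "sky", "car"]
kept = filter_keypoints(keypoints, labels)

merged = merge_descriptors([3.0, 4.0], [0.0, 2.0])
```

In this toy setup only the keypoint on the building survives the filter, and the merged descriptor has the combined length of its two parts with unit norm. Removing keypoints before matching also shrinks the candidate set, which is one plausible reading of the faster matching the abstract reports.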

SemFE: A Feature Matching Method for Learnable Local Semantic Feature Enhancement in Multimodal Images

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

Enhancing Unimodal Features Matters: A Multimodal Framework for Building Extraction

FeMIP: Detector-Free Feature Matching for Multimodal Images with Policy Gradient

Multi-scale Matching Networks for Semantic Correspondence

Ensemble Learning with Advanced Fast Image Filtering Features for Semi-Global Matching

Image Feature Matching Based on Semantic Fusion Description and Spatial Consistency

Learning Semantic-Aware Local Features for Long Term Visual Localization

Leveraging Semantic Cues from Foundation Vision Models for Enhanced Local Feature Correspondence

HomoMatcher: Dense Feature Matching Results with Semi-Dense Efficiency by Homography Estimation

MFEAFN: Multi-scale feature enhanced adaptive fusion network for image semantic segmentation

MLIFeat: Multi-level Information Fusion Based Deep Local Features

Matching Image with Multiple Local Features

Learning Local Features by Jointly Semantic-guided and Task Rewards

Category-Wise Fusion and Enhancement Learning for Multimodal Remote Sensing Image Semantic Segmentation

Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching

Deep Semantic Feature Matching Using Confidential Correspondence Consistency

DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching

Learning Semantic Alignment Using Global Features and Multi-scale Confidence

Multimodal Remote Sensing Image Matching via Learning Features and Attention Mechanism

MambaLF: an Efficient Local Feature Extraction and Matching with State Space Model