Abstract: Extracting robust and discriminative local features from images plays a vital role in long-term visual localization, whose challenges stem mainly from severe appearance differences between matching images caused by day-night illumination, seasonal changes, and human activities. Existing solutions jointly learn keypoints and their descriptors in an end-to-end manner, relying on a large number of point-correspondence annotations harvested from structure-from-motion and depth estimation algorithms. While these methods outperform non-deep methods and two-stage deep methods (i.e., detection followed by description), they still struggle with the problems encountered in long-term visual localization. Since intrinsic semantics are invariant to local appearance changes, this paper proposes to learn semantic-aware local features to improve the robustness of local feature matching for long-term localization. Building on a state-of-the-art CNN architecture for local feature learning, i.e., ASLFeat, this paper leverages semantic information from an off-the-shelf semantic segmentation network to learn semantic-aware feature maps. The learned correspondence-aware feature descriptors and semantic features are then merged to form the final feature descriptors, which show improved feature matching ability in experiments. In addition, the learned semantics embedded in the features can be further used to filter out noisy keypoints, leading to additional accuracy improvement and faster matching. Experiments on two popular long-term visual localization benchmarks (Aachen Day and Night v1.1, RobotCar Seasons) and one challenging indoor benchmark (InLoc) demonstrate encouraging improvements in localization accuracy over its counterpart and other competitive methods.
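
The two ideas in the abstract, merging correspondence-aware descriptors with semantic features and using semantics to discard noisy keypoints, can be illustrated with a minimal NumPy sketch. The abstract does not say how the merge or the filtering is carried out, so everything here is an assumption made for illustration: the merge is taken to be a concatenation of L2-normalized vectors, the filter is a simple per-keypoint label lookup, and the function names and the `unstable_labels` set are hypothetical. This is not the paper's implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def merge_descriptors(desc, sem_feat):
    """Assumed merge: concatenate normalized descriptors and semantic features.

    desc:     (N, D) correspondence-aware descriptors, one row per keypoint.
    sem_feat: (N, S) semantic features sampled at the same keypoints.
    Returns an (N, D + S) array of unit-norm merged descriptors.
    """
    merged = np.concatenate([l2_normalize(desc), l2_normalize(sem_feat)], axis=-1)
    return l2_normalize(merged)


def filter_keypoints(keypoints, labels, unstable_labels):
    """Assumed filter: drop keypoints whose semantic label is in unstable_labels.

    keypoints:       (N, 2) pixel coordinates.
    labels:          (N,) integer semantic label per keypoint.
    unstable_labels: hypothetical set of label ids treated as unreliable.
    Returns the kept keypoints and the boolean keep mask.
    """
    keep = ~np.isin(labels, list(unstable_labels))
    return keypoints[keep], keep


def mutual_nearest_neighbors(desc_a, desc_b):
    """Match unit-norm descriptors by mutual nearest neighbor on cosine similarity."""
    sim = desc_a @ desc_b.T
    nn_ab = sim.argmax(axis=1)  # best match in B for each row of A
    nn_ba = sim.argmax(axis=0)  # best match in A for each row of B
    idx_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)
```

Filtering before matching shrinks the similarity matrix, which is one plausible reading of why the abstract reports faster matching alongside the accuracy gain.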

Learning Semantic Alignment Using Global Features and Multi-scale Confidence

Learning Visually Aligned Semantic Graph for Cross-Modal Manifold Matching

Deep Dual-Stream Network with Scale Context Selection Attention Module for Semantic Segmentation

Multi-scale Matching Networks for Semantic Correspondence

Cross-domain Object Detection by Local to Global Object-Aware Feature Alignment

Joint alignment of the distribution in input and feature space for cross-domain aerial image semantic segmentation

Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

Semantic enhancement and multi-level alignment network for cross-modal retrieval

Multi-level multilingual semantic alignment for zero-shot cross-lingual transfer learning

Semantic Alignment Network for Multi-modal Emotion Recognition

Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment

A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Learning Semantic-Aware Local Features for Long Term Visual Localization

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

A Semantic Consistency Feature Alignment Object Detection Model Based on Mixed-Class Distribution Metrics

Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Learning Cross-Channel Representations for Semantic Segmentation

Multi-Stage Network With Geometric Semantic Attention for Two-View Correspondence Learning

Semantic-Aware Fine-Grained Correspondence