Abstract: Seeking reliable correspondences and then recovering camera poses from a set of putative correspondences extracted from two images of the same scene is a fundamental problem in computer vision. Recent advances have demonstrated that this problem can be effectively solved by a deep architecture based on the multi-layer perceptron, in which context normalization makes the network permutation-equivariant and embeds global information in the sparse point data. However, context normalization simply normalizes the feature maps according to their distribution and treats each correspondence equally, which makes it difficult to adequately capture the scene geometry encoded by the inliers, especially when outliers are severe. To address this issue, this paper designs a context-sensitive network based on the self-attention mechanism, termed the correspondence attention transformer (CAT), which enhances the consistent geometric information of inliers and simultaneously suppresses outliers while embedding global information. In particular, we design an attention-style structure that aggregates features from all correspondences, i.e., a spatial attention named CAT-S, which lets each correspondence exchange information with the others in the putative set. To capture contextual information more comprehensively and robustly, we also introduce a multi-head mechanism into the structure to exploit the geometric context from different aspects. Moreover, considering the high memory demand of spatial attention, we propose a covariance-normalized channel attention, CAT-C, which largely reduces memory consumption and parameter count but requires an eigenvalue decomposition in each attention block and therefore increases runtime. Both attention mechanisms realize information exchange, along the spatial or the channel dimension respectively; both contribute to constructing the geometric context among inliers and encourage the network to pay more attention to the feature subset associated with potential inliers. Extensive experiments on both indoor and outdoor datasets, covering camera pose estimation, outlier removal, and image registration, demonstrate the superiority of our method, which achieves a large performance improvement over current state-of-the-art approaches.
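
The contrast the abstract draws between context normalization, spatial attention, and channel attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the single-head formulation, the random projection matrices `wq`, `wk`, `wv`, and the sizes `N` and `C` are all assumptions made for illustration. The channel attention here is a plain softmax variant; the covariance normalization and eigenvalue decomposition of CAT-C are not reproduced.

```python
import numpy as np


def context_norm(x, eps=1e-5):
    """Normalize each channel over all N correspondences; x has shape (N, C).

    Every correspondence contributes equally to the statistics.
    """
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention(x, wq, wk, wv):
    """Single-head attention across correspondences (illustrative only).

    The attention map has shape (N, N), so memory grows quadratically in N.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
    return attn @ v, attn


def channel_attention(x, wq, wk, wv):
    """Plain softmax attention across channels (illustrative only).

    The attention map has shape (C, C), independent of N.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q.T @ k / np.sqrt(x.shape[0]), axis=-1)
    return v @ attn.T, attn


# Hypothetical sizes: N putative correspondences with C feature channels.
N, C = 500, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(N, C))
wq, wk, wv = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))

out_s, attn_s = spatial_attention(x, wq, wk, wv)
out_c, attn_c = channel_attention(x, wq, wk, wv)
print("spatial attention map:", attn_s.shape)
print("channel attention map:", attn_c.shape)
```

In this sketch all three operations are permutation-equivariant: reordering the rows of `x` reorders the outputs in the same way. The difference is that context normalization weights every correspondence equally, whereas the attention maps assign data-dependent weights. The shapes of the two maps also show why a channel-wise formulation uses less memory when `N` is much larger than `C`.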

MC-Net: Integrating Multi-level Geometric Context for Two-view Correspondence Learning

CSR-Net++: Rethinking Context Structure Representation Learning for Feature Matching

MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation

Disparity Estimation Using Multilevel and Global Information

Spatially-Aware Context Neural Networks

RANet: A relation-aware network for two-view correspondence learning

Multi-scale Matching Networks for Semantic Correspondence

BCLNet: Bilateral Consensus Learning for Two-View Correspondence Pruning

CSR-Net: Learning Adaptive Context Structure Representation for Robust Feature Correspondence

Correspondence Attention Transformer: A Context-sensitive Network for Two-view Correspondence Learning

PMA-Net: Progressive multi-stage adaptive feature learning for two-view correspondence

Point2CN: Progressive two-view correspondence learning via information fusion

NCMNet: Neighbor Consistency Mining Network for Two-View Correspondence Pruning

Multi-scale inputs and context-aware aggregation network for stereo matching

Progressive correspondence learning by effective multi-channel aggregation

End-to-End Learning of Multi-scale Convolutional Neural Network for Stereo Matching

Learning local descriptors with multi-level feature aggregation and spatial context pyramid

GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond

Multi-Scale Context Attention Network for Stereo Matching

U-Match: Exploring Hierarchy-Aware Local Context for Two-View Correspondence Learning

Learning Two-View Correspondences and Geometry Using Order-Aware Network