Abstract: Finding reliable correspondences and then recovering the camera pose from a set of putative correspondences extracted from two images of the same scene is a fundamental problem in computer vision. Recent advances have demonstrated that this problem can be solved effectively with a deep architecture based on the multi-layer perceptron, in which context normalization makes the network permutation-equivariant and embeds global information in the sparse point data. However, context normalization simply normalizes the feature maps according to their distribution and treats every correspondence equally, which makes it difficult to adequately capture the scene geometry encoded by the inliers, especially when outliers are severe. To address this issue, this paper designs a context-sensitive network based on the self-attention mechanism, termed the correspondence attention transformer (CAT), which enhances the consistent geometric information of inliers and simultaneously suppresses outliers while embedding global information. In particular, we design an attention-style structure that aggregates features from all correspondences, i.e., a spatial attention named CAT-S, which lets each correspondence exchange information with the others in the putative set. To capture contextual information more comprehensively and robustly, we also introduce a multi-head mechanism into this structure to exploit the geometric context from different aspects. Moreover, considering the high memory demand of spatial attention, we propose a covariance-normalized channel attention, CAT-C, which largely reduces memory consumption and parameter count but requires an eigenvalue decomposition in each attention block and therefore increases runtime. Both attention mechanisms realize information exchange, along the spatial or the channel dimension respectively; both contribute to constructing the geometric context among inliers and encourage the network to pay more attention to the feature subset associated with potential inliers. Extensive experiments on both indoor and outdoor datasets, covering camera pose estimation, outlier removal, and image registration, demonstrate that our method achieves a large performance improvement over current state-of-the-art approaches.
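
To make the distinction between the spatial and the channel variant concrete, the following is a minimal sketch in plain Python. It is not the paper's CAT-S or CAT-C: it uses ordinary scaled dot-product attention with identity projections (no learned query, key, or value weights), a single head, and it omits the covariance normalization and eigenvalue decomposition mentioned in the abstract. The function names, the toy sizes `N` and `C`, and the scaling factors are all assumptions chosen for illustration. What the sketch does show is the shape of the attention map in each case: N x N when correspondences attend to correspondences, and C x C when channels attend to channels.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def spatial_attention(feats):
    """feats is N x C. Every correspondence attends to all N correspondences,
    so the attention map is N x N (hypothetical stand-in, not CAT-S)."""
    c = len(feats[0])
    scores = matmul(feats, transpose(feats))                      # N x N
    scores = [[v / math.sqrt(c) for v in row] for row in scores]
    attn = softmax_rows(scores)
    return matmul(attn, feats), attn                              # output is N x C

def channel_attention(feats):
    """feats is N x C. Every channel attends to all C channels,
    so the attention map is C x C (hypothetical stand-in, not CAT-C;
    covariance normalization is omitted)."""
    n = len(feats)
    ft = transpose(feats)                                         # C x N
    scores = matmul(ft, feats)                                    # C x C
    scores = [[v / math.sqrt(n) for v in row] for row in scores]
    attn = softmax_rows(scores)
    return transpose(matmul(attn, ft)), attn                      # output is N x C

# Toy input: N correspondences, each with a C-dimensional feature (assumed sizes).
N, C = 6, 2
feats = [[math.sin(i + j) for j in range(C)] for i in range(N)]

out_s, attn_s = spatial_attention(feats)
out_c, attn_c = channel_attention(feats)

print("spatial map:", len(attn_s), "x", len(attn_s[0]))
print("channel map:", len(attn_c), "x", len(attn_c[0]))
```

Under these assumptions, the spatial map grows quadratically with the number of correspondences while the channel map depends only on the feature width, which is one plausible reading of why the abstract describes the channel variant as lighter on memory.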

Attention in Attention: Modeling Context Correlation for Efficient Video Classification

Group Contextualization for Video Recognition

PAM: Pyramid Attention Mechanism Based on Contextual Reasoning

Context-aware focal alignment network for micro-video multi-label classification

A Multi-scale Contextual Attention Mechanism for Convolutional Neural Networks

Two-Stream Video Classification with Cross-Modality Attention

Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification

Causality Compensated Attention for Contextual Biased Visual Recognition

Coarse-to-fine dual-level attention for video-text cross modal retrieval

Stand-Alone Inter-Frame Attention in Video Models

Efficient Attention: Attention with Linear Complexities

A feature-wise attention module based on the difference with surrounding features for convolutional neural networks

CAT: Learning to Collaborate Channel and Spatial Attention from Multi-Information Fusion

Context-aware Attentional Pooling (CAP) for Fine-grained Visual Classification

Efficient Spatialtemporal Context Modeling for Action Recognition

Correspondence Attention Transformer: A Context-sensitive Network for Two-view Correspondence Learning

Improving 3D Object Detection with Context-Aware and Dimensional Interaction Attention

Video Salient Object Detection via Contrastive Features and Attention Modules

Adaptive Compact Attention For Few-shot Video-to-video Translation

Long-term Temporal Context Gathering for Neural Video Compression