CSR-Net++: Rethinking Context Structure Representation Learning for Feature Matching
Xiaoxian Chen, Jiaxuan Chen
DOI: https://doi.org/10.1109/tgrs.2024.3431008
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Seeking good feature correspondences between two remote sensing (RS) images is an essential problem in RS and photogrammetry. Traditional approaches often require a predefined geometric transformation model or additional manually crafted descriptors, which significantly constrains their versatility. In this work, we adopt the recent context structure representation network (CSR-Net), which has shown promising performance in general feature matching problems, and propose modifications, named CSR-Net++, to overcome its main limitations. Specifically, CSR-Net is combined with a PointNet-like geometry estimator for global preregistration, and this estimator is sensitive to large deformations. In addition, CSR-Net learns local consensus representation through a fixed-size grid, and its grid pixelwise max-pooling operations limit its space-aware capacity. To tackle these limitations, we first introduce a pruning layer for matching that is guided by global consensus rather than by a geometric estimator. Second, to learn consensus representation directly from points, we propose a modified context structure representation (CSR) learning module comprising an independent spatial location stream and a stand-alone visual stream (VS). This decomposition separates local consensus into positional consensus and visual consensus. The proposed dual-stream representation learning not only avoids the introduction of grid anchors but also provides visual contextual priors. To demonstrate the robustness and versatility of CSR-Net++, we conducted comprehensive experiments on diverse sets of real image pairs for general feature matching. The results demonstrate the superiority of CSR-Net++ in most matching scenarios, with a 0.47%-4.70% improvement in F-score for multimodal images over existing leading methods.
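The abstract reports its gains as F-score. As a minimal sketch, assuming the standard definition (the harmonic mean of precision and recall over the putative matches labeled as inliers), the metric could be computed as below. The function name `f_score` and the boolean-mask inputs are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def f_score(pred_inlier, true_inlier):
    """F-score of an inlier/outlier labeling of putative matches.

    pred_inlier, true_inlier: equal-length sequences of truthy/falsy values,
    one entry per putative match (hypothetical input format).
    """
    if len(pred_inlier) != len(true_inlier):
        raise ValueError("masks must have the same length")

    pred = [bool(p) for p in pred_inlier]
    true = [bool(t) for t in true_inlier]

    # True positives: matches kept by the method that are real inliers.
    tp = sum(1 for p, t in zip(pred, true) if p and t)
    n_pred = sum(pred)   # matches the method kept
    n_true = sum(true)   # ground-truth inliers

    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1]
    ground_truth = [1, 0, 0, 1, 1]
    print(f_score(predicted, ground_truth))
```

In this toy case two of the three kept matches are correct and two of the three true inliers are recovered, so precision and recall are both 2/3.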