D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors Via Agent-based Transformers

Jianfeng He,Yuan Gao,Tianzhu Zhang,Zhe Zhang,Feng Wu
DOI: https://doi.org/10.1109/cvpr52729.2023.00284
2023-01-01
Abstract:Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. However, achieving robust image matching remains challenging because CNN extracted descriptors usually lack discriminative ability in texture-less regions and keypoint detectors are only good at identifying keypoints with a specific level of structure. To deal with these issues, a novel image matching method is proposed by Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agentbased Transformers (D2Former), including a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. The proposed D2Former enjoys several merits. First, the proposed CFDL module can model long-range contexts efficiently and effectively with the aid of designed descriptor agents. Second, the HKDL module can generate keypoint detectors in a hierarchical way, which is helpful for detecting keypoints with diverse levels of structures. Extensive experimental results on four challenging benchmarks show that our proposed method significantly outperforms state-of-the-art image matching methods.
What problem does this paper attempt to address?