Learning Transformation-Predictive Representations for Detection and Description of Local Features.

Zihao Wang, Chunxu Wu, Yifei Yang, Zhen Li
DOI: https://doi.org/10.1109/cvpr52729.2023.01103
2023-01-01
Computer Vision and Pattern Recognition
Abstract: The task of keypoint detection and description is to estimate the stable locations and discriminative representations of local features, a fundamental task in visual applications. However, the coarse hard positive and negative labels generated from one-to-one correspondences among images may introduce indistinguishable samples, such as false positives or false negatives, which act as inconsistent supervision. Such false samples, mixed with hard samples, prevent neural networks from learning descriptions for more accurate matching. To tackle this challenge, we propose to learn transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponding views of the same 3D point (landmark) without using any negative sample pairs, while avoiding collapsed solutions. Furthermore, we adopt self-supervised generation learning and curriculum learning to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels help overcome the training bottleneck (derived from the label noise of false positives) and facilitate model training under a stronger transformation paradigm. Our self-supervised training pipeline greatly decreases the computation load and memory usage, and outperforms the state of the art on the standard image matching benchmarks by noticeable margins, demonstrating excellent generalization capability on multiple downstream tasks.
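The two ideas in the abstract, a positive-only similarity objective and softened positive labels, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `positive_only_loss` and `soften_labels`, the squared-error form of the loss, and the linear blending schedule controlled by `alpha` are all assumptions made for illustration. The sketch only evaluates a loss value and leaves out whatever mechanism the paper uses to avoid collapsed solutions, which the abstract does not specify.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def soften_labels(predicted_similarity, alpha):
    """Blend the hard positive label (1.0) with a predicted similarity.

    Hypothetical curriculum: alpha = 0 keeps the hard label, and
    alpha = 1 fully trusts the prediction. The linear blend is an
    assumption, not the schedule used in the paper.
    """
    predicted_similarity = np.asarray(predicted_similarity, dtype=float)
    return (1.0 - alpha) * 1.0 + alpha * predicted_similarity


def positive_only_loss(desc_a, desc_b, targets=None):
    """Loss over positive pairs only; no negative pairs are sampled.

    desc_a, desc_b: (N, D) descriptors of the same N landmarks seen in
    two views. targets: (N,) soft similarity targets in [0, 1]; None
    falls back to hard labels of 1.0.
    """
    cos = np.sum(l2_normalize(desc_a) * l2_normalize(desc_b), axis=-1)
    if targets is None:
        targets = np.ones_like(cos)
    # Squared error between cosine similarity and target (assumed form).
    return float(np.mean((targets - cos) ** 2))


rng = np.random.default_rng(0)
view_a = rng.normal(size=(4, 8))
view_b = view_a + 0.1 * rng.normal(size=(4, 8))  # slightly perturbed view
targets = soften_labels(np.full(4, 0.9), alpha=0.5)
loss = positive_only_loss(view_a, view_b, targets)
```

With hard labels the loss pushes every corresponding pair toward cosine similarity 1, so a mislabeled pair (a false positive) is penalized as heavily as a true match. Lowering the target for such pairs, as `soften_labels` does here, reduces that penalty.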