Abstract: Deep learning is becoming the most widely used technology for multi-sensor data fusion. Semantic correspondence has recently emerged as a foundational task that enables a range of downstream applications, such as style or appearance transfer, robot manipulation, and pose estimation, by providing robust, semantically informed correspondences between RGB images. However, current representations produced by self-supervised learning and generative models are often limited in their ability to capture the geometric structure of objects, which is essential for matching fine details correctly in semantic correspondence applications. Furthermore, efficiently fusing the features from these two kinds of models remains a challenge, and integrating them well is crucial for improving the expressive power of models across tasks. To tackle these issues, our key idea is to integrate depth information, obtained from depth estimation or depth sensors, into the feature maps and to use learnable weights for feature fusion. First, depth information is used to model pixel-wise depth distributions, assigning relative depth weights to the feature maps so that the network can perceive an object's structural information. Then, under a contrastive learning objective, a set of weights is optimized to combine the feature maps from self-supervised learning and generative models. Depth features are thus embedded naturally into the feature maps, guiding the network to learn the geometric structure of objects and alleviating depth ambiguity. Experiments on the SPair-71K and AP-10K datasets show that the proposed method achieves scores of 81.8 and 83.3, respectively, on the percentage of correct keypoints (PCK) at the 0.1 threshold.
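To make the reported metric concrete, the following minimal sketch computes PCK at a threshold of 0.1 for a single image pair. It assumes the common convention that a predicted keypoint counts as correct when it lies within alpha times the longer side of the object bounding box from the ground-truth keypoint; the abstract does not state which normalization is used, so the `bbox_size` argument, the function name `pck`, and the toy coordinates are illustrative assumptions rather than the authors' evaluation code.

```python
import math


def pck(pred, gt, bbox_size, alpha=0.1):
    """Percentage of predicted keypoints within alpha * max(bbox_size) of the ground truth.

    pred, gt  : lists of (x, y) keypoints in the target image, in the same order.
    bbox_size : (width, height) of the object bounding box (assumed normalization).
    alpha     : threshold level, 0.1 for PCK@0.1.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if not gt:
        return 0.0
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return 100.0 * correct / len(gt)


# Toy example with made-up coordinates: two of three keypoints fall within 10 pixels.
pred = [(10.0, 10.0), (50.0, 52.0), (90.0, 40.0)]
gt = [(12.0, 11.0), (50.0, 50.0), (70.0, 40.0)]
score = pck(pred, gt, bbox_size=(100.0, 80.0))
```

In this toy case the threshold is 10 pixels, so the third keypoint (20 pixels off) is counted as incorrect. Dataset-level scores are typically averaged over many image pairs, and the exact averaging scheme is another detail the abstract leaves open.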
Our approach not only demonstrates significant advantages in the experimental results but also introduces a depth awareness module and a learnable feature fusion module. Together, these modules enhance the understanding of object structure through depth information and make full use of features from various pre-trained models, opening new possibilities for applying deep learning to RGB and depth data fusion. In future work, we will focus on accelerating inference and making the model more lightweight so that it can run faster.
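The two modules named above can be illustrated with a rough sketch. It is not the authors' architecture: it assumes that relative depth weights are obtained by min-max normalizing a depth map, that they modulate a feature map multiplicatively, and that fusion is a softmax-weighted sum over same-shaped feature maps. The function names (`relative_depth_weights`, `depth_aware`, `fuse`), the array shapes, and the fixed logits are all hypothetical; in the described method the fusion weights would be learned under a contrastive objective, which is omitted here.

```python
import numpy as np


def relative_depth_weights(depth, eps=1e-8):
    """Min-max normalize an (H, W) depth map to relative weights in [0, 1]."""
    d_min, d_max = depth.min(), depth.max()
    return (depth - d_min) / (d_max - d_min + eps)


def depth_aware(features, depth):
    """Scale a (C, H, W) feature map per pixel by 1 + relative depth weight (assumed form)."""
    weights = relative_depth_weights(depth)
    return features * (1.0 + weights)[None, :, :]


def fuse(feature_maps, logits):
    """Softmax-weighted sum of same-shaped feature maps; logits stand in for learnable weights."""
    logits = np.asarray(logits, dtype=np.float64)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    fused = sum(w * f for w, f in zip(weights, feature_maps))
    return fused, weights


# Random stand-ins for features from a self-supervised model and a generative model.
rng = np.random.default_rng(0)
ssl_feat = rng.standard_normal((4, 8, 8))
gen_feat = rng.standard_normal((4, 8, 8))
depth = rng.uniform(0.5, 5.0, size=(8, 8))

fused, fusion_weights = fuse(
    [depth_aware(ssl_feat, depth), depth_aware(gen_feat, depth)],
    logits=[0.0, 0.0],  # equal logits give equal fusion weights
)
```

The softmax keeps the fusion weights positive and summing to one, which is one simple way to make them comparable across feature sources. Other parameterizations are equally plausible given what the abstract states.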

A Depth Awareness and Learnable Feature Fusion Network for Enhanced Geometric Perception in Semantic Correspondence
