Abstract: Despite recent attention and exploration of depth for various tasks, it is still an unexplored modality for weakly-supervised object detection (WSOD). We propose an amplifier method for enhancing the performance of WSOD by integrating depth information. Our approach can be applied to any WSOD method based on multiple-instance learning, without necessitating additional annotations or inducing large computational expenses. Our proposed method employs a monocular depth estimation technique to obtain hallucinated depth information, which is then incorporated into a Siamese WSOD network using contrastive loss and fusion. By analyzing the relationship between language context and depth, we calculate depth priors to identify the bounding box proposals that may contain an object of interest. These depth priors are then utilized to update the list of pseudo ground-truth boxes, or adjust the confidence of per-box predictions. Our proposed method is evaluated on six datasets (COCO, PASCAL VOC, Conceptual Captions, Clipart1k, Watercolor2k, and Comic2k) by implementing it on top of two state-of-the-art WSOD methods, and we demonstrate a substantial enhancement in performance.
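The depth-prior step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the helper names `box_mean_depth` and `adjust_scores`, the use of a single per-class depth range as the prior, the mean depth inside a box as its depth statistic, and the fixed `penalty` factor are all hypothetical.

```python
from statistics import mean


def box_mean_depth(depth_map, box):
    """Average depth inside a box (x0, y0, x1, y1), with end-exclusive bounds."""
    x0, y0, x1, y1 = box
    values = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return mean(values)


def adjust_scores(depth_map, boxes, scores, depth_range, penalty=0.5):
    """Down-weight boxes whose mean depth falls outside the class depth range.

    depth_range is a hypothetical (low, high) prior for one class; penalty is
    an assumed multiplicative factor applied to out-of-range boxes.
    """
    low, high = depth_range
    adjusted = []
    for box, score in zip(boxes, scores):
        depth = box_mean_depth(depth_map, box)
        adjusted.append(score if low <= depth <= high else score * penalty)
    return adjusted


# Toy 4x4 normalized depth map: left half is near (0.2), right half is far (0.9).
depth_map = [[0.2, 0.2, 0.9, 0.9] for _ in range(4)]
boxes = [(0, 0, 2, 4), (2, 0, 4, 4)]
scores = [0.6, 0.8]

# Assumed prior: this class tends to appear at normalized depth 0.0 to 0.5.
adjusted = adjust_scores(depth_map, boxes, scores, depth_range=(0.0, 0.5))
```

In this toy case the near box keeps its score while the far box is penalized, so the ranking of the two proposals flips. The same in-range test could instead be used to filter a list of pseudo ground-truth boxes.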

What problem does this paper attempt to address?

The paper addresses a central challenge in Weakly-Supervised Object Detection (WSOD): how to improve the accuracy of object localization when only image-level labels are provided. Specifically, the authors propose enhancing WSOD performance with depth information, here "hallucinated" depth estimated from monocular images. The main problems the paper attempts to solve are:

1. **Utilizing Depth Information to Enhance WSOD**: Traditional WSOD methods rely solely on the appearance information from RGB images, which may not be sufficient to accurately locate objects in complex, cluttered environments. The proposed method integrates depth information to improve detection accuracy.
2. **No Additional Annotations or Significant Increase in Computational Cost**: For practicality and broad applicability, the method does not rely on additional object bounding box annotations and aims to minimize computational overhead during training.
3. **Fusion of Depth Information**: The authors developed a Siamese network architecture based on contrastive loss to fuse information from RGB images with the depth maps predicted from them. This approach helps improve feature representation learning and also allows the detection and classification scores from both modalities to be combined through a Late Fusion strategy.
4. **Using Depth Prior Knowledge**: By analyzing the relationship between linguistic context and depth, the authors compute the approximate depth ranges where different categories of objects are likely to appear, and use them to identify candidate bounding boxes that may contain objects of interest. This depth prior knowledge is used to update the pseudo ground-truth bounding box list or to adjust the confidence prediction for each bounding box.
5. **Applicable to Various WSOD Methods**: The proposed method can be applied to different WSOD methods based on multiple-instance learning, improving their performance without modifying the existing framework.
6. **Evaluation and Validation**: The method has been tested on multiple datasets (including COCO, PASCAL VOC, and Conceptual Captions) and has achieved significant performance improvements on two state-of-the-art WSOD methods (MIST and SoS-WSOD).

In summary, the main goal of this paper is to explore how to effectively integrate depth information into the WSOD task to improve the accuracy and robustness of object detection, especially in situations with limited annotations.
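The Late Fusion idea in item 3 can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: it assumes a WSDDN-style multiple-instance scoring head (a softmax over classes multiplied by a softmax over proposals) and assumes fusion is a weighted average of the RGB and depth branch logits. The names `fuse` and `mil_scores` and the `weight` parameter are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def fuse(rgb_logits, depth_logits, weight=0.5):
    """Assumed late fusion: weighted average of per-proposal, per-class logits."""
    return [
        [weight * r + (1.0 - weight) * d for r, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb_logits, depth_logits)
    ]


def mil_scores(cls_logits, det_logits):
    """WSDDN-style scoring on [proposal][class] logits.

    Classification scores are a softmax over classes for each proposal;
    detection scores are a softmax over proposals for each class. Their
    product gives per-box scores, summed over proposals for image scores.
    """
    cls = [softmax(row) for row in cls_logits]
    det = transpose([softmax(column) for column in transpose(det_logits)])
    box_scores = [[c * d for c, d in zip(c_row, d_row)] for c_row, d_row in zip(cls, det)]
    image_scores = [sum(column) for column in transpose(box_scores)]
    return box_scores, image_scores


# Toy logits for 3 proposals and 2 classes from each branch (made-up values).
rgb_cls = [[2.0, 0.1], [0.3, 1.5], [0.2, 0.2]]
rgb_det = [[1.0, 0.0], [0.2, 2.0], [0.1, 0.1]]
depth_cls = [[1.5, 0.0], [0.1, 1.0], [0.5, 0.4]]
depth_det = [[2.0, 0.1], [0.0, 1.0], [0.3, 0.2]]

box_scores, image_scores = mil_scores(
    fuse(rgb_cls, depth_cls), fuse(rgb_det, depth_det)
)
```

Because only image-level labels are available, the image-level scores are what a multiple-instance loss would be trained against, while the per-box scores rank the proposals at test time.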

Boosting Weakly Supervised Object Detection using Fusion and Priors from Hallucinated Depth

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection.

Salient Object Detection with High-Level Prior Based on Bayesian Fusion.

Weakly Supervised Instance Segmentation Using Multi-Prior Fusion.

Depth incorporating with color improves salient object detection

Enhancing Weakly-Supervised Object Detection on Static Images through (Hallucinated) Motion

Depth Awakens: A Depth-perceptual Attention Fusion Network for RGB-D Camouflaged Object Detection

Depth Privileged Object Detection in Indoor Scenes Via Deformation Hallucination.

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Hierarchical Weighting Network with Depth Cues for Real-Time Video Salient Object Detection

Depth-Enhancement Network for Monocular 3D object detection

Boosting Box-supervised Instance Segmentation with Pseudo Depth

HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection

ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection

Supervision by Fusion: Towards Unsupervised Learning of Deep Salient Object Detector

Depth Privileged Scene Recognition via Dual Attention Hallucination

Contrast Prior and Fluid Pyramid Integration for RGBD Salient Object Detection

Learning Adaptive Fusion Bank for Multi-modal Salient Object Detection

FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection