Boosting Weakly Supervised Object Detection using Fusion and Priors from Hallucinated Depth

Cagri Gungor, Adriana Kovashka
2023-11-08
Abstract: Despite recent attention and exploration of depth for various tasks, it is still an unexplored modality for weakly-supervised object detection (WSOD). We propose an amplifier method for enhancing the performance of WSOD by integrating depth information. Our approach can be applied to any WSOD method based on multiple-instance learning, without necessitating additional annotations or inducing large computational expenses. Our proposed method employs a monocular depth estimation technique to obtain hallucinated depth information, which is then incorporated into a Siamese WSOD network using contrastive loss and fusion. By analyzing the relationship between language context and depth, we calculate depth priors to identify the bounding box proposals that may contain an object of interest. These depth priors are then utilized to update the list of pseudo ground-truth boxes, or adjust the confidence of per-box predictions. Our proposed method is evaluated on six datasets (COCO, PASCAL VOC, Conceptual Captions, Clipart1k, Watercolor2k, and Comic2k) by implementing it on top of two state-of-the-art WSOD methods, and we demonstrate a substantial enhancement in performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a central challenge in Weakly-Supervised Object Detection (WSOD): how to localize objects accurately when only image-level labels are available. Specifically, the authors propose to improve WSOD performance with depth information that is "hallucinated", that is, estimated from monocular RGB images by a depth estimation model. The main problems the paper attempts to solve are:

1. **Utilizing Depth Information to Enhance WSOD**: Traditional WSOD methods rely solely on appearance information from RGB images, which may not be sufficient to locate objects accurately in complex, cluttered scenes. The proposed method adds depth information to improve detection accuracy.
2. **No Additional Annotations or Significant Increase in Computational Cost**: For practicality and broad applicability, the method does not rely on additional bounding-box annotations and aims to keep computational overhead during training small.
3. **Fusion of Depth Information**: The authors develop a Siamese network trained with a contrastive loss to combine information from RGB images with the depth maps predicted from them. This helps feature representation learning and also allows detection and classification scores from both modalities to be combined through a late fusion strategy.
4. **Using Depth Prior Knowledge**: By analyzing the relationship between language context and depth, the authors compute the approximate depth ranges in which objects of each category are likely to appear, and use them to identify candidate bounding boxes that may contain an object of interest. These depth priors are used either to update the list of pseudo ground-truth boxes or to adjust the confidence of per-box predictions.
5. **Applicable to Various WSOD Methods**: The proposed method can be applied on top of different WSOD methods based on multiple-instance learning, without redesigning the underlying framework.
6. **Evaluation and Validation**: The method has been tested on multiple datasets (including COCO, PASCAL VOC, and Conceptual Captions) and achieves significant performance improvements when implemented on top of two state-of-the-art WSOD methods (MIST and SoS-WSOD).

In summary, the main goal of this paper is to explore how to integrate depth information effectively into the WSOD task to improve the accuracy and robustness of object detection, especially when annotations are limited.
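To make the late-fusion and depth-prior ideas more concrete, the following is a minimal Python sketch, not the authors' implementation. All function names (`box_mean_depth`, `late_fusion`, `apply_depth_prior`), the fusion `weight`, the multiplicative `penalty`, and the hard-coded depth range are assumptions made for illustration. In the paper the depth priors are derived from the relationship between language context and depth, and the exact fusion and re-scoring rules may differ.

```python
from statistics import mean


def box_mean_depth(depth_map, box):
    """Mean estimated depth inside a box (x0, y0, x1, y1); x1 and y1 are exclusive."""
    x0, y0, x1, y1 = box
    return mean(depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1))


def late_fusion(rgb_scores, depth_scores, weight=0.5):
    """Weighted average of per-box scores from the RGB and depth branches.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return [weight * r + (1.0 - weight) * d
            for r, d in zip(rgb_scores, depth_scores)]


def apply_depth_prior(scores, box_depths, depth_range, penalty=0.5):
    """Down-weight boxes whose mean depth falls outside the class depth range.

    The multiplicative `penalty` is an illustrative choice only.
    """
    low, high = depth_range
    return [s if low <= d <= high else s * penalty
            for s, d in zip(scores, box_depths)]


# Toy 4x4 hallucinated depth map (0 = near, 1 = far) and two box proposals.
depth_map = [
    [0.1, 0.1, 0.9, 0.9],
    [0.1, 0.1, 0.9, 0.9],
    [0.5, 0.5, 0.9, 0.9],
    [0.5, 0.5, 0.9, 0.9],
]
boxes = [(0, 0, 2, 2), (2, 0, 4, 4)]

box_depths = [box_mean_depth(depth_map, b) for b in boxes]
fused = late_fusion(rgb_scores=[0.8, 0.6], depth_scores=[0.6, 0.8])

# Assumed prior for one class: objects tend to appear at depths in [0.0, 0.5].
adjusted = apply_depth_prior(fused, box_depths, depth_range=(0.0, 0.5))
print(box_depths, fused, adjusted)
```

In this toy case both proposals receive the same fused score, and the depth prior breaks the tie in favor of the nearer box. The same per-box test could instead be used to filter the list of pseudo ground-truth boxes, which is the other use of depth priors mentioned above.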