Abstract: Weakly supervised learning plays a pivotal role in object detection. Weakly supervised object detection (WSOD) significantly reduces annotation costs by relying on image-level labels. However, WSOD methods have notable limitations. They tend to identify only the most easily recognizable local regions of a target, which makes it difficult to delineate target boundaries accurately. Moreover, when multiple instances of the same class appear in adjacent locations, it is hard to distinguish the individual objects. In addition, the complex backgrounds and dense distribution of targets in remote sensing images (RSI) further increase the difficulty of weakly supervised detection. To address these issues, we propose the Multi-View Contextual Adaptation Network (VCANet). Building on the classic Online Instance Classifier Refinement (OICR) framework, we incorporate contextual adaptation perception within a multi-view learning framework and integrate a pseudo-label filtering process. Contextual adaptation perception uses information from the surrounding environment to enhance localization, guiding the model to prioritize target objects by referring to their spatially neighbouring pixels. Multi-view learning imposes additional constraints from diverse perspectives, thereby revealing objects that might be overlooked under the weak supervision of a single view. The pseudo-label filtering process eliminates inaccurate pseudo-labels by identifying reliable foregrounds, mitigating overlapping proposals during label propagation. On the challenging NWPU VHR-10.v2 and DIOR datasets, we achieve promising results with mAP of 62.3% and 28.2%, respectively, surpassing existing benchmarks.
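As a rough illustration of how pseudo-label filtering could interact with label propagation in an OICR-style refinement step, the following is a minimal Python sketch. It is not VCANet's actual procedure: the function names, the thresholds (`fg_iou`, `seed_score`, `seed_iou`), and the specific filtering rule (a proposal counts as a reliable foreground seed only if it scores highly and does not overlap an earlier seed) are all assumptions made for illustration.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def propagate_labels(boxes, scores, fg_iou=0.5, seed_score=0.3, seed_iou=0.3):
    """Pick reliable seeds for one class, then propagate pseudo-labels.

    Hypothetical filtering rule (an assumption, not taken from the paper):
    a proposal becomes a seed only if its score is at least `seed_score`
    and its IoU with every previously chosen seed is below `seed_iou`.
    Returns the seed indices and, per proposal, the index of the seed it
    is assigned to (-1 means background).
    """
    seeds = []
    for i in np.argsort(-scores):  # visit proposals from highest score down
        if scores[i] < seed_score:
            break
        if all(iou_one_to_many(boxes[i], boxes[[s]])[0] < seed_iou for s in seeds):
            seeds.append(int(i))

    labels = np.full(len(boxes), -1, dtype=int)
    best_iou = np.zeros(len(boxes))
    for k, s in enumerate(seeds):
        ious = iou_one_to_many(boxes[s], boxes)
        # A proposal inherits the label of the seed it overlaps most,
        # provided that overlap reaches the foreground threshold.
        take = (ious >= fg_iou) & (ious > best_iou)
        labels[take] = k
        best_iou[take] = ious[take]
    return seeds, labels


if __name__ == "__main__":
    boxes = np.array(
        [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [50, 50, 60, 60]],
        dtype=float,
    )
    scores = np.array([0.9, 0.8, 0.7, 0.1])
    print(propagate_labels(boxes, scores))
```

In this toy input, the second box overlaps the first heavily, so under the assumed rule it is treated as part of the same instance and not as a new seed. The third box is a separate adjacent instance and becomes its own seed, and the low-scoring fourth box stays background.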

Advance One-Shot Multispectral Instance Detection With Text's Supervision

Cross-domain Multi-modal Few-shot Object Detection via Rich Text

Multi-view contextual adaptation network for weakly supervised object detection in remote sensing images

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

Multispectral Object Detection Based on Multilevel Feature Fusion and Dual Feature Modulation

Adaptive Context- and Scale-Aware Aggregation with Feature Alignment for One-Shot Object Detection

TMCFN: Text-Supervised Multidimensional Contrastive Fusion Network for Hyperspectral and LiDAR Classification

DiffCLIP: Few-shot Language-driven Multimodal Classifier

DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing

Instance Mining with Class Feature Banks for Weakly Supervised Object Detection

MMF-CLIP: An Image-Text Multimodal Semantic Segmentation Method for Remote Sensing Images

Unsupervised Modality Adaptation with Text-to-Image Diffusion Models for Semantic Segmentation

MOST: A Multi-Oriented Scene Text Detector with Localization Refinement

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Cross-modality interaction for few-shot multispectral object detection with semantic knowledge

Solo-to-Collaborative Dual-Attention Network for One-Shot Object Detection in Remote Sensing Images

Delving into Out-of-Distribution Detection with Vision-Language Representations

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Multi-Orientation Scene Text Detection with Adaptive Clustering

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

CECS-CLIP: Fusing Domain Knowledge for Rare Wildlife Detection Model