Abstract: This paper studies the challenging cross-domain few-shot object detection (CD-FSOD), aiming to develop an accurate object detector for novel domains with minimal labeled examples. While transformer-based open-set detectors, such as DE-ViT, show promise in traditional few-shot object detection, their generalization to CD-FSOD remains unclear: 1) can such open-set detection methods easily generalize to CD-FSOD? 2) If not, how can models be enhanced when facing huge domain gaps? To answer the first question, we employ measures including style, inter-class variance (ICV), and indefinable boundaries (IB) to understand the domain gap. Based on these measures, we establish a new benchmark named CD-FSOD to evaluate object detection methods, revealing that most of the current approaches fail to generalize across domains. Technically, we observe that the performance decline is associated with our proposed measures: style, ICV, and IB. Consequently, we propose several novel modules to address these issues. First, the learnable instance features align initial fixed instances with target categories, enhancing feature distinctiveness. Second, the instance reweighting module assigns higher importance to high-quality instances with slight IB. Third, the domain prompter encourages features resilient to different styles by synthesizing imaginary domains without altering semantic contents. These techniques collectively contribute to the development of the Cross-Domain Vision Transformer for CD-FSOD (CD-ViTO), significantly improving upon the base DE-ViT. Experimental results validate the efficacy of our model.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper investigates the challenges of Cross-Domain Few-Shot Object Detection (CD-FSOD), aiming to develop a model that can accurately detect objects in new domains using only a few annotated examples. Although Transformer-based open-set detectors (such as DE-ViT) perform well in traditional few-shot object detection, their generalization ability in CD-FSOD remains unclear. Specifically, the paper attempts to answer the following two questions:

1. **Can open-set detection methods easily generalize to CD-FSOD?**
2. **If not, how can the model be enhanced when facing significant domain gaps?**

To answer the first question, the authors employed various metrics, including style, Inter-Class Variance (ICV), and Indefinable Boundary (IB), to understand the domain gap. Based on these metrics, they established a new benchmark dataset, CD-FSOD, to evaluate object detection methods, and found that most existing methods perform poorly in cross-domain tasks.

To answer the second question, the authors proposed several novel modules to address these issues:

- **Learnable Instance Features**: Enhancing the discriminative power of features by aligning the initially fixed instance features with the target categories.
- **Instance Re-weighting Module**: Assigning higher weights to high-quality instances with slight IB, thereby mitigating the challenges posed by IB.
- **Domain Prompting**: Enhancing the model's robustness to different styles by synthesizing virtual "domains" while maintaining semantic content consistency.

These techniques collectively contribute to the development of a new Cross-Domain Vision Transformer (CD-ViTO), significantly improving the performance of the baseline DE-ViT. Experimental results validate the effectiveness of this model.
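
The idea behind instance re-weighting can be illustrated with a minimal sketch. It is not the paper's implementation: the summary does not say how CD-ViTO computes its weights, so the example assumes each support instance already has a scalar quality score (higher meaning a cleaner instance with slighter IB) and combines the instance features into a class prototype using softmax weights. The names `softmax`, `weighted_prototype`, `instance_features`, and `quality_scores` are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_prototype(instance_features, quality_scores):
    """Combine per-instance feature vectors into one class prototype.

    Assumption (not from the paper): quality_scores are given, one per
    instance; in a real model they would come from a learned module.
    """
    if not instance_features or len(instance_features) != len(quality_scores):
        raise ValueError("need one quality score per instance feature")
    weights = softmax(quality_scores)
    dim = len(instance_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, instance_features))
        for d in range(dim)
    ]


# Two toy 2-D instance features for the same category.
features = [[1.0, 0.0], [0.0, 1.0]]

# Equal scores reduce to a plain average of the instances.
uniform = weighted_prototype(features, [0.0, 0.0])

# A much higher score for the first instance pulls the prototype toward it.
skewed = weighted_prototype(features, [10.0, 0.0])
```

With equal scores the result is the ordinary mean of the instance features, which corresponds to treating every support instance as equally reliable. Raising one score shifts the prototype toward that instance, which is the effect the summary attributes to the re-weighting module.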

Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Joint Feature-Level And Pixel-Level Domain Adaption For Object Detection In The Wild

A broader study of cross-domain few-shot object detection

Few-Shot Object Detection in Unseen Domains

Decoupled DETR For Few-shot Object Detection

Weakly Supervised Few-Shot Object Detection with DETR

Few-Shot Object Detection: Research Advances and Challenges

Context-Transformer: Tackling Object Confusion for Few-Shot Detection

Cross-Domain Hyperspectral Image Classification Based on Transformer

Cross-domain Multi-modal Few-shot Object Detection via Rich Text

Semantic Enhanced Few-shot Object Detection

Few-Shot Object Detection in Remote Sensing: Lifting the Curse of Incompletely Annotated Novel Objects

Few-Shot Object Detection in Remote Sensing Image Interpretation: Opportunities and Challenges

Detect Everything with Few Examples

Few-shot Object Detection via Improved Classification Features

Few-Shot Object Detection with Sparse Context Transformers

Towards Discriminative and Transferable One-Stage Few-Shot Object Detectors

Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining

Few-Shot Object Detection in Remote-Sensing Images via Label-Consistent Classifier and Gradual Regression