Abstract: Object detection plays a crucial role in scene understanding and has extensive practical applications. In remote sensing object detection, both detection accuracy and robustness are of significant concern. Existing methods rely heavily on sophisticated adversarial training strategies that tend to improve robustness at the expense of accuracy. However, detection robustness is not always indicative of improved accuracy. Therefore, in this article, we investigate how to enhance robustness while preserving high accuracy, or even improve both simultaneously, using simple vanilla adversarial training or none at all. In pursuit of a solution, we first conduct an exploratory investigation by shifting our attention from adversarial training, referred to as adversarial fine-tuning, to adversarial pretraining. Specifically, we propose a novel pretraining paradigm, namely, structured adversarial self-supervised (SASS) pretraining, to strengthen both clean accuracy and adversarial robustness for object detection in remote sensing images. At a high level, SASS pretraining aims to unify adversarial learning and self-supervised learning in pretraining and to encode structured knowledge into pretrained representations for powerful transferability to downstream detection. Moreover, to fully explore the inherent robustness of vision Transformers and improve their pretraining efficiency, we adopt the recent masked image modeling (MIM) as the pretext task and instantiate SASS pretraining as a concise end-to-end framework, named structured adversarial MIM (SA-MIM). SA-MIM consists of two pivotal components: structured adversarial attack and structured MIM (S-MIM). The former establishes structured adversaries in the context of adversarial pretraining, while the latter introduces a structured local-sampling global-masking strategy to adapt to hierarchical encoder architectures.
Comprehensive experiments on three datasets demonstrate that the proposed pretraining paradigm significantly outperforms previous counterparts for remote sensing object detection. More importantly, with or without adversarial fine-tuning, it improves detection accuracy and robustness simultaneously, as expected, which promises to reduce the dependence on complicated adversarial fine-tuning.

Aligned Unsupervised Pretraining of Object Detectors with Self-training

Unsupervised Object Detection Pretraining with Joint Object Priors Generation and Detector Learning

DETReg: Unsupervised Pretraining with Region Priors for Object Detection

AlignDet: Aligning Pre-training and Fine-tuning in Object Detection

Rethinking Training from Scratch for Object Detection

Unsupervised Pretraining for Object Detection by Patch Reidentification

Semi-Supervised Self-Training of Object Detection Models

Self-Supervised Pretraining for RGB-D Salient Object Detection

Label-efficient object detection via region proposal network pre-training

Proposal Learning for Semi-Supervised Object Detection

Rethinking Pre-training and Self-training

CISO: Co-iteration Semi-Supervised Learning for Visual Object Detection

ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection

Delving into the Pre-training Paradigm of Monocular 3D Object Detection

Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection

Object Detection from Scratch with Deep Supervision

An Analysis of Pre-Training on Object Detection

Self-supervised Training of Proposal-based Segmentation via Background Prediction

Unsupervised learning based object detection using Contrastive Learning

Self-Supervised Pre-Training Joint Framework: Assisting Lightweight Detection Network for Underwater Object Detection

Structured Adversarial Self-Supervised Learning for Robust Object Detection in Remote Sensing Images