Abstract: DEtection TRansformer (DETR) for object detection reaches competitive performance compared with Faster R-CNN via a transformer encoder-decoder architecture. However, with transformers trained from scratch, DETR needs large-scale training data and an extremely long training schedule even on the COCO dataset. Inspired by the great success of pre-training transformers in natural language processing, we propose a novel pretext task named random query patch detection in Unsupervised Pre-training DETR (UP-DETR). Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches from the input image. During the pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off classification and localization preferences in the pretext task, we find that freezing the CNN backbone is the prerequisite for the success of pre-training transformers. (2) To perform multi-query localization, we develop UP-DETR with multi-query patch detection with attention mask. Besides, UP-DETR also provides a unified perspective for fine-tuning object detection and one-shot detection tasks. In our experiments, UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation. Code and pre-training models: https://github.com/dddzg/up-detr.
What problem does this paper attempt to address?
The paper asks how to improve the performance of DEtection TRansformer (DETR) on object detection through unsupervised pre-training, especially on small datasets such as PASCAL VOC. Although DETR performs well on large-scale datasets such as COCO, it requires a large amount of training data and an extremely long training schedule, and it performs poorly on small-scale datasets. To address this, the authors propose Unsupervised Pre-training DETR (UP-DETR). It introduces a novel unsupervised pretext task, random query patch detection, to pre-train the Transformer module in DETR, which accelerates convergence and improves detection accuracy.
### Main contributions
1. **A new unsupervised pre-training task**: Random query patch detection pre-trains the Transformer module in DETR without any manual annotation.
2. **Solutions for multi-task learning and multi-query localization**: Freezing the CNN backbone lets UP-DETR balance classification and localization during pre-training, and an attention mask lets it localize multiple query patches at once.
3. **A unified fine-tuning perspective**: UP-DETR can be fine-tuned for object detection or one-shot detection simply by changing the input of the decoder.
4. **Significantly better performance than DETR**: Experiments show that UP-DETR converges faster and reaches higher average precision on both PASCAL VOC and COCO.
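The attention mask in contribution 2 can be illustrated with a minimal sketch. It assumes the object queries are split evenly into groups, one group per query patch, and that the mask is additive (0 where attention is allowed, negative infinity where it is blocked). The function name `group_attention_mask` and the group layout are illustrative assumptions, not taken from the UP-DETR code.

```python
import math


def group_attention_mask(num_queries, num_patches):
    """Build an additive self-attention mask for grouped object queries.

    Assumption: queries are split into `num_patches` equal, contiguous groups.
    Entry [i][j] is 0.0 if queries i and j share a group, else -inf.
    """
    if num_patches <= 0 or num_queries % num_patches != 0:
        raise ValueError("num_queries must be a positive multiple of num_patches")
    group_size = num_queries // num_patches
    mask = []
    for i in range(num_queries):
        row = []
        for j in range(num_queries):
            same_group = (i // group_size) == (j // group_size)
            row.append(0.0 if same_group else -math.inf)
        mask.append(row)
    return mask


# Six queries serving three query patches: groups {0,1}, {2,3}, {4,5}.
mask = group_attention_mask(num_queries=6, num_patches=3)
```

Adding such a mask to the attention logits before the softmax would keep each group of queries from attending to queries assigned to a different patch.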
### Pre-training process
- **Encoder**: A CNN backbone extracts the visual representation of the image. Two-dimensional positional encodings are added, and the result is passed to the multi-layer Transformer encoder.
- **Decoder**: Multiple query patches are randomly cropped from the input image, and their coordinates, widths and heights are recorded as ground truth. The features of these query patches are passed to the Transformer decoder, which is trained to predict the bounding box of each query patch in the input image.
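The patch sampling step described above can be sketched as follows. This is a minimal illustration that assumes targets are stored as normalized (center x, center y, width, height), as is conventional for DETR-style models. The size range (`min_frac`, `max_frac`) and the function name `sample_query_patches` are hypothetical and not taken from the paper.

```python
import random


def sample_query_patches(img_w, img_h, num_patches, rng, min_frac=0.1, max_frac=0.5):
    """Sample random crop boxes and their ground-truth targets.

    Returns a list of (crop, target) pairs:
      crop   = (x0, y0, x1, y1) in pixels, used to cut the query patch
      target = (cx, cy, w, h) normalized to [0, 1], used as the box label
    min_frac and max_frac are illustrative bounds on the patch size.
    """
    patches = []
    for _ in range(num_patches):
        w = rng.uniform(min_frac, max_frac) * img_w
        h = rng.uniform(min_frac, max_frac) * img_h
        x0 = rng.uniform(0, img_w - w)  # keep the patch inside the image
        y0 = rng.uniform(0, img_h - h)
        crop = (x0, y0, x0 + w, y0 + h)
        target = ((x0 + w / 2) / img_w, (y0 + h / 2) / img_h, w / img_w, h / img_h)
        patches.append((crop, target))
    return patches


rng = random.Random(0)  # seeded for reproducibility
patches = sample_query_patches(640, 480, num_patches=10, rng=rng)
```

No annotation is needed at any point: the labels come for free from where the patches were cropped.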
### Fine-tuning process
- **Object detection**: Given an image, the model predicts a set of objects, each with a bounding box and a category. Fine-tuning is the same as for DETR, using multiple object queries (learnable embeddings) as the input of the decoder.
- **One-shot detection**: Given an image and a query image, the model predicts the bounding boxes of objects that are semantically similar to the query image. The features of the query image are extracted by the shared CNN and added to all object queries.
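The difference between the two fine-tuning modes comes down to what the decoder receives. The sketch below uses plain Python lists in place of tensors, and the function name `decoder_inputs` is hypothetical. It shows only the "add the query feature to every object query" step, not the CNN feature extraction.

```python
def decoder_inputs(object_queries, query_feature=None):
    """Build the decoder input for the two fine-tuning modes.

    object_queries: list of embedding vectors (learnable in a real model).
    query_feature:  None for object detection; for one-shot detection, the
                    feature vector of the query image, added to every query.
    """
    if query_feature is None:
        # Object detection: the object queries are used as they are.
        return [list(q) for q in object_queries]
    for q in object_queries:
        if len(q) != len(query_feature):
            raise ValueError("query_feature must match the embedding size")
    # One-shot detection: element-wise sum with each object query.
    return [[qi + fi for qi, fi in zip(q, query_feature)] for q in object_queries]


object_queries = [[0.1, 0.2], [0.3, 0.4]]
query_feature = [1.0, -1.0]

detection_input = decoder_inputs(object_queries)
one_shot_input = decoder_inputs(object_queries, query_feature)
```

Everything else in the model stays the same between the two modes, which is what makes the fine-tuning view unified.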
### Experimental results
- **PASCAL VOC dataset**: UP-DETR significantly improves on DETR at 150 epochs, with gains of 6.2 points in AP, 5.2 in AP50, and 7.5 in AP75. At 300 epochs, the improvement remains clear.
- **COCO dataset**: UP-DETR also improves on DETR on COCO, with faster convergence and higher average precision.
Overall, UP-DETR outperforms the original DETR in accuracy and also trains markedly more efficiently.
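For readers less familiar with the metrics reported above, AP50 and AP75 count a predicted box as a match only when its intersection over union (IoU) with a ground-truth box is at least 0.5 or 0.75, respectively. The sketch below computes IoU for two boxes in (x0, y0, x1, y1) format. The example boxes are made up for illustration and do not come from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (0, 0, 10, 6)
overlap = iou(ground_truth, prediction)  # 60 / 100 = 0.6

matches_at_50 = overlap >= 0.50  # True: counts toward AP50
matches_at_75 = overlap >= 0.75  # False: does not count toward AP75
```

A larger gain in AP75 than in AP50 therefore suggests that the improvement shows up most in how tightly boxes are localized.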