Abstract: Object detection technology plays a crucial role in people's everyday lives, as well as in enterprise production and modern national defense. Most current object detection networks, such as YOLOX, use convolutional neural networks rather than a Transformer as the backbone. However, these networks lack a global understanding of the image and may lose meaningful information, such as the precise location of the most active feature detector. Recently, Transformers, with their larger receptive fields, have outperformed comparable convolutional neural networks in computer vision tasks. A vision Transformer splits the image into patches and feeds them to the model as a sequence, much like word embeddings, which allows it to model the entire image globally. However, simply using a Transformer with a larger receptive field raises several concerns. For example, self-attention in the Swin Transformer backbone limits its ability to model long-range relations, resulting in poor feature extraction and slow convergence during training. To address these problems, we first propose an important-region-based Reconstructed Deformable Self-Attention that shifts attention to important regions for efficient global modeling. Second, based on the Reconstructed Deformable Self-Attention, we propose the Swin Deformable Transformer backbone, which improves feature extraction ability and convergence speed. Finally, based on the Swin Deformable Transformer backbone, we propose a novel object detection network, namely Swin Deformable Transformer-BiPAFPN-YOLOX. Experimental results on the COCO dataset show that the training period is reduced by 55.4%, average precision is increased by 2.4%, average precision of small objects is increased by 3.7%, and inference speed is increased by 35%.
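
The patch-splitting step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `split_into_patches`, the toy 4x4 image, and the patch size of 2 are assumptions chosen for illustration, and plain Python lists are used only to keep the example dependency-free. Each flattened patch plays the role of one token in the sequence passed to the Transformer.

```python
def split_into_patches(image, patch_size):
    """Split an H x W image (a list of rows) into non-overlapping patches.

    Returns the patches in row-major order. Each patch is a flat list of
    patch_size * patch_size pixel values, i.e. one token of the sequence.
    (Hypothetical helper, not taken from the paper.)
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size block into one token.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel image with pixel values 0..15 (assumed input).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = split_into_patches(image, patch_size=2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```

In a real vision Transformer each flattened patch would additionally be projected by a learned linear layer and combined with positional information before attention is applied; that part is omitted here.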

SwinSOD: Salient object detection using swin-transformer

Swin Transformer-Based Edge Guidance Network for RGB-D Salient Object Detection

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

Object Detection Based on Swin Deformable Transformer-BiPAFPN-YOLOX

DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection

SSTrans-Net: Smart Swin Transformer Network for medical image segmentation

DFTR: Depth-supervised Fusion Transformer for Salient Object Detection

P-Swin: Parallel Swin transformer multi-scale semantic segmentation network for land cover classification

EM-Trans: Edge-Aware Multimodal Transformer for RGB-D Salient Object Detection

SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection

Swin-Transformer-Based YOLOv5 for Small-Object Detection in Remote Sensing Images

An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation

SwinFG: A fine-grained recognition scheme based on swin transformer

Semantic feature-guided and correlation-aggregated salient object detection

SFRSwin: A Shallow Significant Feature Retention Swin Transformer for Fine-Grained Image Classification of Wildlife Species