Abstract: What makes object detection in video less accurate than in still images? Some video frames suffer from degraded appearance caused by fast motion, camera defocus, and changes in pose. These challenges have made video object detection (VID) a growing area of research in recent years. Video object detection has various healthcare applications, such as detecting and tracking tumors in medical imaging, monitoring the movement of patients in hospitals and long-term care facilities, and analyzing videos of surgeries to improve technique and training. It can also be used in telemedicine to help diagnose and monitor patients remotely. Existing VID techniques rely on recurrent neural networks or optical flow to aggregate features across frames, producing more reliable features for detection. Some of these methods aggregate features at the full-sequence level, while others aggregate from nearby frames only. To create feature maps, existing VID techniques typically use Convolutional Neural Networks (CNNs) as the backbone network. Vision Transformers, however, have outperformed CNNs in various vision tasks, including image classification and object detection in still images. In this work, we propose using Swin Transformer, a state-of-the-art Vision Transformer, as an alternative to CNN-based backbone networks for object detection in videos. The proposed architecture improves the accuracy of existing VID methods. We evaluate the proposed method on the ImageNet VID and EPIC KITCHENS datasets. Our method achieves 84.3% mean average precision (mAP) on ImageNet VID while using less memory than other leading VID techniques. The source code is available at https://github.com/amaharek/SwinVid.
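
The idea of aggregating features from nearby frames, mentioned in the abstract, can be illustrated with a minimal sketch. This is not the SwinVid implementation: the function names `cosine` and `aggregate`, the `window` parameter, and the use of softmax-normalized cosine similarity as weights are all assumptions chosen for illustration. Each frame is represented by a plain feature vector, standing in for the output of a backbone network.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def aggregate(features, t, window=1):
    """Similarity-weighted average of the features around frame t.

    `features` is a list of per-frame feature vectors (hypothetical
    backbone outputs); `window` is the number of neighbors on each side.
    """
    lo = max(0, t - window)
    hi = min(len(features), t + window + 1)
    neighbors = features[lo:hi]

    # Frames that resemble the reference frame receive larger weights.
    sims = [cosine(features[t], f) for f in neighbors]

    # Softmax over similarities so the weights sum to 1.
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(features[t])
    return [sum(w * f[d] for w, f in zip(weights, neighbors)) for d in range(dim)]
```

Under these assumptions, a degraded frame borrows information from similar neighboring frames, while dissimilar neighbors contribute less to the aggregated feature.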

SwinVI: 3D Swin Transformer Model with U-net for Video Inpainting

SwinVid: Enhancing Video Object Detection Using Swin Transformer

SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection

SwinIR: Image Restoration Using Swin Transformer

Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization

Reinforced Swin-Convs Transformer for Underwater Image Enhancement

Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

SwinMM: Masked Multi-view with Swin Transformers for 3D Medical Image Segmentation

Residual SwinV2 transformer coordinate attention network for image super resolution

SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images

Resolution enhancement processing on low quality images using swin transformer based on interval dense connection strategy

Memorizing Swin-Transformer Denoising Network for Diffusion Model

SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection

Learning Joint Spatial-Temporal Transformations for Video Inpainting

Image Super-resolution Reconstruction Network based on Enhanced Swin Transformer via Alternating Aggregation of Local-Global Features

An efficient swin transformer-based method for underwater image enhancement

SwinHCST: a deep learning network architecture for scene classification of remote sensing images based on improved CNN and Transformer

Swin Transformer V2: Scaling Up Capacity and Resolution