Illicit object detection in X-ray images using Vision Transformers

Jorgen Cani, Ioannis Mademlis, Adamantia Anna Rebolledo Chrysochoou, Georgios Th. Papadopoulos
2024-04-29
Abstract: Illicit object detection is a critical task performed at various high-security locations, including airports, train stations, subways, and ports. The continuous and tedious work of examining thousands of X-ray images per hour can be mentally taxing. Thus, Deep Neural Networks (DNNs) can be used to automate the X-ray image analysis process, improve efficiency and alleviate the security officers' inspection burden. The neural architectures typically utilized in relevant literature are Convolutional Neural Networks (CNNs), with Vision Transformers (ViTs) rarely employed. In order to address this gap, this paper conducts a comprehensive evaluation of relevant ViT architectures on illicit item detection in X-ray images. This study utilizes both Transformer and hybrid backbones, such as SWIN and NextViT, and detectors, such as DINO and RT-DETR. The results demonstrate the remarkable accuracy of the DINO Transformer detector in the low-data regime, the impressive real-time performance of YOLOv8, and the effectiveness of the hybrid NextViT backbone.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how Vision Transformers can be used to automatically detect illicit items in X-ray images at high-security locations such as airports, railway stations, subway stations and ports. Specifically, the research aims to:

1. **Reduce the burden of manual inspection**: Automating X-ray image analysis lightens the workload of security officers, who must otherwise check thousands of X-ray images per hour by hand. This improves efficiency and reduces wrong decisions caused by fatigue.
2. **Improve detection accuracy**: The study evaluates Vision Transformers (ViTs) and hybrid architectures (such as SWIN and NextViT) on illicit item detection in X-ray images, especially in low-data settings. This fills a gap in the existing literature, which relies mainly on Convolutional Neural Networks (CNNs) and rarely employs Transformer architectures.
3. **Achieve real-time performance**: The study explores models that offer fast inference while preserving detection accuracy, such as YOLOv8 and RT-DETR, to meet the real-time requirements of practical deployments.
4. **Address the challenges specific to X-ray images**: These include occlusion caused by stacked objects, complex backgrounds, the difficulty of distinguishing similar objects, and the influence of specific materials on image appearance, all of which criminals may exploit to hide contraband.

To achieve these goals, the research systematically evaluates different Vision Transformer architectures on the illicit item detection task and verifies their effectiveness and efficiency through experiments.
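
The real-time goal above comes down to measuring how quickly a detector processes images. The following is a minimal sketch of such a measurement, not the paper's benchmarking procedure: `measure_throughput`, `dummy_detector`, the `Detection` tuple layout and the warm-up count are all hypothetical names and choices made for illustration, and the dummy detector merely stands in for a real model such as YOLOv8 or RT-DETR.

```python
import time
from typing import Callable, List, Sequence, Tuple

# A detection is (label, score, (x1, y1, x2, y2)); this layout is an assumption.
Detection = Tuple[str, float, Tuple[float, float, float, float]]


def measure_throughput(
    detect: Callable[[object], List[Detection]],
    images: Sequence[object],
    warmup: int = 2,
) -> Tuple[float, float]:
    """Return (mean latency in ms, images per second) for `detect` over `images`."""
    if not images:
        raise ValueError("images must not be empty")
    # Warm-up calls are excluded from timing (caches, lazy initialisation).
    for image in images[:warmup]:
        detect(image)
    start = time.perf_counter()
    for image in images:
        detect(image)
    elapsed = time.perf_counter() - start
    mean_latency_ms = 1000.0 * elapsed / len(images)
    fps = len(images) / elapsed if elapsed > 0 else float("inf")
    return mean_latency_ms, fps


def dummy_detector(image: object) -> List[Detection]:
    """Hypothetical stand-in for a real detector; returns one fixed box."""
    time.sleep(0.001)  # simulate a small amount of inference work
    return [("item", 0.9, (10.0, 20.0, 50.0, 80.0))]


if __name__ == "__main__":
    latency_ms, fps = measure_throughput(dummy_detector, [None] * 20)
    print(f"mean latency: {latency_ms:.2f} ms, throughput: {fps:.1f} images/s")
```

In practice the dummy function would be replaced by a call into the model under test, and the measurement would be repeated on the target hardware, since latency figures do not transfer between devices.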