Multi‐Scale Feature Attention‐DEtection TRansformer: Multi‐Scale Feature Attention for security check object detection

Haifeng Sima, Bailiang Chen, Chaosheng Tang, Yudong Zhang, Junding Sun
DOI: https://doi.org/10.1049/cvi2.12267
IF: 1.484
2024-01-18
IET Computer Vision
Abstract: The authors use dilated convolutions with multiple dilation rates to build a pyramid feature extraction structure and encapsulate that structure in self‐attention. The new attention module is called Multi‐Scale Feature Attention (MSFA). MSFA can fuse object size information into feature maps, alleviating the problems caused by object size disparity in X‐ray images. A module for foreground sequence extraction is also proposed. The module combines multi‐branch channel attention and self‐attention to extract important feature sequences from feature maps. These feature sequences serve as prior knowledge for object queries to avoid semantic gaps. To comprehensively evaluate the performance of Multi‐Scale Feature Attention‐DEtection TRansformer, experiments are conducted on two multi‐category X‐ray datasets. The experimental results show that the model achieves state‐of‐the‐art performance on both the CLCXray and PIDray datasets.
X‐ray security checks aim to detect contraband in luggage; however, detection accuracy is hindered by the overlap and the significant size differences of objects in X‐ray images. To address these challenges, the authors introduce a novel network model named Multi‐Scale Feature Attention (MSFA)‐DEtection TRansformer (DETR). First, a pyramid feature extraction structure is embedded into the self‐attention module; the resulting module is referred to as MSFA. Using the MSFA module, MSFA‐DETR extracts multi‐scale feature information and merges it into high‐level semantic features. These features are then combined through attention mechanisms to capture correlations between global information and multi‐scale features. MSFA significantly improves the model's robustness across object sizes, thereby enhancing detection accuracy. In addition, a new initialisation method for object queries is proposed. The authors' foreground sequence extraction (FSE) module extracts key feature sequences from feature maps, which serve as prior knowledge for object queries. FSE speeds up the convergence of the DETR model and raises detection accuracy. Extensive experiments show that the proposed model surpasses state‐of‐the‐art methods on the CLCXray and PIDray datasets.
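The idea of a pyramid built from dilated convolutions with several dilation rates can be illustrated with a minimal 1‐D sketch in plain Python. This is not the authors' implementation: the kernel, the dilation rates (1, 2, 3), the fusion by element‐wise sum, and all function names are assumptions chosen for illustration, and the self‐attention part of MSFA is omitted entirely. It only shows that a fixed‐size kernel covers a wider span as the dilation rate grows, which is what lets parallel branches respond to objects of different sizes.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1-D dilated convolution with zero padding (odd kernel length assumed)."""
    half = (len(kernel) - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` samples apart around position i.
            idx = i + (j - half) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Span of input samples covered by one dilated kernel."""
    return 1 + (kernel_size - 1) * dilation


def multi_scale_features(signal, kernel, dilations=(1, 2, 3)):
    """One branch per dilation rate, fused by element-wise sum (assumed fusion rule)."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    kernel = [1, 1, 1]
    for d in (1, 2, 3):
        print(d, receptive_field(len(kernel), d), dilated_conv1d(impulse, kernel, d))
    print(multi_scale_features(impulse, kernel))
```

With a 3‐tap kernel, the three branches cover 3, 5, and 7 input samples respectively, so the fused output mixes responses from narrow and wide contexts at every position.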
computer science, artificial intelligence; engineering, electrical & electronic
What problem does this paper attempt to address?