Object Detection Via Multi-Scale Token Based on Vision Transformer.

Yu Xiao, Tao Qiu, Xinqi Jiang, Qi Yang, Zhaowei Shang, Taiping Zhang
DOI: https://doi.org/10.1109/smc53992.2023.10393952
2023-01-01
Abstract: Visual transformers have achieved impressive performance on object detection. Traditional transformers focus only on multi-scale features across tokens. They do not attend to the fine-grained features inside a single token, which can lead to the loss of semantic information in the object detection task. To address this issue, we propose a novel network that consists of three components. (1) The Internal Multiscale Token Module (IMTM) focuses on the receptive field size of each token and transforms the token dimension size to extract more multiscale features within the self-attention layer, thereby improving the performance and generalization ability of the model. (2) The Differential Filter Module (DFM) uses a convolutional network to focus on high-frequency information in the image, helping the Transformer learn edge features and establish local context, while improving model performance through residual connections. (3) The Feature Fusion Module (FFM) enhances the local and global information extracted by the network by fusing information from different dimensions. Extensive experiments on PASCAL VOC show that the proposed method achieves state-of-the-art performance on object detection.
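The abstract does not give the DFM's kernel or layer layout, so the following is only a minimal sketch of the general idea it names: a high-pass (edge-emphasizing) convolution whose output is added back to the input through a residual connection. The 3x3 Laplacian kernel, the edge-replicating padding, and the function names `high_pass` and `residual_high_pass` are all assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical illustration only; NOT the authors' DFM implementation.
# Assumed: a fixed 3x3 Laplacian kernel as the high-frequency filter.
LAPLACIAN = [
    [0, -1, 0],
    [-1, 4, -1],
    [0, -1, 0],
]


def high_pass(image):
    """Filter a 2D grid with the Laplacian kernel (edge-replicating padding)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp neighbor coordinates so borders reuse edge pixels.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += LAPLACIAN[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


def residual_high_pass(image):
    """Add the high-frequency response back to the input (residual connection)."""
    hp = high_pass(image)
    return [
        [image[y][x] + hp[y][x] for x in range(len(image[0]))]
        for y in range(len(image))
    ]


# A vertical step edge: flat regions give no response, the edge is emphasized.
step = [[0, 0, 1, 1] for _ in range(3)]
edges = high_pass(step)
sharpened = residual_high_pass(step)
```

In this toy setting the filter responds only at the boundary between the two flat regions, which is the sense in which a high-frequency filter highlights edges, while the residual path keeps the original signal intact. A learned convolutional module such as the one the abstract describes would presumably replace the fixed kernel with trained weights.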