Wavelet Tree Transformer: Multihead Attention With Frequency-Selective Representation and Interaction for Remote Sensing Object Detection

Jiahao Pan, Chu He, Wei Huang, Jidong Cao, Ming Tong
DOI: https://doi.org/10.1109/tgrs.2024.3442575
IF: 8.2
2024-09-02
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Vision Transformer has achieved remarkable success in image recognition tasks owing to its global modeling ability. However, its quadratic computational complexity becomes a prominent issue when dealing with high-resolution remote sensing images. Numerous studies have explored the potential of spectral analysis to reveal more discriminative features. However, neural networks exhibit frequency preferences, and different features favor different frequencies. Unfortunately, there is no well-established criterion for selecting appropriate frequency representations. To address these issues, a novel wavelet tree head attention (WTHA-ViT) model is proposed, which combines a tree structure on the wavelet frequencies with multihead attention in the Transformer encoder and can therefore model interactions across combinations of short- and long-range as well as high- and low-frequency components. First, we construct a wavelet tree reduction module (WTRM) based on the wavelet tree structure, utilizing wavelet decomposition to retain the frequency features suitable for each patch, which enables global modeling with various frequency components while reducing computational complexity. Second, guided by channel correlations, we propose the channel lifting scheme multihead attention (CLSMHA) to model the importance of the heads in multihead attention and focus on the more salient head features. Finally, our WTHA-ViT can replace the backbone of detection networks for dense prediction tasks. Extensive experiments on the DOTA-V1.0 and HRSID datasets demonstrate that our model exhibits superior performance and robustness compared to state-of-the-art networks. In addition, we evaluate the transferability of the model on the DIOR and LEVIR datasets and verify its generalization ability. The code is available at https://github.com/conquer-pan/WTHA-ViT.
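To make the complexity argument in the abstract concrete, the following is a minimal sketch, not the authors' WTRM. It applies a fixed single-level 2D Haar transform (the paper's module selects frequency components per patch through a tree, which is not reproduced here) and counts attention scores under the assumption that only the key/value tokens are taken from a half-resolution sub-band. The function names `haar_dwt2` and `attention_score_count` are hypothetical and chosen only for illustration.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array.

    H and W must be even. Returns four sub-bands, each of shape (H/2, W/2):
    the low-frequency average and three detail bands.
    """
    a = x[0::2, 0::2]  # top-left sample of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0        # low-frequency approximation
    d_rows = (a + b - c - d) / 2.0     # difference between the two rows
    d_cols = (a - b + c - d) / 2.0     # difference between the two columns
    d_diag = (a - b - c + d) / 2.0     # diagonal difference
    return low, d_rows, d_cols, d_diag


def attention_score_count(h, w, reduce_kv=False):
    """Number of query-key scores for an h x w token grid.

    Assumption: with reduce_kv=True, keys and values come from one
    half-resolution sub-band while queries stay at full resolution.
    """
    n_q = h * w
    n_kv = (h // 2) * (w // 2) if reduce_kv else n_q
    return n_q * n_kv


if __name__ == "__main__":
    feature_map = np.arange(16.0).reshape(4, 4)
    bands = haar_dwt2(feature_map)
    print([band.shape for band in bands])
    print(attention_score_count(64, 64), attention_score_count(64, 64, reduce_kv=True))
```

Each sub-band has one quarter of the original positions, so under the stated assumption the score matrix for an even-sized grid shrinks by a factor of four per decomposition level, while the detail bands keep the high-frequency information that plain average pooling would discard.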
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics