HA-Transformer: Harmonious aggregation from local to global for object detection
Yang Chen, Sihan Chen, Yongqiang Deng, Kunfeng Wang
DOI: https://doi.org/10.1016/j.eswa.2023.120539
IF: 8.5
2023-06-09
Expert Systems with Applications
Abstract: Recently, the Vision Transformer (ViT), with its global modeling capability, has shown excellent performance on classification tasks and has opened new directions for a range of vision tasks. However, because multi-head self-attention is so expensive, reducing computational cost while preserving global interaction remains a major challenge. In this paper, we propose a new architecture that establishes an end-to-end connection from local to global via bridge tokens, so that global interaction is completed at the window level, which effectively addresses the quadratic complexity of the transformer. In addition, we consider a hierarchy of information from short-distance to long-distance and add a transition module between the local and global stages so that information is aggregated more harmoniously. We name the proposed method HA-Transformer. Experimental results on the COCO dataset show that HA-Transformer performs excellently on object detection, outperforming several state-of-the-art methods.
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic; Operations Research & Management Science
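
The abstract describes global interaction being carried out at the window level through bridge tokens. The sketch below is one possible reading of that idea, not the authors' implementation: the abstract does not say how bridge tokens are formed or how the transition module works, so the mean pooling, the absence of learned projections, and the names `local_to_global` and `attention_pairs` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Plain scaled dot-product attention (no learned projections).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def local_to_global(tokens, window):
    """Hypothetical local-to-global aggregation over a 1-D token sequence.

    tokens: array of shape (n, d); n must be divisible by `window`.
    """
    n, d = tokens.shape
    assert n % window == 0, "sequence length must be a multiple of the window"
    windows = tokens.reshape(n // window, window, d)

    # Local stage: attention restricted to each window.
    local = attention(windows, windows, windows)  # (n_windows, window, d)

    # ASSUMPTION: one bridge token per window, formed by mean pooling.
    bridge = local.mean(axis=1)  # (n_windows, d)

    # Global stage: bridge tokens attend to each other (window-level).
    bridge = attention(bridge, bridge, bridge)  # (n_windows, d)

    # ASSUMPTION: global context is added back to every token in its window.
    out = local + bridge[:, None, :]
    return out.reshape(n, d)


def attention_pairs(n, window):
    # Number of query-key pairs: full attention vs. local plus bridge attention.
    full = n * n
    windowed = n * window + (n // window) ** 2
    return full, windowed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 8))
    y = local_to_global(x, window=8)
    print(y.shape)
    print(attention_pairs(4096, 64))
```

Under these assumptions, a sequence of 4096 tokens with a window of 64 needs 266,240 query-key pairs instead of 16,777,216 for full self-attention, which illustrates why moving global interaction to the window level reduces cost.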