Improved Object Detection with Content and Position Separation in Transformer

Yao Wang, Jong-Eun Ha
DOI: https://doi.org/10.3390/rs16020353
IF: 5
2024-01-17
Remote Sensing
Abstract: In object detection, Transformer-based models such as DETR have exhibited state-of-the-art performance, capitalizing on the attention mechanism to handle spatial relations and feature dependencies. One inherent challenge these models face is the intertwined handling of content and positional data within their attention spans, potentially blurring the specificity of the information retrieval process. We consider object detection as a comprehensive task, and simultaneously merging content and positional information like before can exacerbate task complexity. This paper presents the Multi-Task Fusion Detector (MTFD), a novel architecture that innovatively dissects the detection process into distinct tasks, addressing content and position through separate decoders. By utilizing assumed fake queries, the MTFD framework enables each decoder to operate under a presumption of known ancillary information, ensuring more specific and enriched interactions with the feature map. Experimental results affirm that this methodical separation followed by a deliberate fusion not only simplifies the task difficulty of the detection process but also augments accuracy and clarifies the details of each component, providing a fresh perspective on object detection in Transformer-based architectures.
environmental sciences; imaging science & photographic technology; remote sensing; geosciences, multidisciplinary
What problem does this paper attempt to address?
This paper addresses a problem in Transformer-based object detection models: content information (such as an object's class and texture) and position information (where the object lies in the image) are intertwined in the attention mechanism, which increases task complexity and blurs the specificity of the information retrieval process. The authors argue that methods following the DETR paradigm usually treat object detection as a single task and merge position and classification information when querying the feature map. As a result, the intermediate stages of the model always face a mixed localization-and-classification task, which raises task difficulty and may mask the richness of content attributes such as complex textures, patterns, and color gradients.
To solve this problem, the paper proposes the Multi-Task Fusion Detector (MTFD), an architecture that decomposes the detection process into distinct tasks and handles content and position information in separate decoders. By using assumed fake queries, the MTFD framework lets each decoder operate under the presumption that the ancillary information is already known, ensuring more specific and enriched interactions with the feature map. According to the paper, this separation followed by fusion reduces the difficulty of the detection task, improves accuracy, and clarifies the details of each component, providing a new perspective on object detection in Transformer-based architectures.
The main contributions of the paper are:
1. A multi-task object detection framework that separates the content query task from the position query task, jointly optimizes the object detection task and its subtasks, and ensures that the subtasks do not interfere with each other.
2. Task-specific loss functions and an iterative training method.
3. Comprehensive evaluations on leading datasets, which confirm the model's performance in terms of accuracy, object understanding ability, and scalability, while increasing the interpretability of the model's internal components.
Through these innovations, the paper aims to improve the accuracy and robustness of object detection models, especially in complex scenes.
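To make the idea of separate content and position decoders more concrete, the following is a minimal toy sketch in plain Python. It is not the authors' implementation. The summary does not say how the fake queries are built or how the two branches are fused, so several choices here are assumptions made purely for illustration: the function name `toy_mtfd_step`, the use of fixed placeholder vectors as fake queries, single-head dot-product cross-attention as the "decoder", and concatenation as the fusion step.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def toy_mtfd_step(features, pos_embed, content_q, position_q,
                  fake_content_q, fake_position_q):
    """Hypothetical one-step illustration of content/position separation."""
    # Keys combine image features with positional embeddings; values are
    # the raw features.
    keys = add(features, pos_embed)
    # Content branch: real content queries, with the position part replaced
    # by a fake query standing in for "position assumed known".
    content_out = cross_attend(add(content_q, fake_position_q), keys, features)
    # Position branch: real position queries, with the content part replaced
    # by a fake query standing in for "content assumed known".
    position_out = cross_attend(add(fake_content_q, position_q), keys, features)
    # Fusion by concatenation (an assumption; the summary does not specify it).
    fused = [c + p for c, p in zip(content_out, position_out)]
    return content_out, position_out, fused


# Toy usage with random data: 6 feature-map locations, 3 queries, width 4.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


features = rand_matrix(6, 4)
pos_embed = rand_matrix(6, 4)
content_q, position_q = rand_matrix(3, 4), rand_matrix(3, 4)
fake_content_q, fake_position_q = rand_matrix(3, 4), rand_matrix(3, 4)

content_out, position_out, fused = toy_mtfd_step(
    features, pos_embed, content_q, position_q, fake_content_q, fake_position_q
)
```

In this sketch each branch queries the same feature map, but only one half of its query is treated as task-specific, which mirrors the summary's description of each decoder working as if the other kind of information were already known. A real model would add learned projections, multiple heads and layers, and the task-specific losses mentioned above.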