Abstract: In object detection, Transformer-based models such as DETR have exhibited state-of-the-art performance, capitalizing on the attention mechanism to handle spatial relations and feature dependencies. One inherent challenge these models face is that content and positional data are handled together within the same attention operations, which can blur the specificity of the information retrieval process. We view object detection as a composite task, and merging content and positional information simultaneously, as prior methods do, can exacerbate task complexity. This paper presents the Multi-Task Fusion Detector (MTFD), a novel architecture that decomposes the detection process into distinct tasks, addressing content and position through separate decoders. By utilizing assumed fake queries, the MTFD framework enables each decoder to operate under a presumption of known ancillary information, ensuring more specific and enriched interactions with the feature map. Experimental results affirm that this methodical separation followed by a deliberate fusion not only reduces the difficulty of the detection process but also improves accuracy and clarifies the role of each component, providing a fresh perspective on object detection in Transformer-based architectures.

What problem does this paper attempt to address?

The paper addresses a problem in Transformer-based object detection models: content information (such as an object's class and texture) and location information (the object's position in the image) are intertwined in the attention mechanism, which increases task complexity and blurs the specificity of the information retrieval process. The authors argue that methods in the DETR paradigm usually treat object detection as a single task and integrate location and classification information simultaneously when querying the feature map. As a result, the intermediate stages of the model always face a mixed localization-and-classification task, which raises task difficulty and may mask the richness of content attributes such as complex textures, patterns, and color gradients.

To solve this problem, the paper proposes the Multi-Task Fusion Detector (MTFD), a novel architecture that decomposes the detection process into distinct tasks and processes content and location information through separate decoders. By using assumed fake queries, the MTFD framework lets each decoder operate under the assumption that the auxiliary information is already known, ensuring more specific and richer interactions with the feature map. According to the paper, this approach reduces the difficulty of the detection process, improves accuracy, and clarifies the role of each component, providing a new perspective on object detection in Transformer-based architectures.

Specifically, the main contributions of the paper include:

1. A multi-task object detection framework that separates the content query task from the location query task, jointly optimizes the object detection task and its subtasks, and ensures that the subtasks do not interfere with each other.
2. Task-specific loss functions and an iterative training method.
3. Comprehensive evaluations on leading datasets, which the paper reports confirm the model's strong performance in accuracy, object understanding, and scalability, while increasing the interpretability of the model's internal components.

Through these innovations, the paper aims to improve the accuracy and robustness of object detection models, especially in complex scenes.
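The decoupling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify how fake queries are combined with real queries or how the two decoder outputs are fused, so the additive combination, the concatenation-based fusion, the single-head attention, and every name below (`attend`, `fake_position_q`, `fake_content_q`, and so on) are assumptions made purely for illustration.

```python
import math
import random


def attend(queries, features):
    """Single-head dot-product cross-attention: each query reads the feature map."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w / total * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


def add_rows(a, b):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng):
    """Random stand-in for learned parameters or backbone features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_queries, dim, num_tokens = 4, 8, 16

features = rand_matrix(num_tokens, dim, rng)        # flattened feature map
content_q = rand_matrix(num_queries, dim, rng)      # real content queries
position_q = rand_matrix(num_queries, dim, rng)     # real position queries
fake_position_q = rand_matrix(num_queries, dim, rng)  # hypothetical "assumed known" position
fake_content_q = rand_matrix(num_queries, dim, rng)   # hypothetical "assumed known" content

# Content decoder: position is treated as already known via the fake queries.
content_out = attend(add_rows(content_q, fake_position_q), features)
# Position decoder: content is treated as already known via the fake queries.
position_out = attend(add_rows(position_q, fake_content_q), features)

# Fusion step (assumed here to be concatenation) feeding a final detection head.
fused = [c + p for c, p in zip(content_out, position_out)]
print(len(fused), len(fused[0]))
```

The point of the sketch is only the data flow: each branch queries the same feature map for one kind of information while the other kind is supplied as a placeholder, and the two results are combined afterwards rather than being mixed inside a single attention pass.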

Improved Object Detection with Content and Position Separation in Transformer

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence

DETR++: Taming Your Multi-Scale Detection Transformer

DA-DETR: Domain Adaptive Detection Transformer with Information Fusion

SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection

Focus-Attention Approach in Optimizing DETR for Object Detection from High-Resolution Images

Efficient Decoder-Free Object Detection with Transformers

Anchor DETR: Query Design for Transformer-Based Detector

DETR-ORD: An Improved DETR Detector for Oriented Remote Sensing Object Detection with Feature Reconstruction and Dynamic Query

Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images

Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

An Extendable, Efficient and Effective Transformer-based Object Detector

End-to-End Object Detection with Transformers

Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment

Introducing Depth into Transformer-based 3D Object Detection

FP-DETR: Detection Transformer Advanced by Fully Pre-training

Multi Self-supervised Pre-fine-tuned Transformer Fusion for Better Intelligent Transportation Detection

DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection