Cross Resolution Encoding-Decoding For Detection Transformers

Ashish Kumar, Jaesik Park
2024-10-05
Abstract: Detection Transformers (DETR) are renowned object detection pipelines; however, computationally efficient multiscale detection using DETR is still challenging. In this paper, we propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that allows DETR to achieve the accuracy of high-resolution detection while having the speed of low-resolution detection. CRED is based on two modules: the Cross Resolution Attention Module (CRAM) and One Step Multiscale Attention (OSMA). CRAM is designed to transfer the knowledge of the low-resolution encoder output to a high-resolution feature, while OSMA is designed to fuse multiscale features in a single step and produce a feature map of a desired resolution enriched with multiscale information. When used in prominent DETR methods, CRED delivers accuracy similar to the high-resolution DETR counterpart with roughly 50% fewer FLOPs. Specifically, state-of-the-art DN-DETR, when used with CRED (called CRED-DETR), becomes 76% faster, with ~50% fewer FLOPs than its high-resolution counterpart, which requires 202 GFLOPs on the MS-COCO benchmark. We plan to release pretrained CRED-DETRs for use by the community. Code: https://github.com/ashishkumar822/CRED-DETR
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the computational cost of multiscale detection with Detection Transformers (DETR). The authors propose a Cross-Resolution Encoding-Decoding (CRED) mechanism that lets DETR reach the accuracy of high-resolution detection at the speed of low-resolution detection. CRED consists of two modules:

1. **Cross Resolution Attention Module (CRAM)**: transfers knowledge from the low-resolution encoder output to a high-resolution feature.
2. **One Step Multiscale Attention (OSMA)**: fuses multiscale features in a single step and produces a feature map at the desired resolution, enriched with multiscale information.

When applied to existing DETR methods, CRED delivers accuracy similar to the high-resolution counterpart with roughly 50% fewer FLOPs. For example, on the MS-COCO benchmark, DN-DETR with CRED (CRED-DETR) is 76% faster and uses about 50% fewer FLOPs than its high-resolution counterpart, which requires 202 GFLOPs.
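
To make the idea of transferring low-resolution encoder output to a high-resolution feature more concrete, here is a minimal sketch of generic scaled dot-product cross-attention in plain Python. It is not the authors' CRAM implementation: the function name `cross_resolution_attention`, the toy token values, and the choice to use high-resolution tokens as queries and low-resolution tokens as keys and values are all assumptions made for illustration, and learned projections are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_resolution_attention(high_res_tokens, low_res_tokens):
    """Hypothetical helper: each high-resolution token (query) attends over
    all low-resolution tokens (used as both keys and values).

    Learned query/key/value projections are left out to keep the sketch short.
    """
    dim = len(low_res_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, all_weights = [], []
    for q in high_res_tokens:
        # Similarity of this query to every low-resolution token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in low_res_tokens]
        weights = softmax(scores)
        # Weighted average of the low-resolution tokens.
        mixed = [sum(w * v[d] for w, v in zip(weights, low_res_tokens))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data (assumed): 2 low-resolution tokens, 8 high-resolution tokens, 2 channels.
low_res = [[1.0, 0.0], [0.0, 1.0]]
high_res = [[float(i % 2), float((i + 1) % 2)] for i in range(8)]

enriched, weights = cross_resolution_attention(high_res, low_res)
print(len(enriched), len(enriched[0]))
```

In this arrangement the score table has one row per high-resolution token and one column per low-resolution token, so its size grows with the product of the two token counts rather than with the square of the high-resolution token count, which is the intuition behind encoding at low resolution and attending across resolutions.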