Abstract: Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels corresponding to the same object. Clustering reduces the size of the data given as input to the transformer and therefore reduces training time and GPU memory usage, while preserving meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture whose encoder includes a multi-head self-attention computation. We ran extensive experiments using the COCO object detection dataset and three detection transformers. The results show a consistent reduction in the required computational resources in all tested cases, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at https://github.com/GSavathrakis/ENACT


### What problem does this paper attempt to solve?

This paper addresses the heavy consumption of computational resources by object detection models based on the Transformer architecture. Specifically:

1. **Computational complexity of the Transformer**:
   - Transformers perform well on visual object detection, but the attention weight matrix has quadratic size in the input length, i.e., \(O(N^2)\), which makes training resource-intensive.
   - This complexity increases training time and significantly increases GPU memory usage.
2. **Optimizing computational performance**:
   - To address this, the authors propose an entropy-based clustering method called ENACT (Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers).
   - ENACT shrinks the attention input by clustering the Key and Value, which lowers computational complexity and memory usage while retaining meaningful information for the subsequent network layers.
3. **Maintaining detection accuracy**:
   - The reduction in resources comes at only a small cost in detection accuracy. In experiments on several detection Transformer models, the method consistently saved resources, with a slight but acceptable drop in detection precision.

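The quadratic growth described in the list above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `attention_weight_entries` and the sizes `n_tokens` and `n_clusters` are hypothetical and are not taken from the paper. It counts the entries of an attention weight matrix with one row per query and one column per key, before and after the number of keys is reduced.

```python
def attention_weight_entries(num_queries: int, num_keys: int, num_heads: int = 1) -> int:
    """Count the scalar entries of the attention weight matrices of one layer.

    Each head stores one weight per (query, key) pair.
    """
    return num_heads * num_queries * num_keys


# Hypothetical sizes, chosen only for illustration (not results from the paper).
n_tokens = 1000    # flattened feature-map positions fed to the encoder
n_clusters = 250   # keys/values that remain after clustering

full = attention_weight_entries(n_tokens, n_tokens)         # N x N weights
clustered = attention_weight_entries(n_tokens, n_clusters)  # N x N_cl weights
doubled = attention_weight_entries(2 * n_tokens, 2 * n_tokens)

print(f"full attention:      {full} entries")
print(f"clustered attention: {clustered} entries")
print(f"doubling N multiplies the full matrix by {doubled // full}")
```

With unclustered self-attention, doubling the input length quadruples the number of weights, whereas reducing only the keys and values scales the matrix linearly with the number of clusters.
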
### Formula representation

- **Entropy calculation**:

  \[ H(x)=-\sum_{x}p(x)\log(p(x)) \]

  where \(p(x)\) is the probability density function of the feature vector \(x\), computed by a linear layer followed by a softmax function:

  \[ p(x)=\text{softmax}(x\cdot W^T + b) \]

  Here \(W\) is a weight matrix of size \(1\times d\), initialized from the Xavier uniform distribution, and \(b\) is a bias term of size 1.

- **Attention weight calculation**:

  \[ A = \text{softmax}\left(\frac{Q\cdot K^T}{\sqrt{d}}\right) \]

  The attention weight after clustering is calculated as:

  \[ A_{cl}=\text{softmax}\left(\frac{Q\cdot K_{cl}^T}{\sqrt{d}}\right) \]

- **Attention map calculation** (the symbol on the left-hand side is reused for the attention output):

  \[ A = A\cdot V \]

  The attention map after clustering is calculated as:

  \[ A_{cl}=A_{cl}\cdot V_{cl} \]

Because \(K_{cl}\) and \(V_{cl}\) contain fewer rows than \(K\) and \(V\), the weight matrix \(A_{cl}\) is smaller than \(A\). This is where the savings in computation and memory come from, while detection accuracy is largely maintained.
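The formulas above can be traced in a small NumPy sketch. Several parts are assumptions rather than the authors' implementation: the softmax in \(p(x)\) is taken over the \(N\) input positions, the Query/Key/Value projections are omitted, and the grouping rule in `cluster_consecutive` (start a new cluster wherever the self-information of neighbouring positions differs by more than `tol`, then average each cluster) is a simple stand-in, not ENACT's actual clustering procedure. All function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_information(x, W, b):
    """p(x) = softmax(x W^T + b) and the per-position terms -p log p.

    x: (N, d), W: (1, d), b: (1,). Assumption: softmax over the N positions.
    """
    logits = (x @ W.T + b).squeeze(-1)  # (N,)
    p = softmax(logits, axis=0)
    info = -p * np.log(p)               # info.sum() is the entropy H(x)
    return p, info


def cluster_consecutive(values, info, tol):
    """Stand-in clustering (NOT the paper's rule): a new cluster starts where
    neighbouring self-information values differ by more than tol; each
    cluster is replaced by the mean of its rows."""
    jumps = np.abs(np.diff(info)) > tol
    labels = np.concatenate([[0], np.cumsum(jumps)])  # cluster id per position
    n_clusters = int(labels[-1]) + 1
    sums = np.zeros((n_clusters, values.shape[1]))
    np.add.at(sums, labels, values)                   # sum rows per cluster
    counts = np.bincount(labels, minlength=n_clusters)[:, None]
    return sums / counts


def attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(d)); output = A V."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return A, A @ V


rng = np.random.default_rng(0)
N, d = 12, 8                                   # illustrative sizes only
x = rng.normal(size=(N, d))
limit = np.sqrt(6.0 / (d + 1))                 # Xavier uniform bound for a (1, d) weight
W = rng.uniform(-limit, limit, size=(1, d))
b = np.zeros(1)

p, info = self_information(x, W, b)
entropy = info.sum()

Q = K = V = x                                  # learned projections omitted for brevity
tol = np.median(np.abs(np.diff(info)))         # arbitrary threshold for the demo
K_cl = cluster_consecutive(K, info, tol)
V_cl = cluster_consecutive(V, info, tol)

A, out = attention(Q, K, V)
A_cl, out_cl = attention(Q, K_cl, V_cl)
print("A:", A.shape, "A_cl:", A_cl.shape, "output:", out_cl.shape)
```

The output keeps one row per query in both cases; only the number of columns of the weight matrix changes, from \(N\) to the number of clusters.
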

ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers
