ENACT: Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers

Giorgos Savathrakis, Antonis Argyros
2024-09-12
Abstract: Transformers demonstrate competitive performance in terms of precision on the problem of vision-based object detection. However, they require considerable computational resources due to the quadratic size of the attention weights. In this work, we propose to cluster the transformer input on the basis of its entropy. The reason for this is that the self-information of each pixel (whose sum is the entropy) is likely to be similar among pixels corresponding to the same objects. Clustering reduces the size of the data given as input to the transformer and therefore reduces training time and GPU memory usage, while preserving meaningful information to be passed through the remaining parts of the network. The proposed process is organized in a module called ENACT, which can be plugged into any transformer architecture that includes a multi-head self-attention computation in its encoder. We ran extensive experiments using the COCO object detection dataset and three detection transformers. The obtained results demonstrate that, in all tested cases, there is a consistent reduction in the required computational resources, while the precision of the detection task is only slightly reduced. The code of the ENACT module will become available at https://github.com/GSavathrakis/ENACT.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the high computational cost of object detection models based on the Transformer architecture. Specifically:

1. **Computational complexity of the Transformer**:
   - Transformers perform well in visual object detection, but the attention weight matrix is quadratic in the number of input positions \(N\), i.e., \(O(N^2)\), which makes training computationally demanding.
   - This complexity increases both the training time and the GPU memory usage.
2. **Optimizing computational performance**:
   - To address this, the authors propose an entropy-based clustering method called ENACT (Entropy-based Clustering of Attention Input for Improving the Computational Performance of Object Detection Transformers).
   - ENACT clusters the Key and Value inputs of the Transformer, which shrinks the input and therefore reduces computational complexity and memory usage, while aiming to retain the meaningful information passed to subsequent network layers.
3. **Maintaining detection accuracy**:
   - While reducing the consumption of computational resources, ENACT affects the accuracy of the object detection task only slightly. In experiments on the COCO dataset with three detection Transformers, the method consistently reduced resource usage, with a slight decline in detection precision.
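To make the size argument concrete, the following sketch counts the entries of a single attention weight matrix before and after the Keys are clustered. The feature-map size and the number of clusters are hypothetical values chosen for illustration, not figures reported in the paper, and the name `n_clusters` is an assumption.

```python
def attention_weight_count(num_queries: int, num_keys: int) -> int:
    """Entries in one head's attention weight matrix (queries x keys)."""
    return num_queries * num_keys


# Hypothetical sizes, not taken from the paper: a 32x32 feature map gives
# N = 1024 positions; assume clustering leaves 256 Keys/Values.
n_positions = 32 * 32
n_clusters = 256

full = attention_weight_count(n_positions, n_positions)      # N * N
clustered = attention_weight_count(n_positions, n_clusters)  # N * n_clusters
reduction = full / clustered

print(f"full: {full}, clustered: {clustered}, reduction: {reduction}x")
```

Because the Queries are left unclustered in this description, the saving scales linearly with the ratio of positions to clusters rather than quadratically.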
### Formula representation

- **Entropy calculation**:
  \[ H(x)=-\sum_{x}p(x)\log(p(x)) \]
  where \(p(x)\) is the probability assigned to the feature vector \(x\), computed by a linear layer followed by a softmax function:
  \[ p(x)=\text{softmax}(x\cdot W^T + b) \]
  where \(W\) is a weight matrix of size \(1\times d\) initialized from the Xavier uniform distribution, and \(b\) is a bias term of size 1.
- **Attention weight calculation**:
  \[ A = \text{softmax}\left(\frac{Q\cdot K^T}{\sqrt{d}}\right) \]
  The attention weight after clustering is calculated as:
  \[ A_{cl}=\text{softmax}\left(\frac{Q\cdot K_{cl}^T}{\sqrt{d}}\right) \]
- **Attention map calculation**:
  \[ \text{Attn} = A\cdot V \]
  The attention map after clustering is calculated as:
  \[ \text{Attn}_{cl}=A_{cl}\cdot V_{cl} \]

Because \(K_{cl}\) and \(V_{cl}\) contain fewer rows than \(K\) and \(V\), \(A_{cl}\) has fewer columns than \(A\). This is the source of the reduced computation and memory usage, while detection accuracy is only slightly reduced.
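The formulas above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the softmax in \(p(x)\) is taken over the \(N\) positions (the linear layer yields one scalar per position), \(Q\), \(K\) and \(V\) are set equal to the input without learned projections, and `cluster_mean` is a hypothetical stand-in that averages fixed-size runs of consecutive rows. It does not use the entropy to choose the clusters, since the summary does not specify how ENACT derives them.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def entropy_terms(x, W, b):
    """x: (N, d), W: (1, d), b: (1,). Returns p and the terms -p*log(p)."""
    logits = (x @ W.T + b).squeeze(-1)  # one scalar per position, shape (N,)
    p = softmax(logits, axis=0)         # assumption: softmax over positions
    return p, -p * np.log(p)


def cluster_mean(t, group):
    """Hypothetical stand-in: average each run of `group` consecutive rows."""
    n, d = t.shape
    assert n % group == 0, "N must be divisible by the group size"
    return t.reshape(n // group, group, d).mean(axis=1)


def attention(Q, K, V):
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # attention weights
    return A, A @ V                             # weights and attention map


rng = np.random.default_rng(0)
N, d, group = 16, 8, 4                # illustrative sizes only
x = rng.normal(size=(N, d))
limit = np.sqrt(6.0 / (1 + d))        # Xavier uniform bound for a (1, d) weight
W = rng.uniform(-limit, limit, size=(1, d))
b = np.zeros(1)

p, terms = entropy_terms(x, W, b)
entropy = terms.sum()                 # H = -sum p log p

Q = K = V = x                         # simplification: no learned projections
A, out = attention(Q, K, V)
K_cl, V_cl = cluster_mean(K, group), cluster_mean(V, group)
A_cl, out_cl = attention(Q, K_cl, V_cl)

print(A.shape, A_cl.shape, out_cl.shape)
```

With these sizes the weight matrix shrinks from 16x16 to 16x4, while the attention map keeps one row per Query and the same feature dimension.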