LowFormer: Hardware Efficient Design for Convolutional Transformer Backbones

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

2024-09-05

Abstract: Research in efficient vision backbones is evolving into models that are a mixture of convolutions and transformer blocks. A smart combination of both, architecture-wise and component-wise, is mandatory to excel in the speed-accuracy trade-off. Most publications focus on maximizing accuracy and utilize MACs (multiply-accumulate operations) as an efficiency metric. The latter, however, often does not accurately measure how fast a model actually is, due to factors like memory access cost and degree of parallelism. We analyzed common modules and architectural design choices for backbones not in terms of MACs, but rather in actual throughput and latency, as the combination of the latter two is a better representation of the efficiency of models in real applications. We applied the conclusions drawn from that analysis to create a recipe for increasing hardware efficiency in macro design. Additionally, we introduce a simple slimmed-down version of Multi-Head Self-Attention that aligns with our analysis. We combine both macro and micro design to create a new family of hardware-efficient backbone networks called LowFormer. LowFormer achieves a remarkable speedup in terms of throughput and latency, while achieving similar or better accuracy than current state-of-the-art efficient backbones. In order to prove the generalizability of our hardware-efficient design, we evaluate our method on GPU, mobile GPU and ARM CPU. We further show that the downstream tasks object detection and semantic segmentation profit from our hardware-efficient architecture. Code and models are available at https://github.com/altair199797/LowFormer.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses hardware efficiency in the design of efficient vision backbone networks. The authors argue that existing research mostly focuses on maximizing model accuracy and uses multiply-accumulate operations (MACs) as its measure of efficiency. This metric, however, does not always reflect how fast a model actually runs, because MACs ignore factors such as memory access cost and degree of parallelism. The paper therefore analyzes how common modules and architectural design choices affect actual throughput and latency, and turns the conclusions into a recipe for hardware-efficient macro design. On top of this, it introduces a slimmed-down version of Multi-Head Self-Attention that is consistent with the analysis. Combining both yields a new family of backbones called LowFormer, which achieves a speedup in throughput and latency while reaching similar or better accuracy than current state-of-the-art efficient backbones. To show that the design generalizes, the authors evaluate on GPU, mobile GPU and ARM CPU, and they report that the downstream tasks of object detection and semantic segmentation also benefit from the architecture.
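The gap between MACs and real speed can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not taken from the paper: it compares a standard 3x3 convolution with a depthwise 3x3 convolution on the same feature map, counting MACs and the number of tensor elements that must be read or written. The names `ConvCost` and `conv_cost`, the 56x56x64 feature map, and the simplified memory model (each input, weight and output element touched exactly once, no caching or kernel fusion) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConvCost:
    macs: int       # multiply-accumulate operations
    mem_elems: int  # elements read or written (input + weights + output)

    @property
    def macs_per_elem(self) -> float:
        # Rough arithmetic intensity: work done per element moved.
        return self.macs / self.mem_elems


def conv_cost(h: int, w: int, c_in: int, c_out: int, k: int, groups: int = 1) -> ConvCost:
    """Estimate cost of a stride-1, same-padded k x k convolution (hypothetical helper)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    per_output_macs = (c_in // groups) * k * k
    macs = h * w * c_out * per_output_macs
    weights = c_out * per_output_macs
    mem_elems = h * w * c_in + weights + h * w * c_out
    return ConvCost(macs, mem_elems)


# Same feature map, same kernel size; only the grouping differs.
standard = conv_cost(56, 56, 64, 64, 3)
depthwise = conv_cost(56, 56, 64, 64, 3, groups=64)

print("MAC ratio (standard / depthwise):", standard.macs / depthwise.macs)
print("memory ratio (standard / depthwise):", standard.mem_elems / depthwise.mem_elems)
print("MACs per element moved:", standard.macs_per_elem, "vs", depthwise.macs_per_elem)
```

Under these assumptions the depthwise layer needs 64 times fewer MACs but moves almost the same amount of data, so on memory-bound hardware it would not be expected to run anywhere near 64 times faster. This is one common way a MAC count can mislead; measured throughput and latency on the target device remain the reliable check.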

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding

A High Performance Reconfigurable Hardware Architecture for Lightweight Convolutional Neural Network

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

Effnet: An Efficient Structure for Convolutional Neural Networks

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Efficient Low-rank Backpropagation for Vision Transformer Adaptation

SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory

Rethinking Vision Transformers for MobileNet Size and Speed

A Reconfigurable Spatial Architecture for Energy-Efficient Inception Neural Networks

big.LITTLE Vision Transformer for Efficient Visual Recognition

Vision Transformer Computation and Resilience for Dynamic Inference

InceptionNeXt: When Inception Meets ConvNeXt

GhostNetV2: Enhance Cheap Operation with Long-Range Attention